Word Distinctiveness And Saliency

To find the most informative words in a corpus, [Chuang12] first define the distinctiveness $$ \mathcal{D}(w) = \sum_{k} p\left(k|w\right) \log \frac{p(k|w)}{p(k)} = \text{KL}\big(p(k|w) \ \Vert \ p(k)\big) $$ of a word $w$ as the Kullback–Leibler (KL) divergence between the topic distribution $p(k|w)$ given the word $w$ and the marginal topic distribution $p(k)$, the probability that a randomly chosen word was drawn from topic $k$. Distinctiveness measures how specific a word is to a few topics: the higher the distinctiveness, the less the word is shared across topics. A word whose topic distribution matches the marginal, $p(k|w) = p(k)$ for every $k$, has zero distinctiveness.
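
As a minimal sketch of this definition, the snippet below computes $\mathcal{D}(w)$ for a toy model. It assumes the model is given as a table of $p(w|k)$ and a vector $p(k)$, and recovers $p(k|w)$ with Bayes' rule; the function name `distinctiveness`, the variable names and the toy numbers are illustrative assumptions, not taken from [Chuang12].

```python
import math

def distinctiveness(p_w_given_k, p_k):
    """Return D(w) for every word (hypothetical helper, for illustration).

    p_w_given_k[k][w] is p(w|k), each row summing to 1;
    p_k[k] is the marginal topic probability p(k).
    """
    n_topics = len(p_k)
    n_words = len(p_w_given_k[0])
    result = []
    for w in range(n_words):
        # Marginal word probability: p(w) = sum_k p(w|k) p(k)
        p_w = sum(p_w_given_k[k][w] * p_k[k] for k in range(n_topics))
        d = 0.0
        for k in range(n_topics):
            # Bayes' rule: p(k|w) = p(w|k) p(k) / p(w)
            p_k_w = p_w_given_k[k][w] * p_k[k] / p_w if p_w > 0 else 0.0
            if p_k_w > 0:  # convention: 0 * log 0 = 0
                d += p_k_w * math.log(p_k_w / p_k[k])
        result.append(d)
    return result

# Toy model (made-up numbers): 2 equally likely topics, 3 words.
p_w_given_k = [
    [0.5, 0.5, 0.0],  # topic 0
    [0.5, 0.0, 0.5],  # topic 1
]
p_k = [0.5, 0.5]
d = distinctiveness(p_w_given_k, p_k)
print(d)
```

In this toy model, word 0 is equally likely under both topics, so its distinctiveness is zero, while words 1 and 2 each occur in a single topic and reach $\log 2$, the maximum possible with two equally likely topics.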

They then define the saliency $$ \mathcal{S}(w) = p(w) \mathcal{D}(w) $$ of a word $w$ by weighting its frequency $p(w)$ by its distinctiveness. Compared to ranking by frequency $p(w)$, ranking by saliency $p(w) \mathcal D(w)$ penalizes words shared across several topics, which have low distinctiveness, and boosts words that are good predictors of one topic, which have high distinctiveness.
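
To illustrate how the two rankings can differ, here is a small self-contained sketch that ranks a three-word vocabulary by frequency and by saliency. The vocabulary, the probabilities and the function name `saliency` are made up for the example; the model is again assumed to be given as $p(w|k)$ and $p(k)$.

```python
import math

def saliency(p_w_given_k, p_k):
    """Return (p(w), S(w)) for every word, with S(w) = p(w) * D(w).

    Hypothetical helper: p_w_given_k[k][w] is p(w|k), p_k[k] is p(k).
    """
    n_topics = len(p_k)
    n_words = len(p_w_given_k[0])
    freqs, scores = [], []
    for w in range(n_words):
        # Marginal word probability p(w)
        p_w = sum(p_w_given_k[k][w] * p_k[k] for k in range(n_topics))
        d = 0.0
        for k in range(n_topics):
            # p(k|w) by Bayes' rule, then one term of the KL divergence
            p_k_w = p_w_given_k[k][w] * p_k[k] / p_w if p_w > 0 else 0.0
            if p_k_w > 0:  # convention: 0 * log 0 = 0
                d += p_k_w * math.log(p_k_w / p_k[k])
        freqs.append(p_w)
        scores.append(p_w * d)
    return freqs, scores

# Made-up vocabulary and model: "the" is frequent in both topics,
# "gene" occurs only in topic 0, "stock" mostly in topic 1.
vocab = ["the", "gene", "stock"]
p_w_given_k = [
    [0.6, 0.3, 0.1],  # topic 0
    [0.6, 0.0, 0.4],  # topic 1
]
p_k = [0.5, 0.5]

freqs, scores = saliency(p_w_given_k, p_k)
by_frequency = [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -freqs[i])]
by_saliency = [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -scores[i])]
print("by frequency:", by_frequency)
print("by saliency: ", by_saliency)
```

Here "the" tops the frequency ranking but drops to last place by saliency, because its topic distribution equals the marginal and its distinctiveness is zero; "gene", the least frequent word, comes first because it predicts topic 0 with certainty.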

Quentin Pleplé
November 2013