Word Distinctiveness And Saliency

To find the most informative words in a corpus, [Chuang12] first define the distinctiveness

\[\mathcal{D}(w) = \sum_{k} p\left(k|w\right) \log \frac{p(k|w)}{p(k)} = \text{KL}\big(p(k|w) \ \Vert \ p(k)\big)\]

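of a word $w$ as the Kullback–Leibler (KL) divergence between the topic distribution $p(k|w)$ given the word $w$ and the marginal topic distribution $p(k)$, the probability that a randomly chosen word was drawn from topic $k$. Distinctiveness measures how strongly a word is tied to particular topics rather than shared across them. It is zero when $p(k|w) = p(k)$ for every topic $k$, that is, when observing the word says nothing about the topic, and it grows as the word concentrates in fewer topics. The higher the distinctiveness, the less the word is shared across topics.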
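As an illustration, the definition can be evaluated directly from a topic model's topic-word distributions. The following is a minimal sketch, not the procedure of [Chuang12]: the function name `distinctiveness`, the inputs `p_w_given_k` (one row of $p(w|k)$ per topic) and `p_k`, and the toy numbers are all assumptions made for this example. It obtains $p(k|w)$ from Bayes' rule and assumes every word has nonzero marginal probability.

```python
import math

def distinctiveness(p_w_given_k, p_k):
    """Return D(w) for every word.

    p_w_given_k[k][w] is p(w|k); p_k[k] is the marginal topic probability.
    Both inputs are hypothetical; assumes p(w) > 0 for every word.
    """
    n_topics = len(p_k)
    n_words = len(p_w_given_k[0])
    result = []
    for w in range(n_words):
        # Marginal word probability: p(w) = sum_k p(w|k) p(k)
        p_w = sum(p_w_given_k[k][w] * p_k[k] for k in range(n_topics))
        d = 0.0
        for k in range(n_topics):
            # Bayes' rule: p(k|w) = p(w|k) p(k) / p(w)
            p_k_given_w = p_w_given_k[k][w] * p_k[k] / p_w
            if p_k_given_w > 0:  # the term 0 * log 0 is taken as 0
                d += p_k_given_w * math.log(p_k_given_w / p_k[k])
        result.append(d)
    return result

# Toy model: two equally likely topics, three words.
topic_word = [
    [0.5, 0.5, 0.0],  # p(w | k=0)
    [0.5, 0.0, 0.5],  # p(w | k=1)
]
topic_prior = [0.5, 0.5]
scores = distinctiveness(topic_word, topic_prior)
```

In this toy model the first word is equally likely under both topics, so its distinctiveness is 0, while each of the other two words occurs in a single topic and gets the maximum possible value for two equally likely topics, $\log 2$.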

They then define the saliency

\[\mathcal{S}(w) = p(w) \mathcal{D}(w)\]

of a word $w$ by weighting its frequency $p(w)$ by its distinctiveness. Compared with ranking by frequency $p(w)$ alone, ranking by saliency $p(w) \mathcal{D}(w)$ penalizes words shared across several topics, which have low distinctiveness, and boosts words that are good predictors of a single topic, which have high distinctiveness.
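The effect on the ranking can be seen on a small example. The sketch below is illustrative only: the helper `saliency`, the three-word vocabulary, and the probabilities are invented for this example, and the code assumes every word has nonzero marginal probability.

```python
import math

def saliency(p_w_given_k, p_k):
    """Return a list of (p(w), D(w), S(w)) tuples, one per word.

    p_w_given_k[k][w] is p(w|k); p_k[k] is the marginal topic probability.
    Both inputs are hypothetical; assumes p(w) > 0 for every word.
    """
    n_topics = len(p_k)
    rows = []
    for w in range(len(p_w_given_k[0])):
        # Word frequency: p(w) = sum_k p(w|k) p(k)
        p_w = sum(p_w_given_k[k][w] * p_k[k] for k in range(n_topics))
        d = 0.0
        for k in range(n_topics):
            post = p_w_given_k[k][w] * p_k[k] / p_w  # p(k|w) by Bayes' rule
            if post > 0:  # the term 0 * log 0 is taken as 0
                d += post * math.log(post / p_k[k])
        rows.append((p_w, d, p_w * d))  # S(w) = p(w) * D(w)
    return rows

vocab = ["the", "gene", "stock"]  # made-up vocabulary
topic_word = [
    [0.6, 0.3, 0.1],  # p(w | k=0)
    [0.6, 0.1, 0.3],  # p(w | k=1)
]
topic_prior = [0.5, 0.5]

rows = saliency(topic_word, topic_prior)
# Sort word indices by descending frequency and by descending saliency.
by_frequency = sorted(range(len(vocab)), key=lambda w: -rows[w][0])
by_saliency = sorted(range(len(vocab)), key=lambda w: -rows[w][2])
words_by_frequency = [vocab[w] for w in by_frequency]
words_by_saliency = [vocab[w] for w in by_saliency]
```

Here "the" is the most frequent word, but it is equally likely under both topics, so its distinctiveness and therefore its saliency are zero and it falls to the bottom of the saliency ranking. The two less frequent words each lean toward one topic and move above it.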