Using PMI For Topic Relevance

One can ask whether increasing the PMI score is always a good thing. The major issue with PMI is that it overestimates the association between low-frequency events. A high PMI may therefore reflect not a strong word correlation but merely low-frequency words.

Example

For instance, the PMI is maximal when $w_i$ and $w_j$ always occur together:

\[D(w_i) = D(w_j) = D(w_i, w_j) = n\]

where $n$ is the number of documents in which they occur. Their PMI is then

\[\text{PMI}(w_i, w_j) = \log \frac{\frac n D}{\frac n D \cdot \frac n D} = \log D - \log n\]

where $D$ is the total number of documents. So for the same perfect predictive power of one word given the other, the PMI is zero if the words are present in all documents ($n = D$), but $\log D$ if they appear in only one document ($n = 1$).
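The two extreme cases can be checked numerically. The following is a minimal sketch, assuming probabilities are estimated as document counts divided by $D$ and that the natural logarithm is used; the helper name `pmi_from_counts` and the corpus size of 1000 documents are illustrative choices, not part of the original analysis.

```python
import math

def pmi_from_counts(d_i, d_j, d_ij, total_docs):
    """PMI of two words estimated from document counts (natural log).

    d_i, d_j: number of documents containing each word
    d_ij: number of documents containing both words
    total_docs: total number of documents D
    """
    p_i = d_i / total_docs
    p_j = d_j / total_docs
    p_ij = d_ij / total_docs
    return math.log(p_ij / (p_i * p_j))

D = 1000  # hypothetical corpus size

# Two words that always co-occur and appear in every document (n = D).
pmi_everywhere = pmi_from_counts(D, D, D, D)

# Two words that always co-occur but appear in a single document (n = 1).
pmi_once = pmi_from_counts(1, 1, 1, D)

print(pmi_everywhere)          # equals log D - log D
print(pmi_once, math.log(D))   # the two values agree up to rounding
```

Both pairs are perfectly correlated, yet the rare pair gets a score of about $\log D$ while the ubiquitous pair gets zero.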

Alternatives

Some alternatives have been developed to work around this issue. One is to use variants of PMI such as the Weighted PMI [Schneider05]:

\[p(w_i, w_j) \log \frac{p(w_i, w_j)}{p(w_i)p(w_j)}\]

or giving more weight to the joint probability:

\[\log \frac{p(w_i, w_j)^2}{p(w_i)p(w_j)} \qquad \text{or} \qquad \log \frac{p(w_i, w_j)^3}{p(w_i)p(w_j)}\]
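To see how these variants behave on the perfectly co-occurring pair from the example, here is a small sketch. It assumes the same count-based probability estimates, so that $p(w_i) = p(w_j) = p(w_i, w_j) = n/D$; the function names `pmi_k` and `weighted_pmi` are hypothetical and introduced only for illustration.

```python
import math

def pmi_k(p_ij, p_i, p_j, k=1):
    """log(p_ij^k / (p_i * p_j)); k=1 is plain PMI, k=2 and k=3 are the variants."""
    return k * math.log(p_ij) - math.log(p_i) - math.log(p_j)

def weighted_pmi(p_ij, p_i, p_j):
    """Plain PMI multiplied by the joint probability."""
    return p_ij * pmi_k(p_ij, p_i, p_j, k=1)

D = 1000  # hypothetical corpus size
for n in (1, 500):
    p = n / D  # the words always co-occur, so all three probabilities are n / D
    print(
        f"n={n}: PMI={pmi_k(p, p, p, 1):.3f}, "
        f"k=2: {pmi_k(p, p, p, 2):.3f}, "
        f"k=3: {pmi_k(p, p, p, 3):.3f}, "
        f"weighted: {weighted_pmi(p, p, p):.4f}"
    )
```

For such a pair, the squared variant reduces to $\log 1 = 0$ whatever $n$ is, the cubed variant reduces to $\log(n/D)$ and so grows with frequency, and the weighted variant reduces to $-\frac{n}{D}\log\frac{n}{D}$, which in this comparison is much smaller for the pair seen once than for the pair seen in half of the documents. Plain PMI orders the two pairs the other way around.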

But the most common alternative is to heuristically choose a frequency threshold and ignore low-frequency events [Pantel02].
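A frequency threshold of this kind could be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the cited work: documents are lists of tokens, frequency means document frequency, and the function name `pmi_with_threshold`, the parameter `min_df`, and the toy corpus are all made up for the example.

```python
import math
from itertools import combinations

def pmi_with_threshold(docs, min_df=2):
    """Return {(w_i, w_j): PMI} for co-occurring pairs whose words each
    appear in at least min_df documents."""
    total = len(docs)
    df = {}       # document frequency of each word
    pair_df = {}  # document frequency of each (sorted) word pair
    for doc in docs:
        words = sorted(set(doc))
        for w in words:
            df[w] = df.get(w, 0) + 1
        for pair in combinations(words, 2):
            pair_df[pair] = pair_df.get(pair, 0) + 1

    scores = {}
    for (wi, wj), n_ij in pair_df.items():
        # Skip pairs involving a word below the frequency threshold.
        if df[wi] < min_df or df[wj] < min_df:
            continue
        # (n_ij / D) / ((n_i / D) * (n_j / D)) simplifies to n_ij * D / (n_i * n_j).
        scores[(wi, wj)] = math.log(n_ij * total / (df[wi] * df[wj]))
    return scores

docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["cat", "pet"],
    ["dog", "pet"],
    ["axolotl", "zebrafish"],  # two words seen only once, together
]

unfiltered = pmi_with_threshold(docs, min_df=1)
filtered = pmi_with_threshold(docs, min_df=2)
print(max(unfiltered, key=unfiltered.get))  # the rare pair has the highest PMI
print(("axolotl", "zebrafish") in filtered)
```

Without the threshold, the pair that appears in a single document has the highest PMI in this toy corpus; with `min_df=2` it is excluded from scoring altogether.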

PMI for top words

In our case, we are not actually computing the PMI for every pair of words but only for the top ten words, ranked either by frequency or by topic relevance. This is close to setting a threshold on frequency: we only consider frequent words. The argument also holds for word relevance: even though we penalize some high-frequency words (the background words), a word scoring high on relevance $p(w|k) e^{-H_w}$ will also have a high frequency $p(w|k)$.

So by working only with high-frequency words, a high PMI is more likely to reflect a high correlation than an artifact of rare events.
