On Smoothing In Topic Coherence Measures

[Stevens12] performed an extensive study of these two measures on a single dataset of 93k New York Times articles from 2003 (with a vocabulary of 36k tokens after removing those occurring fewer than 200 times in the corpus), in order to compare three models: LDA, LSA with SVD, and LSA with NMF.

However, they did not use the formulations of the measures used by their original authors.

UMass measure

For the UMass measure, they replaced the fixed smoothing count of one with a free parameter $\varepsilon$ in the pairwise scoring function

\[\text{score}_{\text{UMass}}(w_i, w_j) = \log \frac{D(w_i, w_j) + \varepsilon}{D(w_i)}\]

and tried both $\varepsilon = 1$ and $\varepsilon = 10^{-12}$. Setting $\varepsilon = 10^{-12}$ seems to over-penalize pairs that never occur together, i.e. when $D(w_i, w_j) = 0$, as it lowers the score of such a pair by 12 (taking base-10 logarithms)

\[\text{score}_{\text{UMass}}(w_i, w_j) = \log \frac{\varepsilon}{D(w_i)} = -12 - \log D(w_i)\]

which is very large in the log space of document counts. Equivalently, we would have to see $10^{12}$ times more documents before expecting the two words to appear together even once. Setting $\varepsilon = 1$ looks more reasonable, as it treats pairs that appear once in the corpus, i.e. when $D(w_i, w_j) = 1$, and pairs that never appear, i.e. when $D(w_i, w_j) = 0$, as roughly the same order of magnitude.
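The gap between the two settings can be checked numerically. The following is a minimal sketch, assuming base-10 logarithms; the function name `umass_score` and the document count `d_i = 1000` are hypothetical and chosen only for illustration.

```python
import math

def umass_score(d_ij, d_i, eps=1.0):
    """Smoothed UMass pairwise score log10((D(w_i, w_j) + eps) / D(w_i))."""
    return math.log10((d_ij + eps) / d_i)

d_i = 1000  # hypothetical number of documents containing w_i

# Gap between a pair seen once and a pair never seen, for each epsilon.
gap_tiny = umass_score(1, d_i, eps=1e-12) - umass_score(0, d_i, eps=1e-12)
gap_one = umass_score(1, d_i, eps=1.0) - umass_score(0, d_i, eps=1.0)

print(f"eps = 1e-12: gap = {gap_tiny:.3f}")
print(f"eps = 1:     gap = {gap_one:.3f}")
```

With $\varepsilon = 10^{-12}$ a single co-occurrence moves the score by about 12 orders of magnitude, whereas with $\varepsilon = 1$ it moves it by $\log_{10} 2$.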

UCI measure

For the UCI measure, they introduced a smoothing parameter $\varepsilon$ in the pairwise score, which the original authors did not use

\[\text{score}_{\text{UCI}}(w_i, w_j) = \log \frac{p(w_i, w_j) + \varepsilon}{p(w_i)p(w_j)}\]

with $\varepsilon$ initially set to one. Here the scale is different, as we are dealing with probabilities rather than counts. Since $\varepsilon = 1$ is likely to be huge compared to $p(w_i, w_j)$, the smoothing parameter artificially inflates the topic coherence, and not even by the same amount for every pair of words. Switching to $\varepsilon = 10^{-12}$, which is likely to be much smaller than $p(w_i, w_j)$, changed the results substantially, and the authors therefore concluded that coherence measures depend heavily on the smoothing parameter $\varepsilon$.
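To illustrate the uneven inflation, here is a small sketch, assuming base-10 logarithms and empirical probabilities computed as document counts divided by the corpus size. The corpus size of 93k is the one quoted above; the function name `uci_score` and the per-pair counts are made up for the example.

```python
import math

N_DOCS = 93_000  # corpus size quoted above; the counts below are hypothetical

def uci_score(d_ij, d_i, d_j, eps, n_docs=N_DOCS):
    """UCI pairwise score with additive smoothing eps on the joint probability."""
    p_ij = d_ij / n_docs
    p_i = d_i / n_docs
    p_j = d_j / n_docs
    return math.log10((p_ij + eps) / (p_i * p_j))

# (D(w_i, w_j), D(w_i), D(w_j)) for two hypothetical word pairs
pairs = {"frequent pair": (50, 500, 400), "rare pair": (2, 300, 250)}

inflation = {}
for name, (d_ij, d_i, d_j) in pairs.items():
    with_one = uci_score(d_ij, d_i, d_j, eps=1.0)
    with_tiny = uci_score(d_ij, d_i, d_j, eps=1e-12)
    inflation[name] = with_one - with_tiny
    print(f"{name}: eps=1 -> {with_one:.3f}, eps=1e-12 -> {with_tiny:.3f}")
```

Under these assumed counts, $\varepsilon = 1$ raises the score of both pairs, and it raises the rarely co-occurring pair by more than the frequently co-occurring one.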

Nonetheless, a smoothing parameter is required to avoid taking the logarithm of zero. A reasonable choice is to assume that every pair of words occurs together at least once in the corpus, and to compute the empirical probability

\[p(w_i, w_j) = \frac{D_{\text{Wikipedia}}(w_i, w_j) + 1}{D_{\text{Wikipedia}}}\]

which is the same smoothing as in the UMass measure.
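As a rough sketch of this add-one smoothing applied to the UCI score: the code below assumes base-10 logarithms and unsmoothed marginals $p(w_i) = D(w_i) / D$, which the text does not specify. The function names, the corpus size, and the counts are all hypothetical.

```python
import math

def smoothed_joint_prob(d_ij, n_docs):
    """Empirical joint probability with one co-occurrence added to the count."""
    return (d_ij + 1) / n_docs

def uci_score_add_one(d_ij, d_i, d_j, n_docs):
    """UCI pairwise score using the add-one smoothed joint probability."""
    p_ij = smoothed_joint_prob(d_ij, n_docs)
    # Assumption: marginal probabilities are left unsmoothed.
    p_i = d_i / n_docs
    p_j = d_j / n_docs
    return math.log10(p_ij / (p_i * p_j))

n_docs = 1_000_000  # hypothetical reference corpus size
never = uci_score_add_one(0, 2000, 1500, n_docs)  # pair never seen together
once = uci_score_add_one(1, 2000, 1500, n_docs)   # pair seen together once
gap = once - never

print(f"never: {never:.3f}, once: {once:.3f}, gap: {gap:.3f}")
```

Smoothing the count rather than the probability keeps the score finite for unseen pairs while separating them from pairs seen once by only $\log_{10} 2$, as with the UMass measure and $\varepsilon = 1$.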