The most common way to evaluate a probabilistic model is to measure the log-likelihood of a held-out test set.

This is usually done by splitting the dataset into two parts: one for training, the other for testing. For LDA, a test set is a collection of unseen documents $\boldsymbol w_d$, and the model is described by the topic matrix $\boldsymbol \Phi$ and the hyperparameter $\alpha$ governing the topic distributions of documents. The parameters $\boldsymbol \Theta$ are not taken into account: they represent the topic distributions of the training documents, and therefore play no role in the likelihood of unseen documents. We thus need to evaluate the log-likelihood $$ \mathcal L (\boldsymbol w) = \log p(\boldsymbol w | \boldsymbol \Phi, \alpha) = \sum_d \log p(\boldsymbol w_d | \boldsymbol \Phi, \alpha) $$ of a set of unseen documents $\boldsymbol w_d$, given the topics $\boldsymbol \Phi$ and the hyperparameter $\alpha$ governing the topic distribution $\boldsymbol \theta_d$ of each document. The likelihood of unseen documents can be used to compare models: a higher likelihood indicates a better model.

The measure traditionally used for topic models is the \textit{perplexity} of held-out documents $\boldsymbol w_d$ defined as $$ \text{perplexity}(\text{test set } \boldsymbol w) = \exp \left\{ - \frac{\mathcal L(\boldsymbol w)}{\text{count of tokens}} \right\} $$ which is a decreasing function of the log-likelihood $\mathcal L(\boldsymbol w)$ of the unseen documents $\boldsymbol w_d$; the lower the perplexity, the better the model.
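As a minimal sketch of the formula above, the following function computes the perplexity from per-document log-likelihoods and token counts. The function name and the argument names are hypothetical, and the log-likelihoods are assumed to use the natural logarithm and to have been obtained elsewhere.

```python
import math


def perplexity(doc_log_likelihoods, doc_token_counts):
    """Perplexity of a held-out set.

    doc_log_likelihoods: one value log p(w_d | Phi, alpha) per document
                         (natural log, assumed to be computed elsewhere).
    doc_token_counts:    number of tokens in each document.
    """
    total_log_likelihood = sum(doc_log_likelihoods)  # L(w) = sum_d log p(w_d | ...)
    total_tokens = sum(doc_token_counts)
    if total_tokens <= 0:
        raise ValueError("the test set must contain at least one token")
    return math.exp(-total_log_likelihood / total_tokens)
```

As a sanity check, a model that assigns probability $1/V$ to every token has a perplexity of exactly $V$, which is why perplexity is often read as an effective vocabulary size per token.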

However, the likelihood $p(\boldsymbol w_d | \boldsymbol \Phi, \alpha)$ of one document is intractable, which makes the evaluation of $\mathcal L(\boldsymbol w)$, and therefore the perplexity, intractable as well. [Wallach09a] derive various sampling methods to approximate this probability.
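To see where the difficulty comes from, the likelihood of one document can be written by marginalizing over its unknown topic distribution $\boldsymbol \theta_d$: $$ p(\boldsymbol w_d | \boldsymbol \Phi, \alpha) = \int p(\boldsymbol \theta_d | \alpha) \prod_n \sum_k \theta_{dk} \, \phi_{k w_{dn}} \, d\boldsymbol \theta_d, $$ where $w_{dn}$ denotes the $n$-th token of document $d$. The sketch below approximates this integral with the simplest possible estimator: draw $\boldsymbol \theta_d$ from its Dirichlet prior and average the resulting conditional likelihoods. It is only an illustration of the sampling idea, not necessarily one of the estimators recommended in [Wallach09a]; the function name, the symmetric scalar `alpha`, and the list-of-lists layout of `phi` are assumptions made for this example.

```python
import math
import random


def log_likelihood_mc(doc, phi, alpha, num_samples=1000, seed=0):
    """Naive Monte Carlo estimate of log p(doc | phi, alpha).

    doc:   list of word ids (one entry per token)
    phi:   phi[k][w] = p(w | k), one row per topic, entries assumed > 0
    alpha: symmetric Dirichlet concentration (scalar; an assumption here)
    """
    rng = random.Random(seed)
    num_topics = len(phi)
    log_weights = []
    for _ in range(num_samples):
        # theta ~ Dirichlet(alpha): normalise independent Gamma(alpha, 1) draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
        total = sum(gammas)
        theta = [g / total for g in gammas]
        # log p(doc | theta, phi) = sum_n log sum_k theta_k * phi[k][w_n]
        log_p = sum(
            math.log(sum(theta[k] * phi[k][w] for k in range(num_topics)))
            for w in doc
        )
        log_weights.append(log_p)
    # Average the sampled likelihoods in log space (log-mean-exp) for stability.
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights) / num_samples)
```

Sampling from the prior is easy to implement, but the estimate tends to have high variance for long documents, because few prior draws land near the topic mixtures that actually explain the document. This is one motivation for more elaborate samplers.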

[Chang09] have shown that, surprisingly, predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and even sometimes slightly anti-correlated.

They ran a large-scale experiment on the Amazon Mechanical Turk platform. For each topic, they took the five most probable words of that topic (ranked by $p(w|k) = \phi_{kw}$) and added a random sixth word. They then presented these lists of six words to participants and asked them to identify the intruder word.
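The construction of one such list can be sketched as follows. This follows the simplified description given here, with the intruder drawn uniformly among the words outside the top five; the original study may have selected intruders more carefully. The function name and its arguments are hypothetical.

```python
import random


def build_intrusion_task(phi_k, vocab, num_top=5, rng=None):
    """Build one word-intrusion list for a single topic.

    phi_k: list with phi_k[w] = p(w | k) for the topic, aligned with vocab
    vocab: list of word strings (must contain more than num_top words)
    Returns (shuffled list of num_top + 1 words, the intruder word).
    """
    if len(vocab) <= num_top:
        raise ValueError("vocabulary too small to pick an intruder")
    rng = rng or random.Random()
    # Word ids sorted by decreasing probability within the topic.
    ranked = sorted(range(len(vocab)), key=lambda w: phi_k[w], reverse=True)
    top = ranked[:num_top]
    # Assumption: the intruder is uniform over all words outside the top list.
    intruder = rng.choice(ranked[num_top:])
    words = [vocab[w] for w in top] + [vocab[intruder]]
    rng.shuffle(words)  # hide the intruder's position from participants
    return words, vocab[intruder]
```

Shuffling matters: if the intruder always appeared last, participants could find it without reading the words.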

If every participant could identify the intruder, we could conclude that the topic describes a coherent idea well. If, on the other hand, many participants picked one of the topic's top five words as the intruder, they could not see the logic in the association of words, and we can conclude that the topic was not good enough.
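One simple way to turn the participants' answers into a score per topic is the fraction of participants who picked the true intruder. The sketch below assumes the answers are collected as a list of chosen words; the function name is hypothetical and the score is an illustration, not necessarily the exact statistic reported in [Chang09].

```python
def intruder_detection_rate(choices, intruder):
    """Fraction of participants whose chosen word is the true intruder.

    choices:  list of words, one per participant
    intruder: the word that was actually inserted into the list
    """
    if not choices:
        raise ValueError("need at least one participant response")
    return sum(choice == intruder for choice in choices) / len(choices)
```

A rate close to 1 suggests that the topic's top words form a recognizable group, while a rate close to $1/6$ is what random guessing among six words would produce.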

It is important to understand what this experiment actually shows. The result indicates that, given a topic, the five words with the largest probability $p(w|k) = \phi_{kw}$ within that topic often fail to describe one coherent idea, or at least not well enough for participants to recognize an intruder.