Perplexity To Evaluate Topic Models
May 15th, 2013

The most common way to evaluate a probabilistic model is to measure the log-likelihood of a held-out test set.
This is usually done by splitting the dataset into two parts: one for training, the other for testing. For LDA, a test set is a collection of unseen documents, and the quantity to compute is the log-likelihood of that set of unseen documents under the trained model.
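As an illustration, here is a minimal sketch of that split-and-score procedure using scikit-learn. The toy documents and parameter values are placeholders, and scikit-learn's score() returns an approximate held-out log-likelihood (a variational bound), not the exact value.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Toy corpus: in practice this would be a real collection of documents.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market fell today",
    "investors sold shares amid market fears",
    "the dog chased the cat",
    "share prices rose on the stock exchange",
]

# Bag-of-words counts.
X = CountVectorizer().fit_transform(docs)

# Split: train on one part, keep the other part as unseen documents.
X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

# Fit LDA on the training documents only.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_train)

# Approximate log-likelihood of the held-out documents.
print("held-out log-likelihood:", lda.score(X_test))
```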
The measure traditionally used for topic models is the perplexity of held-out documents, which is a decreasing function of the log-likelihood of the unseen documents: the lower the perplexity, the better the model.
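A standard way to write this definition (the notation here is mine: $\mathcal{L}(\mathbf{w})$ is the log-likelihood of the held-out documents $\mathbf{w}$ and $N$ is the total number of tokens they contain) is

$$\mathrm{perplexity}(\mathbf{w}) = \exp\left\{ -\frac{\mathcal{L}(\mathbf{w})}{N} \right\},$$

so a higher held-out log-likelihood translates directly into a lower perplexity.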
However, the likelihood of unseen documents is intractable to compute exactly for LDA, which makes the perplexity intractable as well; in practice it can only be estimated approximately.
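For instance, scikit-learn exposes a perplexity() method that computes this quantity from its variational approximation of the likelihood; the snippet below reuses the hypothetical lda and X_test objects from the sketch above.

```python
import numpy as np

# Approximate perplexity of the held-out documents (lower is better).
print("held-out perplexity:", lda.perplexity(X_test))

# The same idea computed by hand from the approximate log-likelihood:
# exponentiate minus the per-token log-likelihood.
n_tokens = X_test.sum()
print("exp(-loglik / tokens):", np.exp(-lda.score(X_test) / n_tokens))
```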
Perplexity is not strongly correlated to human judgment
[Chang09] have shown that, surprisingly, predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and sometimes even slightly anti-correlated. They ran a large-scale experiment on the Amazon Mechanical Turk platform.
For each topic, they took the top five words (ordered by their frequency within the topic), added a random sixth word, and asked the participants to identify which of the six words was the intruder.
If every participant could identify the intruder, then we could conclude that the topic is good at describing an idea. If, on the other hand, many participants picked one of the topic's top five words as the intruder, it means that they could not see the logic in the association of words, and we can conclude the topic was not good enough.
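To make the task concrete, here is a small sketch of how such intruder lists could be built from a fitted topic-word matrix. The variable names (phi, vocab) and the intruder-selection heuristic are my own illustration, not the authors' code: each list keeps a topic's five most probable words and injects one word that is probable in some other topic but not among this topic's top words.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fitted model: phi[k, w] = p(word w | topic k); each row sums to 1.
vocab = np.array(["cat", "dog", "pet", "mat", "fur",
                  "stock", "market", "share", "price", "investor"])
phi = rng.dirichlet(np.ones(len(vocab)) * 0.1, size=2)

def intrusion_list(phi, topic, n_top=5):
    """Top words of one topic plus an intruder from another topic, shuffled."""
    top = np.argsort(phi[topic])[::-1][:n_top]              # five most probable words
    other = rng.choice([k for k in range(phi.shape[0]) if k != topic])
    candidates = np.argsort(phi[other])[::-1]                # probable elsewhere...
    intruder = next(w for w in candidates if w not in top)   # ...but not on top here
    words = list(vocab[top]) + [vocab[intruder]]
    rng.shuffle(words)
    return words, vocab[intruder]

words, intruder = intrusion_list(phi, topic=0)
print("Which word does not belong?", words)
print("Intruder was:", intruder)
```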
It’s important to understand what this experiment proves.
The result proves that, given a topic, the five words that have the largest frequency