Topic Modeling Bibliography

🦖 This post was published in 2013 and is most likely outdated.

My personal bibliography for topic modeling, with some notes about how the paper is useful to me.

Topic modeling pre-LDA

Deerwester90 Deerwester, S., S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391–407.

LSA (or LSI) is a linear topic model based the factorization of the document-word matrix $\boldsymbol X$, where $x_{dw}$ is the count of occurrences of word $w$ in document $d$. The goal is to find a low-rank approximation $\tilde{\boldsymbol X}$ of $\boldsymbol X$ factorizing it into two matrices, one representing the documents, and the other the topics.

Use SVD on $ \boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T$. By selecting only the $K$ largest singular values from $\boldsymbol \Sigma$ and the corresponding vectors in $\boldsymbol U$ and $\boldsymbol V^T$, we get the best rank $K$ approximation of matrix $\boldsymbol X$. Rows of $\boldsymbol U$ represent documents, and rows of $\boldsymbol V^T$ represents the topics. Each document can be expressed as a linear combination of topics.

Hofmann99 Hofmann, T. (1999). Probilistic latent semantic analysis. In UAI.

Introduced pLSI, a probabilistic topic model based on the following generative process. No priors on $\boldsymbol \theta_d$ and $\boldsymbol \varphi_k$.

for document d = 1..D:
  for position i = 1..N in document d:
    Draw a topic z_di from Discrete(theta_d)
    Draw a word w_di from Discrete(phi_z_di)

Ding08 Ding, C., T. Li, and W. Peng (2008). On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics and Data Analysis 52, 3913–3927.

Proved the equivalence between pLSI and NMF, by showing that they both optimize the same objective function. As they are different algorithms, this allow to design an hybrid method alternating between NMF and pLSI, every time jumping out of the local optimum of the other method.


Chronologically, Blei, Ng and Jordan first published [Blei02] presenting LDA in NIPS treating topics $\boldsymbol \varphi_k$ as free parameters. Shortly after, Griffiths and Steyvers published [Griffiths02a] and [Griffiths02b] extending this model by adding a symmetric Dirichlet prior on $\boldsymbol \varphi_k$. Finally, Blei, Ng and Jordan published an extended version [Blei03] of their first paper in Journal of Machine Learning Research (by far the most cited LDA paper) with a section on having this Dirichlet smoothing on multinomial parameters $\boldsymbol \varphi_k$.

Blei02 Blei, D. M., A. Y. Ng, and M. I. Jordan (2002). Latent dirichlet allocation. In NIPS.

First first paper for LDA, quite short, not used. See [Blei03].

Griffiths02a Griffiths, T. and M. Steyvers (2002a). A probabilistic approach to semantic representation. In Proceedings of the 24th Annual Conference of the Cognitive Science Society.

Derive a Gibbs sampler for LDA and introduce Dirichlet prior on topics $\boldsymbol \varphi_k$.

Griffiths02b Griffiths, T. L. and M. Steyvers (2002b). Prediction and semantic association. In NIPS, pp. 11–18.

Almost the same paper as [Griffiths02a].

Blei03 Blei, D. M., A. Y. Ng, and M. I. Jordan (2003, March). Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.

Most cited paper for LDA, extended version of [Blei02].

Steyvers06 Steyvers, M. and T. Griffiths (2006). Probabilistic topic models. In T. Landauer, D. Mcna- mara, S. Dennis, and W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum.

LDA has been around for 3 years, they give an in-depth review and analysis of probabilistic models, full of deep insights. hey propose measure capturing similarity between topics (KL, KLsym, JS, cos, L1, L2), between a set of words and documents, and between words.

Blei12 Blei, D. M. (2012, April). Probabilistic topic models. Commun. ACM 55(4), 77–84.

A short, high-level review on topic models. Not technical.

Hoffman10 Hoffman, M., Bach, F., & Blei, D. (2010). Online learning for latent dirichlet allocation. advances in neural information processing systems, 23.

Present an online version of the Variational EM algorithm introduced in [Blei03].

Evaluation of topic models

Wei06 Wei, X. and B. Croft (2006). Lda-based document models for ad-hoc retrieval. In SIGIR.

As an extrinsic evaluation method of topics, used discovered topics for information retrieval.

Chang09 Chang, J., J. Boyd-Graber, C. Wang, S. Gerrish, and D. M. Blei (2009). Reading tea leaves: How humans interpret topic models. In NIPS.

Shown that surprisingly predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and even sometimes slightly anti-correlated.

They ran a large scale experiment on the Amazon Turk platform. For each topic, they took the five top words of that topics and added a random sixth word. Then, they presented these lists of six words to people asking them which is the intruder word.

If all the people asked could tell which is the intruder, then we can conclude safely that the topic is good at describing an idea. If on the other hand, many people identified other words as the intruder, it means that they could not see the logic into the association of words, and we can conclude the topic was not good enough.

Wallach09a Wallach, H. M., I. Murray, R. Salakhutdinov, and D. Mimno (2009). Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA, pp. 1105–1112. ACM.

Gives tons of methods to compute approximations of the likelihood $p(\boldsymbol w_d | \boldsymbol \Phi, \alpha)$ for one unseen document, which is intractable but needed to evaluate topic models.

AlSumait09 AlSumait, L., D. Barbara, J. Gentle, and C. Domeniconi (2009). Topic significance ranking of lda generative models. In ECML.

Define measures based on three prototypes of junk and insignificant topics to rank discovered topics according to their significance score. The three junk prototypes are the uniform word-distribution, the empirical corpus word-distribution, and the uniform document-distribution:

\[p(w | \text{topic}) \propto 1 \qquad p(w | \text{topic}) \propto \text{count}(w \text{ in corpus}) \qquad p(d | \text{topic}) \propto 1\]

Then the topic significance score is based on combinations of dissimilarities (KL divergence, cosine, and correlation) from those three junk prototypes. However

Topic coherence

Newman10a Newman, D., Y. Noh, E. Talley, S. Karimi, and T. Baldwin (2010). Evaluating topic models for digital libraries. In Proceedings of the 10th annual joint conference on Digital libraries, New York, NY, USA, pp. 215–224. ACM.

Introduced the UCI coherence measure $\sum_{i<j} \log \frac{p(w_i, w_j)}{p(w_i)p(w_j)}$ for $w_1$, …, $w_{10}$ top words (based on PMI). The measure is extrinsic as it uses empirical probabilities from an external corpus such as Wikipedia.

Mimno11a Mimno, D., H. Wallach, E. Talley, M. Leenders, and A. McCallum (2011). Optimizing

semantic coherence in topic models. In EMNLP.

Introduced the UMass coherence measure $\sum_{i<j} \log \frac{1+D(w_i, w_j)}{D(w_i)}$ for $w_1$, …, $w_{10}$ top words (intrinsic measure).

Stevens12 Stevens, K., P. Kegelmeyer, D. Andrzejewski, and D. Buttler (2012). Exploring topic coher- ence over many models and many topics. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’12, Stroudsburg, PA, USA, pp. 952–961. Association for Computational Linguistics.

Explore computing the two coherence metrics UCI from [Newman10a] and UMass from [Mimno11a] on multiple datasets, for different number of topics. The aggregate results (computing average and entropy). They assume these two are good metrics and use them to compare different topic models: LDA, LSA+SVD, and LSA+NMF.

Interactive LDA

Andr09 Andrzejewski, D., X. Zhu, and M. Craven (2009). Incorporating domain knowledge into topic modeling via dirichlet forest priors. In ICML, pp. 25–32.

Make the discovery of topics semi-supervised where a user repeatedly gives orders on top words of discovered topics: “those words should be in the same topic”, “those words should not be in the same topic”, and “those words should be by themselves”. Orders are encoded into pair-wise constraints on words: two words have to or can not be in the same topic. Then the model is trained again with a complex new prior encoding the constraints based on Dirichlet Forests.

Hu11 Hu, Y., J. Boyd-Graber, and B. Satinoff (2011). Interactive topic modeling. In Association

for Computational Linguistics.

Extended approach from [Andr09] proposing interactive topic modeling (ITM) where we don’t have to start over the Gibbs sampler after each human action. Instead, the prior is updated in-place to incorporate the new constraints and the underlying model is changed and seen a starting position for a new Markov chain. Updating the model is done by state ablation; invalidate some topic-word assignments by setting $z = -1$. The counts are decremented accordingly

They explore several strategies of invalidation: invalidates all assignments, only of documents that have any of the terms constraints, only of the terms concerned, or none. After each human actions, the Gibbs sample runs for 30 more iterations before asking for human feedback again. Experiments have been done using Amazon Mechanical Turk.

Misc topic modeling

Pauca04 Pauca, V. P., F. Shahnaz, M. W. Berry, and R. J. Plemmons (2004). Text mining using

non-negative matrix factorizations. In SDM.

Reference for successful use of NMF for topic modeling.

Lee99 Lee, D. D. and H. S. Seung (1999). Learning the parts of objects by non-negative matrix

factorization. Nature 401(6755).

Reference for successful use of NMF for topic modeling.

Wallach09b Wallach, H. M., I. Murray, R. Salakhutdinov, and D. Mimno (2009). Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA, pp. 1105–1112. ACM.

Study the effect of different priors on LDA output.

Chuang12 Chuang, J., C. D. Manning, and J. Heer (2012). Termite: Visualization techniques for assessing textual topic models. In Advanced Visual Interfaces.

Define the distinctiveness $\mathcal{D}(w)$ of a word as the KL divergence between the topic distribution $p(k|w)$ given the word $w$ and the empirical topic distribution $p(k)$ of the corpus. Also, the distinctiveness of a word can be weighted by its frequency, this how we define the saliency $\mathcal{S}(w)$ of a word.

\[\mathcal{D}(w) = \sum_{\text{topic } k} p\left(k|w\right) \log \frac{p(k|w)}{p(k)} = \text{KL}\big(p(k|w) \ \Vert \ p(k)\big) \qquad \mathcal{S}(w) = p(w) \mathcal{D}(w)\]

Also present a new visualization of topics distributions based on a matrix of circles, and a word ordering such that topics span contiguous words.


Campbell66 Campbell, L. L. (1966). Exponential entropy as a measure of extent of a distribution. In Zeitschrift fu ̈r Wahrscheinlichkeitstheorie und Verwandte Gebiete, Volume 5.

Reference for argument: given a distribution, the exponential entropy is a measure of the extent, or spread, of the distribution. For eg, measure how much a word is shared across several topics.

Geman84 Geman, S. and D. Geman (1984, November). Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6(6), 721–741.

Introduced Gibbs sampling.

Bishop06 Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Heavy book used a reference on Bayesian methods.

Tzikas08 Tzikas, D., A. Likas, and N. Galatsanos (2008, November). The variational approximation for Bayesian inference. IEEE Signal Processing Magazine 25(6), 131–146.

A step-by-step tutorial on EM algorithm, following closely the Bishop book [Bishop06]. They described the MAP as poor man’s Bayesian inference as this is a way of including prior knowledge without having to pay the expensive price of computing the normalizer.

Crump13 Crump, M. J. C., J. V. McDonnell, and T. M. Gureckis (2013, March). Evaluating Amazon’s Mechanical Turk as a Tool for Experimental Behavioral Research. PLoS ONE 8(3), e57410.

On using Amazon Turk for doing experiments. One experiment is about measuring the performance of users when varying the amount of the financial incentive, either $2 and a bonus up to $2.5 based on task performance, or $0.75 with no bonus. Results show that the amount on the incentive does not effect the task performance but does effect the rate at which workers sign up for the task.