Topic Modeling Bibliography

My peronal bibliography for topic modeling, with some notes about how the paper is useful to me.

Topic modeling pre-LDA

Paper Landauer97

Paper Deerwester90

LSA (or LSI) is a linear topic model based the factorization of the document-word matrix $\boldsymbol X$, where $x_{dw}$ is the count of occurrences of word $w$ in document $d$. The goal is to find a low-rank approximation $\tilde{\boldsymbol X}$ of $\boldsymbol X$ factorizing it into two matrices, one representing the documents, and the other the topics.

Use SVD on $ \boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T$. By selecting only the $K$ largest singular values from $\boldsymbol \Sigma$ and the corresponding vectors in $\boldsymbol U$ and $\boldsymbol V^T$, we get the best rank $K$ approximation of matrix $\boldsymbol X$. Rows of $\boldsymbol U$ represent documents, and rows of $\boldsymbol V^T$ represents the topics. Each document can be expressed as a linear combination of topics.

Paper Hofmann99

Introduced pLSI, a probabilistic topic model based on the following generative process. No priors on $\boldsymbol \theta_d$ and $\boldsymbol \phi_k$.

for document d = 1..D:
  for position i = 1..N in document d:
    Draw a topic z_di from Discrete(theta_d)
    Draw a word w_di from Discrete(phi_z_di)

Paper Ding08

Proved the equivalence between pLSI and NMF, by showing that they both optimize the same objective function. As they are different algorithms, this allow to design an hybrid method alternating between NMF and pLSI, every time jumping out of the local optimum of the other method.


Chronologically, Blei, Ng and Jordan first published [Blei02] presenting LDA in NIPS treating topics $\boldsymbol \phi_k$ as free parameters. Shortly after, Griffiths and Steyvers published [Griffiths02a] and [Griffiths02b] extending this model by adding a symmetric Dirichlet prior on $\boldsymbol \phi_k$. Finally, Blei, Ng and Jordan published an extended version [Blei03] of their first paper in Journal of Machine Learning Research (by far the most cited LDA paper) with a section on having this Dirichlet smoothing on multinomial parameters $\boldsymbol \phi_k$.

Paper Blei02

First first paper for LDA, quite short, not used. See [Blei03].

Paper Griffiths02a

Derive a Gibbs sampler for LDA and introduce Dirichlet prior on topics $\boldsymbol \phi_k$.

Paper Griffiths02b

Almost the same paper as [Griffiths02a].

Paper Blei03

Most cited paper for LDA, extended version of [Blei02].

Paper Griffiths04

Less technical paper showing application of LDA to several datasets.

Paper Heinrich04

Heavily detailed tutorial about LDA, and inference using Gibbs sampling.

Paper Steyvers06

LDA has been around for 3 years, they give an in-depth review and analysis of probabilistic models, full of deep insights. hey propose measure capturing similarity between topics (KL, KLsym, JS, cos, L1, L2), between a set of words and documents, and between words.

Paper Blei12

A short, high-level review on topic models. Not technical.

Paper Hoffman10

Present an online version of the Variational EM algorithm introduced in [Blei03].

Evaluation of topic models

Paper Wei06

As an extrinsic evaluation method of topics, used discovered topics for information retrieval.

Paper Chang09

Shown that surprisingly predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and even sometimes slightly anti-correlated.

They ran a large scale experiment on the Amazon Turk platform. For each topic, they took the five top words of that topics and added a random sixth word. Then, they presented these lists of six words to people asking them which is the intruder word.

If all the people asked could tell which is the intruder, then we can conclude safely that the topic is good at describing an idea. If on the other hand, many people identified other words as the intruder, it means that they could not see the logic into the association of words, and we can conclude the topic was not good enough.

Paper Wallach09a

Gives tons of methods to compute approximations of the likelihood $p(\boldsymbol w_d | \boldsymbol \Phi, \alpha)$ for one unseen document, which is intractable but needed to evaluate topic models.

Paper Buntine09

Like [Wallach09a], gives method to estimate the likelihood of unseen documents.

Paper AlSumait09

Define measures based on three prototypes of junk and insignificant topics to rank discovered topics according to their significance score. The three junk prototypes are the uniform word-distribution, the empirical corpus word-distribution, and the uniform document-distribution: $$ p(w | \text{topic}) \propto 1 \qquad p(w | \text{topic}) \propto \text{count}(w \text{ in corpus}) \qquad p(d | \text{topic}) \propto 1 $$ Then the topic significance score is based on combinations of dissimilarities (KL divergence, cosine, and correlation) from those three junk prototypes. However

Paper Newman10c

Tries different coherence measure on different dataset to compare them.

Paper Mimno11b

Presents a Baysian methods measuring how well a topics model fits a corpus.

Topic coherence

Paper Newman10a

Introduced the UCI coherence measure $\sum_{i<j} \log \frac{p(w_i, w_j)}{p(w_i)p(w_j)}$ for $w_1$, ..., $w_{10}$ top words (based on PMI). The measure is extrinsic as it uses empirical probabilities from an external corpus such as Wikipedia.

Paper Mimno11a

Introduced the UMass coherence measure $\sum_{i<j} \log \frac{1+D(w_i, w_j)}{D(w_i)}$ for $w_1$, ..., $w_{10}$ top words (intrinsic measure).

Paper Stevens12

Explore computing the two coherence metrics UCI from [Newman10a] and UMass from [Mimno11a] on multiple datasets, for different number of topics. The aggregate results (computing average and entropy). They assume these two are good metrics and use them to compare different topic models: LDA, LSA+SVD, and LSA+NMF.

Interactive LDA

Paper Andr09

Make the discovery of topics semi-supervised where a user repeatedly gives orders on top words of discovered topics: "those words should be in the same topic", "those "rds should not be in the same topic", and "those "rds should be by themselves". Orders are encoded into pair-wise constraints on words: two words have to or can not be in the same topic. Then the model is trained again with a complex new prior encoding the constraints based on Dirichlet Forests.

Paper Hu11

Extended approach from [Andr09] proposing interactive topic modeling (ITM) where we don't have to start over the Gibbs sampler after each human action. Instead, the prior is updated in-place to incorporate the new constraints and the underlying model is changed and seen a starting position for a new Markov chain. Updating the model is done by state ablation; invalidate some topic-word assignments by setting $z = -1$. The counts are decremented accordingly

They explore several strategies of invalidation: invalidates all assignments, only of documents that have any of the terms constraints, only of the terms concerned, or none. After each human actions, the Gibbs sample runs for 30 more iterations before asking for human feedback again. Experiments have been done using Amazon Mechanical Turk.

Misc topic modeling

Paper Pauca04

Reference for successful use of NMF for topic modeling.

Paper Lee99

Reference for successful use of NMF for topic modeling.

Paper Doyle09

Elkan's paper about burstiness.

Paper Wallach09b

Study the effect of different priors on LDA output.

Paper Andr11

Use discovered topics in a search engine, use query expansion (like we do in Squid).

Paper Chang10

Presents a "manual" topic modeling where humans are asked to repeatedly tag documents and cluster these annotations.

Paper Chuang12

Define the distinctiveness $\mathcal{D}(w)$ of a word as the KL divergence between the topic distribution $p(k|w)$ given the word $w$ and the empirical topic distribution $p(k)$ of the corpus. Also, the distinctiveness of a word can be weighted by its frequency, this how we define the saliency $\mathcal{S}(w)$ of a word. $$ \mathcal{D}(w) = \sum_{\text{topic } k} p\left(k|w\right) \log \frac{p(k|w)}{p(k)} = \text{KL}\big(p(k|w) \ \Vert \ p(k)\big) \qquad \mathcal{S}(w) = p(w) \mathcal{D}(w) $$ Also present a new visualization of topics distributions based on a matrix of circles, and a word ordering such that topics span contiguous words.


Paper Campbell66

Reference for argument: given a distribution, the exponential entropy is a measure of the extent, or spread, of the distribution. For eg, measure how much a word is shared across several topics.

Paper Geman84

Introduced Gibbs sampling.

Paper Bottou98

Convergence of online algorihtms, gives the condition $\sum_{t=0}^\infty \rho_t^2 < \infty$ needed to prove the convergence of Online Variational LDA [Hoffman10].

Paper Lee00

Reference for NMF.

Paper Bishop06

Heavy book used a reference on Bayesian methods.

Paper Tzikas08

A step-by-step tutorial on EM algorithm, following closely the Bishop book [Bishop06]. They described the MAP as poor man's Bayesian inference as this is a way of including prior knowledge without having to pay the expensive price of computing the normalizer.

Paper Crump13

On using Amazon Turk for doing experiments. One experiment is about measuring the performance of users when varying the amount of the financial incentive, either \$2 and a bonus up to \$2.5 based on task performance, or \$0.75 with no bonus. Results show that the amount on the incentive does not effect the task performance but does effect the rate at which workers sign up for the task.

Quentin Pleplé
March 2013