My personal bibliography for topic modeling, with some notes about how each paper is useful to me.

`Landauer97`

`Deerwester90`

LSA (or LSI) is a linear topic model based on the factorization of the document-word matrix $\boldsymbol X$, where $x_{dw}$ is the count of occurrences of word $w$ in document $d$. The goal is to find a low-rank approximation $\tilde{\boldsymbol X}$ of $\boldsymbol X$ by factorizing it into two matrices, one representing the documents and the other the topics.

Use the SVD $\boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T$. By keeping only the $K$ largest singular values of $\boldsymbol \Sigma$ and the corresponding vectors in $\boldsymbol U$ and $\boldsymbol V^T$, we get the best rank-$K$ approximation of $\boldsymbol X$. Rows of $\boldsymbol U$ represent documents, and rows of $\boldsymbol V^T$ represent topics. Each document can then be expressed as a linear combination of topics.
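
A minimal sketch of this on a toy count matrix, using numpy's SVD (the matrix and $K$ are made up for illustration):

```
import numpy as np

X = np.array([[2, 1, 0, 0],   # doc 0: word counts
              [1, 3, 0, 1],   # doc 1
              [0, 0, 4, 2]])  # doc 2

K = 2  # number of topics to keep
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-K approximation: keep the K largest singular values.
X_tilde = U[:, :K] @ np.diag(S[:K]) @ Vt[:K, :]

doc_repr = U[:, :K] * S[:K]  # documents in topic space
topics = Vt[:K, :]           # each row: one topic over words
```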

`Hofmann99`

Introduced pLSI, a probabilistic topic model based on the following generative process. No priors on $\boldsymbol \theta_d$ and $\boldsymbol \phi_k$.

```
for document d = 1..D:
    for position i = 1..N in document d:
        Draw a topic z_di from Discrete(theta_d)
        Draw a word w_di from Discrete(phi_{z_di})
```
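
A runnable version of this generative process (a sketch in numpy; in pLSI, $\boldsymbol \theta_d$ and $\boldsymbol \phi_k$ are free parameters fit by EM, so the Dirichlet draws below are only a convenient way to fabricate them for the demo):

```
import numpy as np

rng = np.random.default_rng(0)
D, V, K, N = 5, 20, 3, 50   # documents, vocabulary size, topics, words/doc

# In pLSI these would be estimated, not drawn; fabricated here for the demo.
theta = rng.dirichlet(np.ones(K), size=D)   # theta[d]: topic mixture of doc d
phi = rng.dirichlet(np.ones(V), size=K)     # phi[k]: word distribution of topic k

docs = []
for d in range(D):
    doc = []
    for _ in range(N):
        z = rng.choice(K, p=theta[d])   # z_di ~ Discrete(theta_d)
        w = rng.choice(V, p=phi[z])     # w_di ~ Discrete(phi_{z_di})
        doc.append(int(w))
    docs.append(doc)
```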

`Ding08`

Proved the equivalence between pLSI and NMF by showing that they both optimize the same objective function. As they are different algorithms, this allows designing a hybrid method that alternates between NMF and pLSI, each time jumping out of the local optimum reached by the other method.

Chronologically, Blei, Ng and Jordan first published [Blei02] in *NIPS*, presenting LDA and treating the topics $\boldsymbol \phi_k$ as free parameters.
Shortly after, Griffiths and Steyvers published [Griffiths02a] and [Griffiths02b], extending this model by adding a symmetric Dirichlet prior on $\boldsymbol \phi_k$.
Finally, Blei, Ng and Jordan published an extended version [Blei03] of their first paper in the *Journal of Machine Learning Research* (by far the most cited LDA paper), with a section on placing this Dirichlet smoothing on the multinomial parameters $\boldsymbol \phi_k$.

`Blei02`

First paper on LDA, quite short; not used. See [Blei03].

`Griffiths02a`

Derives a Gibbs sampler for LDA and introduces a Dirichlet prior on the topics $\boldsymbol \phi_k$.
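
A minimal sketch of the collapsed update they derive, assuming the usual count matrices kept in sync with the assignments $z$ and symmetric scalar priors `alpha` and `beta` (all names here are mine, not theirs):

```
import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    # One sweep of collapsed Gibbs sampling for LDA.
    # docs: list of word-id lists; z: matching topic assignments;
    # n_dk (D,K), n_kw (K,V), n_k (K,): counts kept in sync with z.
    K, V = n_kw.shape
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1  # remove assignment
            # p(z = k | rest) proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1  # add new assignment
```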

`Griffiths02b`

Almost the same paper as [Griffiths02a].

`Blei03`

Most cited paper for LDA, extended version of [Blei02].

`Griffiths04`

Less technical paper showing applications of LDA to several datasets.

`Heinrich04`

Heavily detailed tutorial on LDA and inference using Gibbs sampling.

`Steyvers06`

Written once LDA had been around for 3 years; they give an in-depth review and analysis of probabilistic topic models, full of deep insights. They propose measures capturing similarity between topics (KL, KLsym, JS, cos, L1, L2), between sets of words and documents, and between words.
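
A sketch of the inter-topic measures they list, for two topic-word distributions assumed smoothed (no zero entries) so that KL is finite:

```
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def topic_similarities(p, q):
    # p, q: two smoothed topic-word distributions (strictly positive).
    m = (p + q) / 2
    return {
        "KL": kl(p, q),
        "KLsym": kl(p, q) + kl(q, p),
        "JS": (kl(p, m) + kl(q, m)) / 2,
        "cos": float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q))),
        "L1": float(np.sum(np.abs(p - q))),
        "L2": float(np.linalg.norm(p - q)),
    }
```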

`Blei12`

A short, high-level review on topic models. Not technical.

`Hoffman10`

Presents an online version of the variational EM algorithm introduced in [Blei03].
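
The core of the algorithm is a Robbins-Monro blend of the current variational parameters with a minibatch estimate; a sketch of just that step (the E-step producing `lam_hat` is omitted, and `tau0`/`kappa` follow the paper's learning-rate schedule, with `kappa` in (0.5, 1]):

```
def online_step(lam, lam_hat, t, tau0=1.0, kappa=0.7):
    # Blend the running variational parameters lam with the minibatch
    # estimate lam_hat; rho_t satisfies the conditions of [Bottou98].
    rho_t = (tau0 + t) ** (-kappa)
    return (1 - rho_t) * lam + rho_t * lam_hat
```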

`Wei06`

As an extrinsic evaluation of topics, uses the discovered topics for information retrieval.

`Chang09`

Shows that, surprisingly, predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and sometimes even slightly anti-correlated.

They ran a large-scale experiment on the Amazon Mechanical Turk platform. For each topic, they took its five top words and added a random sixth word. Then they presented these lists of six words to people, asking them which one is the intruder word.

If all the people asked could tell which word is the intruder, then we can safely conclude that the topic is good at describing an idea. If, on the other hand, many people identified other words as the intruder, it means that they could not see the logic in the association of words, and we can conclude that the topic was not good enough.
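
A sketch of how one such question could be built (my simplification: the intruder is a prominent word of another topic that is not prominent in the topic being judged):

```
import random

def intruder_question(topics, k, rng=random.Random(0)):
    # topics: one ranked word list per topic, e.g. topics[k][0] is the
    # top word of topic k. Returns six shuffled words plus the answer.
    top5 = topics[k][:5]
    other = rng.choice([j for j in range(len(topics)) if j != k])
    # Intruder: a top word of another topic, assumed absent from the
    # top 50 words of topic k (assumes long enough word lists).
    intruder = next(w for w in topics[other] if w not in topics[k][:50])
    words = top5 + [intruder]
    rng.shuffle(words)
    return words, intruder
```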

`Wallach09a`

Gives tons of methods to compute approximations of the likelihood $p(\boldsymbol w_d | \boldsymbol \Phi, \alpha)$ for one unseen document, which is intractable but needed to evaluate topic models.

`Buntine09`

Like [Wallach09a], gives methods to estimate the likelihood of unseen documents.

`AlSumait09`

Defines measures based on three prototypes of junk and insignificant topics, to rank discovered topics according to their significance score. The three junk prototypes are the uniform word distribution, the empirical corpus word distribution, and the uniform document distribution: $$ p(w | \text{topic}) \propto 1 \qquad p(w | \text{topic}) \propto \text{count}(w \text{ in corpus}) \qquad p(d | \text{topic}) \propto 1 $$ The topic significance score is then based on combinations of dissimilarities (KL divergence, cosine, and correlation) from these three junk prototypes.
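
A sketch of the word-based part of this score, computing only the KL dissimilarities from the two word prototypes (the paper also combines cosine and correlation, and the document prototype):

```
import numpy as np

def word_junk_dissimilarity(phi_k, corpus_counts):
    # phi_k: topic-word distribution of one topic (strictly positive);
    # corpus_counts: empirical word counts over the whole corpus.
    V = len(phi_k)
    uniform = np.full(V, 1.0 / V)
    empirical = corpus_counts / corpus_counts.sum()
    kl = lambda p, q: float(np.sum(p * np.log(p / q)))
    return {"KL_from_uniform": kl(phi_k, uniform),
            "KL_from_corpus": kl(phi_k, empirical)}
```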

`Newman10c`

Tries different coherence measures on different datasets to compare them.

`Mimno11b`

Presents a Bayesian method for measuring how well a topic model fits a corpus.

`Newman10a`

Introduced the UCI coherence measure $\sum_{i<j} \log \frac{p(w_i, w_j)}{p(w_i)p(w_j)}$ for the top words $w_1, \dots, w_{10}$ (a sum of PMIs). The measure is extrinsic, as it uses empirical probabilities from an external corpus such as Wikipedia.
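
A sketch of the measure, assuming the single and joint probabilities have been precomputed from the external corpus (the `eps` guard against $\log 0$ is mine):

```
from itertools import combinations
from math import log

def uci_coherence(top_words, p_single, p_joint, eps=1e-12):
    # p_single[w]: word probability in the external corpus (e.g. Wikipedia);
    # p_joint[(wi, wj)]: co-occurrence probability of the pair.
    return sum(log((p_joint.get((wi, wj), 0.0) + eps)
                   / (p_single[wi] * p_single[wj]))
               for wi, wj in combinations(top_words, 2))
```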

`Mimno11a`

Introduced the UMass coherence measure $\sum_{i<j} \log \frac{1+D(w_i, w_j)}{D(w_i)}$ for the top words $w_1, \dots, w_{10}$, where $D(w)$ is the number of documents containing $w$ and $D(w_i, w_j)$ the number containing both. The measure is intrinsic: the counts come from the training corpus itself.
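
A sketch of the measure, assuming precomputed document frequencies and co-document frequencies:

```
from itertools import combinations
from math import log

def umass_coherence(top_words, doc_freq, co_doc_freq):
    # top_words must be ordered by decreasing probability in the topic;
    # doc_freq[w]: documents containing w; co_doc_freq[(wi, wj)]: both.
    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):  # i < j
        wi, wj = top_words[i], top_words[j]
        score += log((1 + co_doc_freq.get((wi, wj), 0)) / doc_freq[wi])
    return score
```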

`Stevens12`

Explores computing the two coherence metrics, UCI from [Newman10a] and UMass from [Mimno11a], on multiple datasets and for different numbers of topics, then aggregates the results (computing average and entropy). They assume these two are good metrics and use them to compare different topic models: LDA, LSA+SVD, and LSA+NMF.

`Andr09`

Makes the discovery of topics semi-supervised: a user repeatedly gives instructions on the top words of discovered topics: "those words should be in the same topic", "those words should not be in the same topic", and "those words should be by themselves". The instructions are encoded into pair-wise constraints on words: two words have to, or cannot, be in the same topic. The model is then trained again with a complex new prior encoding the constraints, based on Dirichlet Forests.

`Hu11`

Extends the approach from [Andr09], proposing interactive topic modeling (ITM), where we don't have to restart the Gibbs sampler after each human action. Instead, the prior is updated in place to incorporate the new constraints, and the modified model is seen as the starting position of a new Markov chain. Updating the model is done by state ablation: some topic-word assignments are invalidated by setting $z = -1$, and the counts are decremented accordingly.

They explore several invalidation strategies: invalidate all assignments, only those of documents containing any of the constrained terms, only those of the constrained terms themselves, or none. After each human action, the Gibbs sampler runs for 30 more iterations before asking for human feedback again. Experiments were done using Amazon Mechanical Turk.
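
A sketch of the "terms only" ablation strategy, reusing the count matrices from the Gibbs sketch above (names are mine):

```
def ablate_terms(docs, z, n_dk, n_kw, n_k, terms):
    # Invalidate the assignments of the constrained terms only: set z = -1
    # and decrement the counts, so the next chain resamples these positions.
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            if w in terms and z[d][i] != -1:
                k = z[d][i]
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                z[d][i] = -1
```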

`Pauca04`

Reference for successful use of NMF for topic modeling.

`Lee99`

Reference for successful use of NMF for topic modeling.

`Doyle09`

Elkan's paper about burstiness.

`Wallach09b`

Study the effect of different priors on LDA output.

`Andr11`

Use discovered topics in a search engine for query expansion (like we do in Squid).

`Chang10`

Presents a "manual" topic modeling where humans are asked to repeatedly tag documents and cluster these annotations.

`Chuang12`

Defines the distinctiveness $\mathcal{D}(w)$ of a word as the KL divergence between the topic distribution $p(k|w)$ given the word $w$ and the empirical topic distribution $p(k)$ of the corpus. The distinctiveness of a word can also be weighted by its frequency; this is how the saliency $\mathcal{S}(w)$ of a word is defined. $$ \mathcal{D}(w) = \sum_{\text{topic } k} p\left(k|w\right) \log \frac{p(k|w)}{p(k)} = \text{KL}\big(p(k|w) \ \Vert \ p(k)\big) \qquad \mathcal{S}(w) = p(w) \mathcal{D}(w) $$ Also presents a new visualization of topic distributions based on a matrix of circles, and a word ordering such that topics span contiguous words.
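
A sketch of both quantities, assuming a dense $p(k|w)$ matrix with strictly positive entries:

```
import numpy as np

def distinctiveness_saliency(p_k_given_w, p_w):
    # p_k_given_w: (K, V) matrix of p(k|w), strictly positive columns;
    # p_w: (V,) empirical word probabilities.
    p_k = p_k_given_w @ p_w                 # empirical topic distribution p(k)
    # D(w) = KL( p(k|w) || p(k) ), computed column by column
    D = np.sum(p_k_given_w * np.log(p_k_given_w / p_k[:, None]), axis=0)
    S = p_w * D                             # saliency: frequency-weighted D(w)
    return D, S
```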

`Campbell66`

Reference for the argument: given a distribution, the exponential entropy is a measure of the extent, or spread, of the distribution. For example, it can measure how much a word is shared across several topics.
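
A tiny sketch; for a word's distribution over $K$ topics, the value ranges from 1 (word owned by a single topic) to $K$ (word spread uniformly over all topics):

```
import numpy as np

def exponential_entropy(p):
    # exp(H(p)): the effective number of outcomes of a distribution p.
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))
```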

`Geman84`

Introduced Gibbs sampling.

`Bottou98`

Convergence of online algorithms; gives the conditions $\sum_{t=0}^\infty \rho_t = \infty$ and $\sum_{t=0}^\infty \rho_t^2 < \infty$ needed to prove the convergence of Online Variational LDA [Hoffman10].

`Lee00`

Reference for NMF.

`Bishop06`

Hefty book used as a reference on Bayesian methods.

`Tzikas08`

A step-by-step tutorial on the EM algorithm, following the Bishop book [Bishop06] closely. They describe MAP as the poor man's Bayesian inference, since it is a way of including prior knowledge without having to pay the expensive price of computing the normalizer.
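
Concretely, the point is that the posterior mode does not require the evidence term: $$ \hat{\boldsymbol\theta}_{\text{MAP}} = \arg\max_{\boldsymbol\theta} \; p(\boldsymbol\theta | \boldsymbol X) = \arg\max_{\boldsymbol\theta} \; p(\boldsymbol X | \boldsymbol\theta) \, p(\boldsymbol\theta) $$ since the normalizer $p(\boldsymbol X) = \int p(\boldsymbol X | \boldsymbol\theta) \, p(\boldsymbol\theta) \, d\boldsymbol\theta$ does not depend on $\boldsymbol\theta$ and drops out of the maximization.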

`Crump13`

On using Amazon Mechanical Turk for experiments. One experiment measures the performance of users while varying the amount of the financial incentive: either \$2 plus a bonus of up to \$2.5 based on task performance, or \$0.75 with no bonus. Results show that the amount of the incentive does not affect task performance, but does affect the rate at which workers sign up for the task.