My personal bibliography for topic modeling, with some notes about how each paper is useful to me.

`Landauer97`

`Deerwester90`

LSA (or LSI) is a linear topic model based on the factorization of the document-word matrix $\boldsymbol X$, where $x_{dw}$ is the count of occurrences of word $w$ in document $d$. The goal is to find a low-rank approximation $\tilde{\boldsymbol X}$ of $\boldsymbol X$ by factorizing it into two matrices, one representing the documents and the other the topics.

Use the SVD $\boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T$. By keeping only the $K$ largest singular values of $\boldsymbol \Sigma$ and the corresponding vectors in $\boldsymbol U$ and $\boldsymbol V^T$, we get the best rank-$K$ approximation of $\boldsymbol X$. Rows of $\boldsymbol U$ represent documents, and rows of $\boldsymbol V^T$ represent topics. Each document can then be expressed as a linear combination of topics.
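
A minimal sketch of this on a toy count matrix, using numpy's SVD (the matrix and $K$ are made up for illustration):

```
import numpy as np

X = np.array([[2, 1, 0, 0],   # doc 0: word counts
              [1, 3, 0, 1],   # doc 1
              [0, 0, 4, 2]])  # doc 2

K = 2  # number of topics to keep
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-K approximation: keep the K largest singular values.
X_tilde = U[:, :K] @ np.diag(S[:K]) @ Vt[:K, :]

doc_repr = U[:, :K] * S[:K]  # documents in topic space
topics = Vt[:K, :]           # each row: one topic over words
```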

`Hofmann99`

Introduced pLSI, a probabilistic topic model based on the following generative process. No priors on $\boldsymbol \theta_d$ and $\boldsymbol \phi_k$.

```
for document d = 1..D:
    for position i = 1..N in document d:
        Draw a topic z_di from Discrete(theta_d)
        Draw a word w_di from Discrete(phi_{z_di})
```
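
A runnable version of this generative process (a sketch in numpy; in pLSI, $\boldsymbol \theta_d$ and $\boldsymbol \phi_k$ are free parameters fit by EM, so the Dirichlet draws below are only a convenient way to fabricate them for the demo):

```
import numpy as np

rng = np.random.default_rng(0)
D, V, K, N = 5, 20, 3, 50   # documents, vocabulary size, topics, words/doc

# In pLSI these would be estimated, not drawn; fabricated here for the demo.
theta = rng.dirichlet(np.ones(K), size=D)   # theta[d]: topic mixture of doc d
phi = rng.dirichlet(np.ones(V), size=K)     # phi[k]: word distribution of topic k

docs = []
for d in range(D):
    doc = []
    for _ in range(N):
        z = rng.choice(K, p=theta[d])   # z_di ~ Discrete(theta_d)
        w = rng.choice(V, p=phi[z])     # w_di ~ Discrete(phi_{z_di})
        doc.append(int(w))
    docs.append(doc)
```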

`Ding08`

Proved the equivalence between pLSI and NMF by showing that they both optimize the same objective function. As they are different algorithms, this allows designing a hybrid method that alternates between NMF and pLSI, each time jumping out of the local optimum reached by the other method.

Chronologically, Blei, Ng and Jordan first published [Blei02] in *NIPS*, presenting LDA and treating the topics $\boldsymbol \phi_k$ as free parameters.
Shortly after, Griffiths and Steyvers published [Griffiths02a] and [Griffiths02b], extending this model by adding a symmetric Dirichlet prior on $\boldsymbol \phi_k$.
Finally, Blei, Ng and Jordan published an extended version [Blei03] of their first paper in the *Journal of Machine Learning Research* (by far the most cited LDA paper), with a section on placing this Dirichlet smoothing on the multinomial parameters $\boldsymbol \phi_k$.

`Blei02`

First paper on LDA, quite short; not used. See [Blei03].

`Griffiths02a`

Derives a Gibbs sampler for LDA and introduces a Dirichlet prior on the topics $\boldsymbol \phi_k$.
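
A minimal sketch of the collapsed update they derive, assuming the usual count matrices kept in sync with the assignments $z$ and symmetric scalar priors `alpha` and `beta` (all names here are mine, not theirs):

```
import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    # One sweep of collapsed Gibbs sampling for LDA.
    # docs: list of word-id lists; z: matching topic assignments;
    # n_dk (D,K), n_kw (K,V), n_k (K,): counts kept in sync with z.
    K, V = n_kw.shape
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1  # remove assignment
            # p(z = k | rest) proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1  # add new assignment
```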

`Griffiths02b`

Almost the same paper as [Griffiths02a].

`Blei03`

Most cited paper for LDA, extended version of [Blei02].

`Griffiths04`

Less technical paper showing applications of LDA to several datasets.

`Heinrich04`

Heavily detailed tutorial on LDA and inference using Gibbs sampling.

`Steyvers06`

Written once LDA had been around for 3 years; they give an in-depth review and analysis of probabilistic topic models, full of deep insights. They propose measures capturing similarity between topics (KL, KLsym, JS, cos, L1, L2), between sets of words and documents, and between words.
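
A sketch of the inter-topic measures they list, for two topic-word distributions assumed smoothed (no zero entries) so that KL is finite:

```
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def topic_similarities(p, q):
    # p, q: two smoothed topic-word distributions (strictly positive).
    m = (p + q) / 2
    return {
        "KL": kl(p, q),
        "KLsym": kl(p, q) + kl(q, p),
        "JS": (kl(p, m) + kl(q, m)) / 2,
        "cos": float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q))),
        "L1": float(np.sum(np.abs(p - q))),
        "L2": float(np.linalg.norm(p - q)),
    }
```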

`Blei12`

A short, high-level review on topic models. Not technical.

`Hoffman10`

Presents an online version of the variational EM algorithm introduced in [Blei03].
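
The core of the algorithm is a Robbins-Monro blend of the current variational parameters with a minibatch estimate; a sketch of just that step (the E-step producing `lam_hat` is omitted, and `tau0`/`kappa` follow the paper's learning-rate schedule, with `kappa` in (0.5, 1]):

```
def online_step(lam, lam_hat, t, tau0=1.0, kappa=0.7):
    # Blend the running variational parameters lam with the minibatch
    # estimate lam_hat; rho_t satisfies the conditions of [Bottou98].
    rho_t = (tau0 + t) ** (-kappa)
    return (1 - rho_t) * lam + rho_t * lam_hat
```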

`Wei06`

As an extrinsic evaluation of topics, uses the discovered topics for information retrieval.

`Chang09`

Shows that, surprisingly, predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and sometimes even slightly anti-correlated.

They ran a large-scale experiment on the Amazon Mechanical Turk platform. For each topic, they took its five top words and added a random sixth word. Then they presented these lists of six words to people, asking them which one is the intruder word.

If all the people asked could tell which word is the intruder, then we can safely conclude that the topic is good at describing an idea. If, on the other hand, many people identified other words as the intruder, it means that they could not see the logic in the association of words, and we can conclude that the topic was not good enough.
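
A sketch of how one such question could be built (my simplification: the intruder is a prominent word of another topic that is not prominent in the topic being judged):

```
import random

def intruder_question(topics, k, rng=random.Random(0)):
    # topics: one ranked word list per topic, e.g. topics[k][0] is the
    # top word of topic k. Returns six shuffled words plus the answer.
    top5 = topics[k][:5]
    other = rng.choice([j for j in range(len(topics)) if j != k])
    # Intruder: a top word of another topic, assumed absent from the
    # top 50 words of topic k (assumes long enough word lists).
    intruder = next(w for w in topics[other] if w not in topics[k][:50])
    words = top5 + [intruder]
    rng.shuffle(words)
    return words, intruder
```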

`Wallach09a`

Gives tons of methods to compute approximations of the likelihood $p(\boldsymbol w_d | \boldsymbol \Phi, \alpha)$ for one unseen document, which is intractable but needed to evaluate topic models.

`Buntine09`

Like [Wallach09a], gives methods to estimate the likelihood of unseen documents.

`AlSumait09`

Defines measures based on three prototypes of junk and insignificant topics, to rank discovered topics according to their significance score. The three junk prototypes are the uniform word distribution, the empirical corpus word distribution, and the uniform document distribution: $$ p(w | \text{topic}) \propto 1 \qquad p(w | \text{topic}) \propto \text{count}(w \text{ in corpus}) \qquad p(d | \text{topic}) \propto 1 $$ The topic significance score is then based on combinations of dissimilarities (KL divergence, cosine, and correlation) from these three junk prototypes.
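
A sketch of the word-based part of this score, computing only the KL dissimilarities from the two word prototypes (the paper also combines cosine and correlation, and the document prototype):

```
import numpy as np

def word_junk_dissimilarity(phi_k, corpus_counts):
    # phi_k: topic-word distribution of one topic (strictly positive);
    # corpus_counts: empirical word counts over the whole corpus.
    V = len(phi_k)
    uniform = np.full(V, 1.0 / V)
    empirical = corpus_counts / corpus_counts.sum()
    kl = lambda p, q: float(np.sum(p * np.log(p / q)))
    return {"KL_from_uniform": kl(phi_k, uniform),
            "KL_from_corpus": kl(phi_k, empirical)}
```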

`Newman10c`

Tries different coherence measures on different datasets to compare them.

`Mimno11b`

Presents a Bayesian method for measuring how well a topic model fits a corpus.

`Newman10a`

Introduced the UCI coherence measure $\sum_{i<j} \log \frac{p(w_i, w_j)}{p(w_i)p(w_j)}$ for the top words $w_1, \dots, w_{10}$ (a sum of PMIs). The measure is extrinsic, as it uses empirical probabilities from an external corpus such as Wikipedia.
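
A sketch of the measure, assuming the single and joint probabilities have been precomputed from the external corpus (the `eps` guard against $\log 0$ is mine):

```
from itertools import combinations
from math import log

def uci_coherence(top_words, p_single, p_joint, eps=1e-12):
    # p_single[w]: word probability in the external corpus (e.g. Wikipedia);
    # p_joint[(wi, wj)]: co-occurrence probability of the pair.
    return sum(log((p_joint.get((wi, wj), 0.0) + eps)
                   / (p_single[wi] * p_single[wj]))
               for wi, wj in combinations(top_words, 2))
```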

`Mimno11a`

Introduced the UMass coherence measure $\sum_{i<j} \log \frac{1+D(w_i, w_j)}{D(w_i)}$ for the top words $w_1, \dots, w_{10}$, where $D(w)$ is the number of documents containing $w$ and $D(w_i, w_j)$ the number containing both. The measure is intrinsic: the counts come from the training corpus itself.
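
A sketch of the measure, assuming precomputed document frequencies and co-document frequencies:

```
from itertools import combinations
from math import log

def umass_coherence(top_words, doc_freq, co_doc_freq):
    # top_words must be ordered by decreasing probability in the topic;
    # doc_freq[w]: documents containing w; co_doc_freq[(wi, wj)]: both.
    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):  # i < j
        wi, wj = top_words[i], top_words[j]
        score += log((1 + co_doc_freq.get((wi, wj), 0)) / doc_freq[wi])
    return score
```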

`Stevens12`

Explores computing the two coherence metrics, UCI from [Newman10a] and UMass from [Mimno11a], on multiple datasets and for different numbers of topics, then aggregates the results (computing average and entropy). They assume these two are good metrics and use them to compare different topic models: LDA, LSA+SVD, and LSA+NMF.

`Andr09`

Makes the discovery of topics semi-supervised: a user repeatedly gives instructions on the top words of discovered topics: "those words should be in the same topic", "those words should not be in the same topic", and "those words should be by themselves". The instructions are encoded into pair-wise constraints on words: two words have to, or cannot, be in the same topic. The model is then trained again with a complex new prior encoding the constraints, based on Dirichlet Forests.

`Hu11`

Extends the approach from [Andr09], proposing interactive topic modeling (ITM), where we don't have to restart the Gibbs sampler after each human action. Instead, the prior is updated in place to incorporate the new constraints, and the modified model is seen as the starting position of a new Markov chain. Updating the model is done by state ablation: some topic-word assignments are invalidated by setting $z = -1$, and the counts are decremented accordingly.

They explore several invalidation strategies: invalidate all assignments, only those of documents containing any of the constrained terms, only those of the constrained terms themselves, or none. After each human action, the Gibbs sampler runs for 30 more iterations before asking for human feedback again. Experiments were done using Amazon Mechanical Turk.
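
A sketch of the "terms only" ablation strategy, reusing the count matrices from the Gibbs sketch above (names are mine):

```
def ablate_terms(docs, z, n_dk, n_kw, n_k, terms):
    # Invalidate the assignments of the constrained terms only: set z = -1
    # and decrement the counts, so the next chain resamples these positions.
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            if w in terms and z[d][i] != -1:
                k = z[d][i]
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                z[d][i] = -1
```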

`Pauca04`

Reference for successful use of NMF for topic modeling.

`Lee99`

Reference for successful use of NMF for topic modeling.

`Doyle09`

Elkan's paper about burstiness.

`Wallach09b`

Study the effect of different priors on LDA output.

`Andr11`

Use discovered topics in a search engine for query expansion (like we do in Squid).

`Chang10`

Presents a "manual" topic modeling where humans are asked to repeatedly tag documents and cluster these annotations.

`Chuang12`

Defines the distinctiveness $\mathcal{D}(w)$ of a word as the KL divergence between the topic distribution $p(k|w)$ given the word $w$ and the empirical topic distribution $p(k)$ of the corpus. The distinctiveness of a word can also be weighted by its frequency; this is how the saliency $\mathcal{S}(w)$ of a word is defined. $$ \mathcal{D}(w) = \sum_{\text{topic } k} p\left(k|w\right) \log \frac{p(k|w)}{p(k)} = \text{KL}\big(p(k|w) \ \Vert \ p(k)\big) \qquad \mathcal{S}(w) = p(w) \mathcal{D}(w) $$ Also presents a new visualization of topic distributions based on a matrix of circles, and a word ordering such that topics span contiguous words.
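
A sketch of both quantities, assuming a dense $p(k|w)$ matrix with strictly positive entries:

```
import numpy as np

def distinctiveness_saliency(p_k_given_w, p_w):
    # p_k_given_w: (K, V) matrix of p(k|w), strictly positive columns;
    # p_w: (V,) empirical word probabilities.
    p_k = p_k_given_w @ p_w                 # empirical topic distribution p(k)
    # D(w) = KL( p(k|w) || p(k) ), computed column by column
    D = np.sum(p_k_given_w * np.log(p_k_given_w / p_k[:, None]), axis=0)
    S = p_w * D                             # saliency: frequency-weighted D(w)
    return D, S
```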

`Campbell66`

Reference for the argument: given a distribution, the exponential entropy is a measure of the extent, or spread, of the distribution. For example, it can measure how much a word is shared across several topics.
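
A tiny sketch; for a word's distribution over $K$ topics, the value ranges from 1 (word owned by a single topic) to $K$ (word spread uniformly over all topics):

```
import numpy as np

def exponential_entropy(p):
    # exp(H(p)): the effective number of outcomes of a distribution p.
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))
```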

`Geman84`

Introduced Gibbs sampling.

`Bottou98`

Convergence of online algorithms; gives the conditions $\sum_{t=0}^\infty \rho_t = \infty$ and $\sum_{t=0}^\infty \rho_t^2 < \infty$ needed to prove the convergence of Online Variational LDA [Hoffman10].

`Lee00`

Reference for NMF.

`Bishop06`

Hefty book used as a reference on Bayesian methods.

`Tzikas08`

A step-by-step tutorial on the EM algorithm, following the Bishop book [Bishop06] closely. They describe MAP as the poor man's Bayesian inference, since it is a way of including prior knowledge without having to pay the expensive price of computing the normalizer.
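
Concretely, the point is that the posterior mode does not require the evidence term: $$ \hat{\boldsymbol\theta}_{\text{MAP}} = \arg\max_{\boldsymbol\theta} \; p(\boldsymbol\theta | \boldsymbol X) = \arg\max_{\boldsymbol\theta} \; p(\boldsymbol X | \boldsymbol\theta) \, p(\boldsymbol\theta) $$ since the normalizer $p(\boldsymbol X) = \int p(\boldsymbol X | \boldsymbol\theta) \, p(\boldsymbol\theta) \, d\boldsymbol\theta$ does not depend on $\boldsymbol\theta$ and drops out of the maximization.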

`Crump13`

On using Amazon Mechanical Turk for experiments. One experiment measures the performance of users while varying the amount of the financial incentive: either \$2 plus a bonus of up to \$2.5 based on task performance, or \$0.75 with no bonus. Results show that the amount of the incentive does not affect task performance, but does affect the rate at which workers sign up for the task.