Linear Topic Modeling
Apr 17th, 2013

The interest in learning latent topics from a corpus of documents started with the publication of Latent Semantic Analysis (LSA) [Deerwester90], also called Latent Semantic Indexing (LSI) in the context of information retrieval.
LSA is a linear method based on the factorization of the document-word matrix $\boldsymbol X$, where $x_{dw}$ is the count of occurrences of word $w$ in document $d$. The goal is to find a low-rank approximation $\tilde{\boldsymbol X}$ of $\boldsymbol X$ by factorizing it into two matrices, one representing the documents and the other the topics.
In this model, documents are converted to the bag-of-words representation, which ignores the ordering of words and keeps only the word counts.
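As a small sketch of this preprocessing step, the document-word count matrix can be built with scikit-learn's CountVectorizer; the three-document corpus below is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix; x_dw = count of word w in document d

print(vectorizer.get_feature_names_out())  # the vocabulary (columns of X)
print(X.toarray())                         # the document-word counts
```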
Singular Value Decomposition.
The most common approach is to use the Singular Value Decomposition (SVD) of $\boldsymbol X$:
\[\boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T\]where $\boldsymbol U$ and $\boldsymbol V$ have orthonormal columns and $\boldsymbol \Sigma$ is diagonal. By keeping only the $K$ largest singular values in $\boldsymbol \Sigma$ and the corresponding vectors in $\boldsymbol U$ and $\boldsymbol V^T$, we obtain the best rank-$K$ approximation of $\boldsymbol X$ with respect to the Frobenius loss $\Vert \boldsymbol X - \tilde{\boldsymbol X}\Vert_F^2$. The rows of $\boldsymbol U$ represent the documents in a $K$-dimensional latent space, and the rows of $\boldsymbol V^T$ represent the topics in word space. Each document can therefore be expressed as a linear combination of topics.
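As a quick illustration, the truncated SVD can be computed with NumPy on the $\boldsymbol X$ built above; the choice $K = 2$ is arbitrary for the toy corpus, and for large sparse matrices something like scikit-learn's TruncatedSVD would be more practical:

```python
import numpy as np

K = 2  # number of topics to keep (illustrative choice)

# Dense SVD of the count matrix; fine for a toy example.
U, s, Vt = np.linalg.svd(X.toarray(), full_matrices=False)

# Truncate to the K largest singular values and corresponding vectors.
U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]

doc_embeddings = U_k * s_k  # documents in the K-dimensional latent space
topics = Vt_k               # each row is a topic, a vector over words

# Best rank-K approximation and its Frobenius reconstruction error.
X_approx = doc_embeddings @ topics
print(np.linalg.norm(X.toarray() - X_approx, "fro"))
```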
Non-negative Matrix Factorization.
Another common approach is to apply Non-negative Matrix Factorization (NMF) to the document-word matrix $\boldsymbol X$ [Lee99] [Pauca04]. Here, $\tilde{\boldsymbol X} = \boldsymbol U \boldsymbol V$, where the rows of $\boldsymbol U$ represent the documents and the rows of $\boldsymbol V$ the topics, both matrices being constrained to be non-negative.
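For illustration, the same matrix can be factorized with scikit-learn's NMF, reusing the X and vectorizer from above; the number of topics and the nndsvd initialization are illustrative choices, not part of the original method:

```python
from sklearn.decomposition import NMF

# Non-negative factorization X ~= U V with 2 topics (illustrative).
nmf = NMF(n_components=2, init="nndsvd")
U = nmf.fit_transform(X)  # documents x topics, non-negative
V = nmf.components_       # topics x words, non-negative

# Inspect each topic through its highest-weight words.
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(V):
    top = topic.argsort()[::-1][:3]
    print(f"topic {k}:", [words[i] for i in top])
```

Unlike the SVD, the non-negativity constraints tend to yield additive, parts-based topics that are easier to interpret directly as groups of words.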