Linear Topic Modeling

Interest in learning topics from a corpus of documents began with the publication of Latent Semantic Analysis (LSA) [Deerwester90], also called Latent Semantic Indexing (LSI) in the context of information retrieval.

LSA is a linear method based on the factorization of the document-word matrix $\boldsymbol X$, where $x_{dw}$ is the number of occurrences of word $w$ in document $d$. The goal is to find a low-rank approximation $\tilde{\boldsymbol X}$ of $\boldsymbol X$ by factorizing it into two matrices, one representing the documents and the other the topics.

In this model, documents are converted to the bag-of-words format, which ignores the ordering of words and keeps only their counts.
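As a minimal sketch, the document-word count matrix $\boldsymbol X$ can be built with scikit-learn's `CountVectorizer`; the toy corpus and variable names below are illustrative only:

```python
# A minimal sketch: building the bag-of-words matrix X (documents x words).
# The toy corpus is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse matrix, x_dw = count of word w in document d
vocab = vectorizer.get_feature_names_out()  # column labels (the vocabulary)
print(X.shape)                              # (number of documents, vocabulary size)
```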

Singular Value Decomposition.

The most common approach is to compute the Singular Value Decomposition (SVD) of $\boldsymbol X$,

\[\boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T\]

where $\boldsymbol U$ and $\boldsymbol V$ are orthogonal matrices and $\boldsymbol \Sigma$ is diagonal. By keeping only the $K$ largest singular values in $\boldsymbol \Sigma$ and the corresponding vectors in $\boldsymbol U$ and $\boldsymbol V^T$, we obtain the best rank-$K$ approximation of $\boldsymbol X$ according to the loss $\Vert \boldsymbol X - \tilde{\boldsymbol X}\Vert_F^2$. Rows of $\boldsymbol U$ represent the documents in a $K$-dimensional latent space, and rows of $\boldsymbol V^T$ represent the topics as weights over the vocabulary. Each document can then be expressed as a linear combination of topics.
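A small sketch of LSA via the truncated SVD with NumPy; it assumes `X` is a dense count matrix (a sparse matrix from the previous sketch would need `X.toarray()`, or `scipy.sparse.linalg.svds` instead) and `K` is the chosen number of topics:

```python
# A minimal LSA sketch: truncated SVD of the document-word matrix.
# Assumes X is a dense (num_docs x num_words) count matrix and K the number of topics.
import numpy as np

def lsa(X, K):
    # Full SVD: X = U @ diag(s) @ Vt, singular values sorted in decreasing order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep the K largest singular values and the matching singular vectors.
    U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]
    doc_embeddings = U_k * s_k          # each row: a document in the K-dim latent space
    topics = Vt_k                       # each row: a topic as weights over the vocabulary
    X_approx = doc_embeddings @ Vt_k    # best rank-K approximation of X
    return doc_embeddings, topics, X_approx
```

For a rough interpretation of each topic, one can look at the words with the largest weights in the corresponding row of `topics`.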

Non-negative Matrix Factorization.

Another common approach is to apply Non-negative Matrix Factorization (NMF) to the document-word matrix $\boldsymbol X$ [Lee99] [Pauca04]. Here, $\tilde{\boldsymbol X} = \boldsymbol U \boldsymbol V$, where $\boldsymbol U$ represents the documents and $\boldsymbol V$ the topics, both matrices being constrained to be non-negative.
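A hedged sketch using scikit-learn's `NMF` on the same count matrix; the names `X` and `K` are assumed from the sketches above, and with this rows-as-documents orientation the left factor holds document weights and the right factor the topics:

```python
# A minimal NMF sketch with scikit-learn.
# Assumes X is a non-negative (num_docs x num_words) count matrix and K the number of topics.
from sklearn.decomposition import NMF

model = NMF(n_components=K, init="nndsvd", max_iter=500, random_state=0)
U = model.fit_transform(X)   # (num_docs x K): non-negative document weights over topics
V = model.components_        # (K x num_words): non-negative topics over the vocabulary
X_approx = U @ V             # low-rank non-negative approximation of X
```

Because both factors are non-negative, each document is an additive mixture of topics and each topic an additive combination of words, which tends to make the factors easier to interpret than the signed SVD factors.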
