Linear Topic Modeling

Interest in automatically extracting topics from a corpus of documents dates back to the publication of Latent Semantic Analysis (LSA) [Deerwester90], also called Latent Semantic Indexing (LSI) in the context of information retrieval.

LSA is a linear method based on the factorization of the document-word matrix $X$, where $x_{dw}$ is the number of occurrences of word $w$ in document $d$. The goal is to find a low-rank approximation $\tilde{X}$ of $X$ by factorizing it into two matrices, one representing the documents and the other the topics.

In this model, documents are converted to the bag-of-words format, which ignores the ordering of words and keeps only the word counts.
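As an illustration, here is a minimal sketch of building such a document-word count matrix; the use of scikit-learn's CountVectorizer and the toy three-document corpus are assumptions for the example, not taken from the original text:

```python
# A minimal sketch: building the document-word count matrix X from a
# toy corpus (corpus and library choice are illustrative assumptions).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix, X[d, w] = count of word w in document d

print(vectorizer.get_feature_names_out())  # the vocabulary (one column per word)
print(X.toarray())                         # dense view of the bag-of-words counts
```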

Singular Value Decomposition.

The most common way is to use the Singular Value Decomposition (SVD) of $X$,

$$X = U \Sigma V^T$$

where $U$ and $V$ are orthonormal matrices and $\Sigma$ is diagonal. By keeping only the $K$ largest singular values in $\Sigma$ and the corresponding vectors in $U$ and $V^T$, we obtain the best rank-$K$ approximation $\tilde{X}$ of $X$ with respect to the loss $\|X - \tilde{X}\|_2^2$. The rows of $U$ represent the documents in a $K$-dimensional space, and the rows of $V^T$ represent the topics in the same space. Each document can thus be expressed as a linear combination of topics.
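The truncation step can be sketched with numpy as follows; the random toy count matrix and the choice $K = 2$ are illustrative assumptions:

```python
# A minimal sketch of rank-K LSA via truncated SVD on a dense toy
# document-word matrix (real corpora are large and sparse).
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(6, 10)).astype(float)  # toy 6-document, 10-word counts

K = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the K largest singular values and the matching singular vectors.
U_k, S_k, Vt_k = U[:, :K], np.diag(s[:K]), Vt[:K, :]

X_approx = U_k @ S_k @ Vt_k   # best rank-K approximation of X
doc_vectors = U_k @ S_k       # documents embedded in the K-dimensional topic space
topics = Vt_k                 # each row is a topic over the vocabulary

print(np.linalg.norm(X - X_approx))  # reconstruction error ||X - X~||
```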

Non-negative Matrix Factorization.

Another common approach is to apply Non-negative Matrix Factorization (NMF) to the document-word matrix $X$ [Lee99] [Pauca04]. Here, $\tilde{X} = UV$, where $U$ represents the documents and $V$ the topics, both being constrained to non-negative entries. Compared to SVD, the non-negativity constraint tends to yield additive, parts-based topics whose weights are directly interpretable.
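A minimal sketch using scikit-learn's NMF on a toy count matrix; the matrix, the number of components, and the solver settings are illustrative assumptions:

```python
# A minimal sketch of NMF topic extraction (X must be non-negative,
# e.g. bag-of-words counts; all parameters here are illustrative).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(6, 10)).astype(float)  # toy document-word counts

model = NMF(n_components=2, init="nndsvd", max_iter=500, random_state=0)
U = model.fit_transform(X)  # (documents x K): non-negative document loadings
V = model.components_       # (K x words): non-negative topics

X_approx = U @ V            # non-negative rank-K approximation of X
```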
