Imagine you have a bunch of documents that you'd like to classify based on their topics. The intuitive idea is to pick out the keywords of each document and then cluster the documents into different categories. Since you can't pick the keywords manually, you can represent the documents with a matrix, called the term-document matrix or occurrence matrix. LSA, also known as LSI (Latent Semantic Indexing), is an important extension of the vector space model built on this matrix.
Term-Document Matrix
Literally, this matrix represents the occurrence of each word in each document. The rows represent the terms, while the columns represent the documents. More specifically, after you collect the documents, e.g. 100 documents, you build the corresponding vocabulary, e.g. 5000 words. Each document can then be represented as a vector of dimension 5000, with each element counting how many times the corresponding word occurs in that document. Vice versa, every word can be represented as a vector of dimension 100, with each element counting how many times that word occurs in the corresponding document.
Now we have a matrix that looks like this:
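For instance (an illustrative toy example standing in for the original figure), with three documents and a four-word vocabulary:

$$
X \;=\;
\begin{array}{c|ccc}
 & d_1 & d_2 & d_3 \\ \hline
\text{he} & 1 & 0 & 3 \\
\text{I} & 2 & 0 & 4 \\
\text{movie} & 0 & 5 & 1 \\
\text{music} & 0 & 2 & 0
\end{array}
$$

Each row is the count vector of a term across the documents, and each column is the count vector of a document over the vocabulary.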
LSA assumes that words that are close in meaning should appear a similar number of times across documents. In this matrix, if the row for the word 'he' is (1, 0, 3, 5, 9, …) and the row for the word 'I' looks similar, we can be fairly confident that 'he' is close to 'I'.
Cosine Similarity
To measure the similarity between two words or documents, the common way is to calculate the normalized inner product between their vectors, which reflects the angle between them.
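Concretely, for two vectors $a$ and $b$ (two rows or two columns of the term-document matrix), the cosine similarity is

$$\cos(\theta) \;=\; \frac{a \cdot b}{\lVert a \rVert\, \lVert b \rVert} \;=\; \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\, \sqrt{\sum_i b_i^2}}$$

which equals $1$ for vectors pointing in the same direction and $0$ for orthogonal (unrelated) vectors, since all counts are non-negative.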
Lower Matrix Rank
Now how do you classify the documents or terms based on this matrix? We want to reduce the dimension of this matrix, that is, map it to a lower-rank matrix. This approach alleviates the data sparsity problem and yields shorter embeddings for words and documents. In linear algebra, the singular value decomposition (SVD) handles this.
- Theorem of SVD

For an $m \times n$ real matrix $A$, whose elements come from the field $\mathbb{R}$, it can be decomposed in the form:

$$A = U \Sigma V^T$$

where:

$U$ is an $m \times m$ real unitary matrix, more specifically, an orthogonal matrix.

$V$ is an $n \times n$ orthogonal matrix.

$\Sigma$ is an $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal, known as the singular values of $A$.

Note, the decomposition is not unique.

The columns of $U$ and the columns of $V$ are called the left-singular vectors and right-singular vectors of matrix $A$, respectively. The interesting property is:

$$A A^T = U \Sigma V^T V \Sigma^T U^T = U (\Sigma \Sigma^T) U^T, \qquad A^T A = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T$$

where $U$ and $V$ must contain the eigenvectors of $A A^T$ and $A^T A$, respectively, since $\Sigma \Sigma^T$ (and likewise $\Sigma^T \Sigma$) is a diagonal matrix.
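As a quick sanity check, here is a minimal sketch in base R (the toy matrix A is made up) that computes the SVD and verifies both the decomposition and the eigenvector relation:

# minimal sketch in base R: SVD of a made-up 4 x 3 matrix
A <- matrix(c(1, 0, 3,
              0, 2, 1,
              4, 1, 0,
              2, 0, 5), nrow = 4, byrow = TRUE)
s <- svd(A)                              # s$u, s$d (singular values), s$v
A_rec <- s$u %*% diag(s$d) %*% t(s$v)    # U %*% Sigma %*% t(V)
all.equal(A, A_rec)                      # TRUE, up to floating-point error
# the squared singular values are the eigenvalues of t(A) %*% A (and of A %*% t(A))
all.equal(s$d^2, eigen(t(A) %*% A)$values)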
- SVD in Term-Document Matrix
The singular value decomposition is used to approximate the term-document matrix $X$, whose rows are the term vectors $t_i^T$ and whose columns are the document vectors $d_j$, by matrices of lower rank. The full decomposition looks like:

$$X = U \Sigma V^T$$

where the columns of $U$ and $V$ are the left and right singular vectors.

The original SVD intuitively looks like this: the $i$-th row of $X$ is the term vector $t_i^T$ and the $j$-th column is the document vector $d_j$; correspondingly, the $i$-th row of $U$ is written $\hat{t}_i^T$ and the $j$-th column of $V^T$ is written $\hat{d}_j$, so that

$$t_i^T = \hat{t}_i^T\, \Sigma V^T, \qquad d_j = U \Sigma\, \hat{d}_j$$

where the only part of $U$ that contributes to $t_i$ is called $\hat{t}_i$, and similarly the only part of $V^T$ that contributes to $d_j$ is called $\hat{d}_j$. Note, the term vectors $\hat{t}_i$ and the doc vectors $\hat{d}_j$ are not the eigenvectors themselves, but they depend on the eigenvectors. Now, our task is to approximate the original matrix $X$ with a lower-rank matrix $X_k$, which has a specific rank $k$. Measuring the error with the Frobenius norm, picking the $k$ largest singular values and their corresponding singular vectors to construct a truncated matrix $X_k$ gives the minimum error. We can write the approximated decomposition as:

$$X_k = U_k \Sigma_k V_k^T$$

The row "term" vector $\hat{t}_i^T$ then has $k$ entries, mapping it to a lower-dimensional space.

Now, if we'd like to compare the similarity between term $i$ and term $p$, we can just calculate the cosine similarity between $\hat{t}_i$ and $\hat{t}_p$. Or if we'd like to compare two documents $j$ and $q$, we look at the cosine similarity between the two vectors $\hat{d}_j$ and $\hat{d}_q$.

Now the problem is how do we calculate $\hat{t}_i$ and $\hat{d}_j$. Following the derivation from $t_i^T = \hat{t}_i^T\, \Sigma_k V_k^T$:

$$\hat{t}_i = \Sigma_k^{-1} V_k^T\, t_i$$

For documents, it will be like:

$$\hat{d}_j = \Sigma_k^{-1} U_k^T\, d_j$$
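To make the truncation concrete, here is a minimal sketch in base R (the matrix X and the rank k = 2 are made-up assumptions) that builds the rank-k approximation and compares two documents in the reduced space:

# minimal sketch in base R: rank-k approximation and document similarity
set.seed(1)
X <- matrix(rpois(50 * 8, 1), nrow = 50)        # made-up 50 terms x 8 documents
s <- svd(X)
k   <- 2                                        # chosen rank
U_k <- s$u[, 1:k]; S_k <- diag(s$d[1:k]); V_k <- s$v[, 1:k]
X_k <- U_k %*% S_k %*% t(V_k)                   # best rank-k approximation in Frobenius norm
d_hat <- V_k                                    # row j is the reduced vector of document j
cos_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cos_sim(d_hat[1, ], d_hat[2, ])                 # similarity of documents 1 and 2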
TF-IDF Weights
The term-document matrix that we have investigated so far contains the raw count of each term in each document. We can also use the term frequency, i.e. the raw count normalized by the document length:

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
For now, we are still categorizing the documents by analyzing how many times words occur. But there is a problem: some words, e.g. "the", "a", "this", occur in almost every document. Those trivial words do not provide useful information for measuring the similarity among documents, while some important, less frequent words, e.g. "movie", "music", "sport", do play an important role. So it is suggested to use TF-IDF weights to construct the term-document matrix.
The TF here is just the term frequency as above, while the inverse document frequency measures the importance of a specific term, as below:

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

with

$N$: total number of documents in the corpus, $N = |D|$.

$|\{d \in D : t \in d\}|$: number of documents where the term $t$ appears. If the term is not in the corpus, this will lead to a division by zero. It is therefore common to adjust the denominator to $1 + |\{d \in D : t \in d\}|$.

It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. That means the more a specific word occurs in a document and the fewer documents contain that word, the more important that word is to the document.
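As a minimal sketch (base R, with a made-up count matrix), the TF-IDF weighting can be computed like this; in the implementation below, the tm package also provides tm::weightTfIdf, which can be passed to cast_dtm's weighting argument for the same effect:

# minimal sketch in base R: TF-IDF weights from a made-up raw count matrix
counts <- matrix(c(2, 0, 1,
                   1, 1, 1,
                   0, 0, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("movie", "the", "sport"), paste0("doc", 1:3)))
N   <- ncol(counts)               # total number of documents
df  <- rowSums(counts > 0)        # number of documents containing each term
idf <- log(N / df)                # "the" appears in every document, so its idf is 0
tfidf <- counts * idf             # idf is recycled down the rows (one value per term)
tfidf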
Implementation in R
Finally, we have the implementation, written in R. The corresponding GitHub repo is here!
# load packages
library(data.table)
library(stringr)
library(dbscan)
library(magrittr)
library(dplyr)
library(lsa)
library(tidytext)

# load text file to be analyzed (one document per line)
data <- fread("~/data.txt", sep = '\n',
              header = F, col.names = "documents")

# load stop words document (one word per line)
stop_words <- fread("~/stop_words.txt", sep = '\n', header = F,
                    col.names = "words")

# add a document index
data[, doc := .I]

# clean data: lower-case and strip punctuation
data[, documents := tolower(documents)]
data[, documents := str_replace_all(documents, "[,.'?!:-]", " ")]

# unnest words in documents, drop stop words, count occurrences per (doc, word)
words_doc <- unnest_tokens(data, word, documents)
words_doc <- words_doc[word != "" & !(word %in% stop_words$words), ]
words_doc <- words_doc[, .N, by = c("doc", "word")]

# create the doc-term matrix (transpose of the term-document matrix)
doc_term_matrix <- cast_dtm(words_doc, doc, word, N, weighting = tm::weightTf)

# perform lsa
mylsa <- lsa(as.matrix(doc_term_matrix))

# cluster the documents using DBSCAN on the reduced document factors
db_clust <- dbscan(mylsa$tk,          # you can also pick part of the factors
                   eps = db_eps,      # define an appropriate eps
                   minPts = db_minpoints)

# add the cluster of documents into the original data
data[, db_cluster := as.factor(db_clust$cluster)]
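A note on the two placeholders: db_eps and db_minpoints are left for you to define. One common heuristic (my suggestion, not part of the original code) is to inspect the k-nearest-neighbor distance plot from the dbscan package and use the distance at the "elbow" as eps:

# optional helper: inspect kNN distances to choose eps (elbow of the curve)
kNNdistplot(mylsa$tk, k = db_minpoints)
abline(h = db_eps, lty = 2)   # visually check the chosen eps against the elbow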
Other Application
LSA can also be utilized for information retrieval, where it was first applied in 1988. Since we can get the embedding of documents, we can match documents by calculating the cosine similarity between them. However, the queries that we'd like to check are mostly new documents. We can use the $\hat{d}_j = \Sigma_k^{-1} U_k^T\, d_j$ equation to add new vectors: for a new document $q$, we create a new column in $X$ using the original weight functions and parameters, then multiply it by $\Sigma_k^{-1} U_k^T$.
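Here is a minimal self-contained sketch of this fold-in step (base R, made-up counts); the lsa package also ships a fold_in() helper for the same purpose:

# minimal sketch in base R: fold a new document into an existing LSA space
set.seed(1)
X <- matrix(rpois(50 * 8, 1), nrow = 50)       # made-up 50 terms x 8 documents
s <- svd(X); k <- 2
U_k <- s$u[, 1:k]; S_k <- diag(s$d[1:k]); V_k <- s$v[, 1:k]
q <- rpois(50, 1)                              # made-up term counts of the new query document
q_hat <- drop(solve(S_k) %*% t(U_k) %*% q)     # d_hat = Sigma_k^{-1} U_k^T d
cos_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
sapply(1:8, function(j) cos_sim(q_hat, V_k[j, ]))   # similarity to each existing document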
Drawbacks of LSA
- LSA can't reflect the order of words, since we have decomposed the documents into word frequencies over the vocabulary.
- LSA can't capture the polysemy of words. It treats a word with different meanings as the same one, so the specific sense of a word in context can't be reflected.
- It may interpret word combinations inappropriately. Some words have similar meanings in one context but completely different meanings in another. For example, "iced" + "cream" = "ice-cream", while "iced" + "moisturizer" = "?".