Imagine you have a bunch of documents that you'd like to classify based on their topics. The intuitive idea is to pick out the keywords of each document and then cluster the documents into different categories. Since you can't pick the keywords manually, you can represent the documents with a matrix, called the term-document matrix or occurrence matrix. LSA, also known as LSI (Latent Semantic Indexing), is an important extension of the vector space model built on this matrix.
Term-Document Matrix
Literally, this matrix represents the occurrence of each word in each document. The rows represent the terms, while the columns represent the documents. More specifically, after you collect the documents, e.g. 100 documents, you build the corresponding vocabulary, e.g. 5000 words. Each document can then be represented as a vector of dimension 5000, with each element counting how many times the corresponding word occurs in that document. Vice versa, every word can be represented as a vector of dimension 100, with each element counting how many times that word occurs in the corresponding document.
Now we have a matrix that looks like this:
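For instance (an illustrative toy example standing in for the original figure), with three documents and a four-word vocabulary:

$$
X \;=\;
\begin{array}{c|ccc}
 & d_1 & d_2 & d_3 \\ \hline
\text{he} & 1 & 0 & 3 \\
\text{I} & 2 & 0 & 4 \\
\text{movie} & 0 & 5 & 1 \\
\text{music} & 0 & 2 & 0
\end{array}
$$

Each row is the count vector of a term across the documents, and each column is the count vector of a document over the vocabulary.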
LSA assumes that words that are close in meaning should appear a similar number of times across documents. In this matrix, if the row for the word 'he' is (1, 0, 3, 5, 9, …) and the row for the word 'I' looks similar, we can be fairly confident that 'he' is close to 'I'.
Cosine Similarity
To measure the similarity between two words or documents, the common way is to calculate the normalized inner product between their vectors, which reflects the angle between them.
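Concretely, for two vectors $a$ and $b$ (two rows or two columns of the term-document matrix), the cosine similarity is

$$\cos(\theta) \;=\; \frac{a \cdot b}{\lVert a \rVert\, \lVert b \rVert} \;=\; \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\, \sqrt{\sum_i b_i^2}}$$

which equals $1$ for vectors pointing in the same direction and $0$ for orthogonal (unrelated) vectors, since all counts are non-negative.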
Lower Matrix Rank
Now how do you classify the documents or terms based on this matrix? We want to reduce the dimension of this matrix, that is, map it to a lower-rank matrix. This approach alleviates the data sparsity problem and yields shorter embeddings for words and documents. In linear algebra, the singular value decomposition (SVD) handles this.
- Theorem of SVD

For an $m \times n$ real matrix $A$, whose elements come from the field $\mathbb{R}$, it can be decomposed in the form:

$$A = U \Sigma V^T$$

where:

$U$ is an $m \times m$ real unitary matrix, more specifically, an orthogonal matrix.

$V$ is an $n \times n$ orthogonal matrix.

$\Sigma$ is an $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal, known as the singular values of $A$.

Note, the decomposition is not unique.

The columns of $U$ and the columns of $V$ are called the left-singular vectors and right-singular vectors of matrix $A$, respectively. The interesting property is:

$$A A^T = U \Sigma V^T V \Sigma^T U^T = U (\Sigma \Sigma^T) U^T, \qquad A^T A = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T$$

where $U$ and $V$ must contain the eigenvectors of $A A^T$ and $A^T A$, respectively, since $\Sigma \Sigma^T$ (and likewise $\Sigma^T \Sigma$) is a diagonal matrix.
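As a quick sanity check, here is a minimal sketch in base R (the toy matrix A is made up) that computes the SVD and verifies both the decomposition and the eigenvector relation:

# minimal sketch in base R: SVD of a made-up 4 x 3 matrix
A <- matrix(c(1, 0, 3,
              0, 2, 1,
              4, 1, 0,
              2, 0, 5), nrow = 4, byrow = TRUE)
s <- svd(A)                              # s$u, s$d (singular values), s$v
A_rec <- s$u %*% diag(s$d) %*% t(s$v)    # U %*% Sigma %*% t(V)
all.equal(A, A_rec)                      # TRUE, up to floating-point error
# the squared singular values are the eigenvalues of t(A) %*% A (and of A %*% t(A))
all.equal(s$d^2, eigen(t(A) %*% A)$values)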
- SVD in Term-Document Matrix
The singular value decomposition is used to approximate the term-document matrix $X$, whose rows are the term vectors $t_i^T$ and whose columns are the document vectors $d_j$, by matrices of lower rank. The full decomposition looks like:

$$X = U \Sigma V^T$$

where the columns of $U$ and $V$ are the left and right singular vectors.

The original SVD intuitively looks like this: the $i$-th row of $X$ is the term vector $t_i^T$ and the $j$-th column is the document vector $d_j$; correspondingly, the $i$-th row of $U$ is written $\hat{t}_i^T$ and the $j$-th column of $V^T$ is written $\hat{d}_j$, so that

$$t_i^T = \hat{t}_i^T\, \Sigma V^T, \qquad d_j = U \Sigma\, \hat{d}_j$$

where the only part of $U$ that contributes to $t_i$ is called $\hat{t}_i$, and similarly the only part of $V^T$ that contributes to $d_j$ is called $\hat{d}_j$. Note, the term vectors $\hat{t}_i$ and the doc vectors $\hat{d}_j$ are not the eigenvectors themselves, but they depend on the eigenvectors. Now, our task is to approximate the original matrix $X$ with a lower-rank matrix $X_k$, which has a specific rank $k$. Measuring the error with the Frobenius norm, picking the $k$ largest singular values and their corresponding singular vectors to construct a truncated matrix $X_k$ gives the minimum error. We can write the approximated decomposition as:

$$X_k = U_k \Sigma_k V_k^T$$

The row "term" vector $\hat{t}_i^T$ then has $k$ entries, mapping it to a lower-dimensional space.

Now, if we'd like to compare the similarity between term $i$ and term $p$, we can just calculate the cosine similarity between $\hat{t}_i$ and $\hat{t}_p$. Or if we'd like to compare two documents $j$ and $q$, we look at the cosine similarity between the two vectors $\hat{d}_j$ and $\hat{d}_q$.

Now the problem is how do we calculate $\hat{t}_i$ and $\hat{d}_j$. Following the derivation from $t_i^T = \hat{t}_i^T\, \Sigma_k V_k^T$:

$$\hat{t}_i = \Sigma_k^{-1} V_k^T\, t_i$$

For documents, it will be like:

$$\hat{d}_j = \Sigma_k^{-1} U_k^T\, d_j$$
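To make the truncation concrete, here is a minimal sketch in base R (the matrix X and the rank k = 2 are made-up assumptions) that builds the rank-k approximation and compares two documents in the reduced space:

# minimal sketch in base R: rank-k approximation and document similarity
set.seed(1)
X <- matrix(rpois(50 * 8, 1), nrow = 50)        # made-up 50 terms x 8 documents
s <- svd(X)
k   <- 2                                        # chosen rank
U_k <- s$u[, 1:k]; S_k <- diag(s$d[1:k]); V_k <- s$v[, 1:k]
X_k <- U_k %*% S_k %*% t(V_k)                   # best rank-k approximation in Frobenius norm
d_hat <- V_k                                    # row j is the reduced vector of document j
cos_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cos_sim(d_hat[1, ], d_hat[2, ])                 # similarity of documents 1 and 2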
TF-IDF Weights
The term-document matrix that we have investigated so far contains the raw count of each term in each document. We can also use the term frequency, i.e. the raw count normalized by the document length:

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
For now, we are still categorizing the documents by analyzing how many times words occur. But there is a problem: some words, e.g. "the", "a", "this", occur in almost every document. Those trivial words do not provide useful information for measuring the similarity among documents, while some important, less frequent words, e.g. "movie", "music", "sport", do play an important role. So it is suggested to use TF-IDF weights to construct the term-document matrix.
The TF here is just the term frequency as above, while the inverse document frequency measures the importance of a specific term, as below:

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

with

$N$: total number of documents in the corpus, $N = |D|$.

$|\{d \in D : t \in d\}|$: number of documents where the term $t$ appears. If the term is not in the corpus, this will lead to a division by zero. It is therefore common to adjust the denominator to $1 + |\{d \in D : t \in d\}|$.

It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. That means the more a specific word occurs in a document and the fewer documents contain that word, the more important that word is to the document.
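As a minimal sketch (base R, with a made-up count matrix), the TF-IDF weighting can be computed like this; in the implementation below, the tm package also provides tm::weightTfIdf, which can be passed to cast_dtm's weighting argument for the same effect:

# minimal sketch in base R: TF-IDF weights from a made-up raw count matrix
counts <- matrix(c(2, 0, 1,
                   1, 1, 1,
                   0, 0, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("movie", "the", "sport"), paste0("doc", 1:3)))
N   <- ncol(counts)               # total number of documents
df  <- rowSums(counts > 0)        # number of documents containing each term
idf <- log(N / df)                # "the" appears in every document, so its idf is 0
tfidf <- counts * idf             # idf is recycled down the rows (one value per term)
tfidf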
Implementation in R
Finally, we have the implementation, written in R. The corresponding GitHub repo is here!
# load packages
library(data.table)
library(stringr)
library(dbscan)
library(magrittr)
library(dplyr)
library(lsa)
library(tidytext)

# load text file to be analyzed (one document per line)
data <- fread("~/data.txt", sep = '\n',
              header = F, col.names = "documents")

# load stop words document (one word per line)
stop_words <- fread("~/stop_words.txt", sep = '\n', header = F,
                    col.names = "words")

# add a document index
data[, doc := .I]

# clean data: lower-case and strip punctuation
data[, documents := tolower(documents)]
data[, documents := str_replace_all(documents, "[,.'?!:-]", " ")]

# unnest words in documents, drop stop words, count occurrences per (doc, word)
words_doc <- unnest_tokens(data, word, documents)
words_doc <- words_doc[word != "" & !(word %in% stop_words$words), ]
words_doc <- words_doc[, .N, by = c("doc", "word")]

# create the doc-term matrix (transpose of the term-document matrix)
doc_term_matrix <- cast_dtm(words_doc, doc, word, N, weighting = tm::weightTf)

# perform lsa
mylsa <- lsa(as.matrix(doc_term_matrix))

# cluster the documents using DBSCAN on the reduced document factors
db_clust <- dbscan(mylsa$tk,          # you can also pick part of the factors
                   eps = db_eps,      # define an appropriate eps
                   minPts = db_minpoints)

# add the cluster of documents into the original data
data[, db_cluster := as.factor(db_clust$cluster)]
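A note on the two placeholders: db_eps and db_minpoints are left for you to define. One common heuristic (my suggestion, not part of the original code) is to inspect the k-nearest-neighbor distance plot from the dbscan package and use the distance at the "elbow" as eps:

# optional helper: inspect kNN distances to choose eps (elbow of the curve)
kNNdistplot(mylsa$tk, k = db_minpoints)
abline(h = db_eps, lty = 2)   # visually check the chosen eps against the elbow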
Other Application
LSA can also be utilized for information retrieval, where it was first applied in 1988. Since we can get the embedding of documents, we can match documents by calculating the cosine similarity between them. However, the queries that we'd like to check are mostly new documents. We can use the $\hat{d}_j = \Sigma_k^{-1} U_k^T\, d_j$ equation to add new vectors: for a new document $q$, we create a new column in $X$ using the original weight functions and parameters, then multiply it by $\Sigma_k^{-1} U_k^T$.
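Here is a minimal self-contained sketch of this fold-in step (base R, made-up counts); the lsa package also ships a fold_in() helper for the same purpose:

# minimal sketch in base R: fold a new document into an existing LSA space
set.seed(1)
X <- matrix(rpois(50 * 8, 1), nrow = 50)       # made-up 50 terms x 8 documents
s <- svd(X); k <- 2
U_k <- s$u[, 1:k]; S_k <- diag(s$d[1:k]); V_k <- s$v[, 1:k]
q <- rpois(50, 1)                              # made-up term counts of the new query document
q_hat <- drop(solve(S_k) %*% t(U_k) %*% q)     # d_hat = Sigma_k^{-1} U_k^T d
cos_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
sapply(1:8, function(j) cos_sim(q_hat, V_k[j, ]))   # similarity to each existing document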
Drawbacks of LSA
- LSA can't reflect the order of words, since we have decomposed the documents into word frequencies over the vocabulary.
- LSA can't capture the polysemy of words. It treats a word with different meanings as the same one, so the specific sense of a word in context can't be reflected.
- It may interpret word combinations inappropriately. Some words have similar meanings in one context but completely different meanings in another. For example, "iced" + "cream" = "ice-cream", while "iced" + "moisturizer" = "?".