In the last post, we went through LSA to get an embedding matrix. However, clustering documents with it reveals an obvious shortcoming, since a document may have multiple themes. This post introduces LDA, which uses Bayesian inference to get the posterior probability of topics in each document, as well as the posterior probability of words in each topic.

Latent Dirichlet allocation (LDA) is an example of a topic model and was first presented as a graphical model for topic discovery. LDA allows multiple topics for each document by giving the probability of each topic; for example, a document may have 90% probability of topic A and 10% probability of topic B. It also gives the probability of words in each topic; for example, the word “summer” may appear with 15% probability in topic A.

There are two posterior probabilities that LDA cares about:

  • The probability of topics in each document.
  • The probability of words in each topic.

Since LDA relies on Bayesian inference, it is easiest to state the conditional probabilities in symbols, starting from Bayes' theorem:

$$P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{P(X)}.$$

LDA Model

LDA assumes the following generative process for a corpus consisting of $M$ documents, each of length $N_d$, with $K$ available topics. Four priors are assumed:

  1. We assume the topic distribution per document follows $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.
  2. We assume the word distribution per topic follows $\varphi_k \sim \mathrm{Dirichlet}(\beta)$.
  3. The topic for document $d$, position $n$ follows $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$.
  4. The word for document $d$, position $n$ follows $w_{d,n} \sim \mathrm{Multinomial}(\varphi_{z_{d,n}})$,

where $d \in \{1, \dots, M\}$, $n \in \{1, \dots, N_d\}$, $k \in \{1, \dots, K\}$, and $V$ denotes the vocabulary size, i.e. the total number of distinct words.
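
Putting the four steps together (this is just the standard LDA joint distribution, nothing specific to this post's data), the model defines

$$P(\boldsymbol{w}, \boldsymbol{z}, \theta, \varphi \mid \alpha, \beta) = \prod_{k=1}^{K} P(\varphi_k \mid \beta)\; \prod_{d=1}^{M} P(\theta_d \mid \alpha) \prod_{n=1}^{N_d} P(z_{d,n} \mid \theta_d)\, P(w_{d,n} \mid \varphi_{z_{d,n}}),$$

and the two posteriors we care about come from applying Bayes' theorem to this joint: divide by the evidence $P(\boldsymbol{w} \mid \alpha, \beta)$, which is intractable to compute exactly. That is why packages such as topicmodels fall back on approximate inference (collapsed Gibbs sampling or variational EM).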

More detailed derivations can be found in the Wikipedia article, under Inference. I am planning to work through them this semester in the class “Statistical Graphic Model”.

By fitting the LDA model, we get the conditional distribution $\theta_d$, i.e. the distribution of topics in document $d$, and $\varphi_k$, i.e. the distribution of words in topic $k$.

When implemented in R, you get a table of the probability of each word within a specific topic and a table of the probability of each topic within a specific document. The derivation behind it is certainly complex if done rigorously, but the result really does perform better than LSA.
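
As a minimal sketch of what that looks like (assuming a model already fitted with topicmodels::LDA and named audio_lda, matching the script further down), the tidytext package exposes both tables:

library(topicmodels)
library(tidytext)

# per-topic word probabilities: one row per (topic, term) pair, probability in column "beta"
word_topic <- tidy(audio_lda, matrix = "beta")

# per-document topic probabilities: one row per (document, topic) pair, probability in column "gamma"
doc_topic <- tidy(audio_lda, matrix = "gamma")

head(word_topic)
head(doc_topic)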

TF-IDF Weight

There is a shortcoming in the LDA model: it treats every word with the same weight. In the last post, we talked about the TF-IDF weight for reweighting terms. In this model, we also use the TF-IDF weight to filter out useless, high-frequency words such as “the” or “okay” (which ones exactly depends on where your data comes from). You can check the code below, where I set the threshold to 0.0005.
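
For reference, the weight computed by bind_tf_idf in the code below is, to my understanding, the usual one:

$$\mathrm{tf\text{-}idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\frac{M}{\lvert \{ d' : t \in d' \} \rvert},$$

where $n_{t,d}$ is the count of term $t$ in document $d$ and $M$ is the number of documents. Ubiquitous words like “the” appear in nearly every document, so their idf factor is close to zero, which is why a small threshold such as 0.0005 is enough to drop them.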

Now, here is the example code for LDA in R, adapted from the book Text Mining with R. The data here is a txt file with each line being a document.

Define the topics manually

Note that we have to set how many topics we would like to assign to the documents before we run the model. Also, the topics are “latent”, so the computer cannot give us the topic names directly; we have to figure out what the topics are by inspecting the documents and top words in each topic ourselves, as sketched below. In the code, you will see a part where you can rename the topics.
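
For example, one quick way to eyeball the topics before naming them (a sketch assuming the fitted model object audio_lda from the script below; terms() and topics() are helpers from the topicmodels package):

library(topicmodels)

# the 10 highest-probability terms for each topic, as a terms-by-topic character matrix
terms(audio_lda, 10)

# the single most probable topic for each document
head(topics(audio_lda))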

Implementation in R

The corresponding GitHub repo is here. Or you can take a quick look at the code below:

library(data.table)

library(ggplot2)

library(stringr)

library(topicmodels)  

library(tidytext) 

library(tidyr)

library(grid)

### set variables ----

num_topic <- 6

### Load data ----

# import documents

data <- fread("~/data_path", sep = '\n', 
                 header = F, col.names = "documents")

# import normal stop words document

stop_words <- fread("~/stop_words_path", sep = '\n', header = F,
                    col.names = "words")

stopwords <- as.matrix(stop_words)

### clean audio text ----

# lowercase the text and remove null-content / "leave a message" calls (stored as "none")

data[, documents:= tolower(documents)]

data <- data[documents != "none"]

# remove punctuation

data[, documents := str_replace_all(documents, "[,.'?!:-]", " ")]

# unnest words in transcript

data[, doc := c(1:dim(data)[1])]

words_doc <- unnest_tokens(data, word, documents) 

words_doc <- words_doc[word != "" & !(word %in% stopwords),]

words_doc <- words_doc[, .N, by = c("doc", "word")]

words_doc <- bind_tf_idf(words_doc, word, doc, N)

# filter words using tf-idf weights: threshold picked as .0005

words_doc <- words_doc[tf_idf > 0.0005, ]

# create the doc-term matrix

doc_term_matrix <- cast_dtm(words_doc, doc, word, N, weighting = tm::weightTf)


### perform lda ----

audio_lda <- LDA(doc_term_matrix, k = num_topic, control = list(seed = 1001))

audio_topics <- tidy(audio_lda)


### create table for plots: distribution of topics in each document. ----

topic_doc <- data.table(cbind(as.numeric(audio_lda@documents), as.matrix(audio_lda@gamma)))

colnames(topic_doc) <- c("doc", str_c("topic_", 1:num_topic))

topic_doc <- melt(topic_doc, measure.vars = str_c("topic_", 1:num_topic),
                  variable.name = "Topic", value.name = "Probability")

# define topics manually that you think are appropriate.

topics <- c("Topic_1", "Topic_2", "Topic_3", 
            
            "Topic_4", "Topic_5", "Topic_6")

audio_topics <- data.table(audio_topics)

audio_topics[, topics := topics[topic]]

### visualize results from lda ----

# define colors

colors <- c("darkturquoise","lightcoral", "khaki4", "green4", "hotpink", "steelblue3")

# produce plot 1: percentage of each topic over all documents

xx <- data.table(topic_doc)

xx[, max_prob := max(Probability), by = doc]

xx[, is_max_prob := Probability == max_prob]

percent <- xx[, list(percent = sum(is_max_prob)/uniqueN(xx$doc)), by = Topic]  # share of documents whose dominant topic is this one

percent[, topics := topics]

ggplot(percent) + geom_bar(aes(x = topics, y = percent, fill = topics), stat = "identity") + 
  ggtitle("Percentage of each Topic") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)) + coord_polar()


# produce plot 2: the six topics and their top 15 keywords

audio_plot <- list()

for (i in 1 : num_topic) {
  
  plot_data <- setorder(setDT(audio_topics), -beta)[, head(.SD, 15), keyby = topic]
  
  plot_data <- plot_data[topic == i, ]  # keep only the i-th topic's top terms
  
  audio_plot[[i]] <- ggplot(plot_data) + 
    geom_bar(aes(y = beta, x = reorder(term, beta)), fill = colors[i],
             stat = "identity") + facet_wrap(~ topics, scale = "free") + 
    ylab("Probability") + xlab("Terms") + theme(axis.title = element_text(size = 8)) +
    coord_flip() + scale_fill_discrete(name="Topics")
  
}


## set up new page

grid.newpage()

nrow <- 2

ncol <- 3

pushViewport(viewport(layout = grid.layout(nrow, ncol)))

## define where each topic's plot goes in the grid layout

loc_col <- function(num) {
  if (num <= ncol) {
    return(num)
  }
  
  else {
    return(num - ncol)
  }
}

loc_row <- function(num) {
  if (num <= ncol) {
    return(1)
  }
  else{
    return(2)
  }
}

## now plot the 6 topics

for (i in 1:num_topic) {
  
  print(audio_plot[[i]], vp = viewport(layout.pos.row = loc_row(i),
                                       layout.pos.col = loc_col(i)))
}


# produce plot 3: distribution of topics in all documents

## store the plots

num_doc_ana <- dim(data)[1]  # number of documents to plot

doc_analysis <- list()

for (i in 1:num_doc_ana) {
  
  plot_data <- data.table()
  
  plot_data <- topic_doc[doc == i, ]
  
  doc_analysis[[i]] <- ggplot(plot_data) + geom_bar(aes(y = Probability, x = topics, fill = topics), stat = "identity") + 
    
    ggtitle(str_c("Distribution of topic in Document ", i)) + 
    
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
}


## render the per-document topic distributions and export them as png files

for (i in 1:num_doc_ana) {
  
  file_name <- str_c("Document_", i, ".png")
  
  png(file = str_c("output_path", "/", file_name))
  
  print(doc_analysis[[i]])
  
  dev.off()
}


Reference

The (incomplete) derivation follows the Wikipedia article on LDA, and the code is adapted from “Text Mining with R”, although my version ends up looking quite different because I use data.table to keep it fast.