createJSON creates Infinite values for loglift

Open cpsievert opened this issue 10 years ago • 0 comments

This problem might be due to my lack of understanding of preprocessing in mallet, but it might also be helpful/relevant to others so I'm posting anyway.

In a nutshell, the problem is that LDAvis::createJSON needs every value in its term.frequency argument to be greater than zero, and it is fairly easy to include zeros by accident when using mallet for preprocessing. Consider this example:

library(moviereviews)
data(reviews, package = "moviereviews")
reviews <- sapply(reviews, function(x) paste(x, collapse = ""))
library(tm) # just for the stopwords()
library(mallet) # for the model fitting
writeLines(stopwords(), "stopwords.txt")
doc.ids <- as.character(seq_along(reviews))
mallet.instances <- mallet.import(doc.ids, reviews, "stopwords.txt")
topic.model <- MalletLDA(num.topics = 20)
topic.model$loadDocuments(mallet.instances)
word.freqs <- mallet.word.freqs(topic.model)
# Eliminate infrequent words
stopwordz <- as.character(subset(word.freqs, term.freq <= 5)$words)
# 'parillaud' occurs 6 times, so it survives the term.freq <= 5 filter
subset(word.freqs, words == "parillaud")

          words term.freq doc.freq
12641 parillaud         6        1

writeLines(c(stopwords(), stopwordz, "s", "t"), "stopwords.txt")
# Re-'initiate' topic model without the infrequent words
mallet.instances <- mallet.import(doc.ids, reviews, "stopwords.txt")
topic.model <- MalletLDA(num.topics = 20)
topic.model$loadDocuments(mallet.instances)
word.freqs <- mallet.word.freqs(topic.model)
subset(word.freqs, term.freq == 0)

          words term.freq doc.freq
12641 parillaud         0        0

What seems to be happening is that Mallet automatically throws away "small" documents. This can create "newly infrequent terms" even though we've already removed infrequent terms once: in this case 'parillaud' goes from 6 occurrences to none at all. A zero in term.frequency is presumably what leads to the infinite loglift values, since lift divides by a term's overall proportion. I'm not sure what the best approach is to avoid this situation, but at the very least createJSON should throw a warning if any values in term.frequency are zero.
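
A minimal sketch of the kind of check I have in mind, on toy data rather than the mallet output. It assumes loglift is computed as log(phi / p(w)), with p(w) proportional to term.frequency. The helper drop_zero_freq_terms is hypothetical and not part of LDAvis.

```r
# Hypothetical helper (not part of LDAvis): warn about, then drop, terms whose
# frequency is zero so that the ratio phi / p(w) stays finite.
drop_zero_freq_terms <- function(phi, vocab, term.frequency) {
  stopifnot(ncol(phi) == length(vocab),
            length(vocab) == length(term.frequency))
  zero <- term.frequency == 0
  if (any(zero)) {
    warning(sum(zero), " term(s) have zero frequency and will be dropped: ",
            paste(vocab[zero], collapse = ", "))
  }
  keep <- !zero
  phi <- phi[, keep, drop = FALSE]
  # Renormalize so each topic's term distribution still sums to one
  phi <- phi / rowSums(phi)
  list(phi = phi, vocab = vocab[keep], term.frequency = term.frequency[keep])
}

# Toy data (made up): two topics, three terms; the third term no longer occurs
phi <- matrix(c(0.5, 0.4, 0.1,
                0.2, 0.7, 0.1), nrow = 2, byrow = TRUE)
vocab <- c("film", "movie", "parillaud")
term.frequency <- c(10, 15, 0)

# With a zero frequency, p(w) is zero and the third log lift is infinite
p.w <- term.frequency / sum(term.frequency)
log(phi[1, ] / p.w)

# Emits a warning naming 'parillaud', then returns the reduced inputs
cleaned <- drop_zero_freq_terms(phi, vocab, term.frequency)
```

The cleaned phi, vocab and term.frequency could then be passed to createJSON together with theta and doc.length. Dropping and renormalizing is only one option; failing loudly instead of dropping might be the safer default.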

cpsievert, Jun 24 '14 22:06