LDAvis
createJSON creates Infinite values for loglift
This problem might be due to my lack of understanding of preprocessing in mallet, but it might also be helpful/relevant to others so I'm posting anyway.
In a nutshell, the problem is that LDAvis::createJSON must have values greater than zero in the term.frequency parameter, and it is fairly easy to accidentally include zeros when using mallet for preprocessing. Consider this example:
library(moviereviews)
data(reviews, package = "moviereviews")
reviews <- sapply(reviews, function(x) paste(x, collapse = ""))
library(tm) # just for the stopwords()
library(mallet) # for the model fitting
writeLines(stopwords(), "stopwords.txt")
doc.ids <- as.character(seq_along(reviews))
mallet.instances <- mallet.import(doc.ids, reviews, "stopwords.txt")
topic.model <- MalletLDA(num.topics = 20)
topic.model$loadDocuments(mallet.instances)
word.freqs <- mallet.word.freqs(topic.model)
# Eliminate infrequent words
stopwordz <- as.character(subset(word.freqs, term.freq <= 5)$words)
subset(word.freqs, words == "parillaud")
words term.freq doc.freq
12641 parillaud 6 1
writeLines(c(stopwords(), stopwordz, "s", "t"), "stopwords.txt")
# Re-'initiate' topic model without the infrequent words
mallet.instances <- mallet.import(doc.ids, reviews, "stopwords.txt")
topic.model <- MalletLDA(num.topics = 20)
topic.model$loadDocuments(mallet.instances)
word.freqs <- mallet.word.freqs(topic.model)
subset(word.freqs, term.freq == 0)
words term.freq doc.freq
12641 parillaud 0 0
What seems to be happening is that Mallet automatically throws away "small" documents. This can cause "newly infrequent terms" (in this case 'parillaud' doesn't occur at all) even though we've removed infrequent terms once already. I'm not sure what the best approach is to avoid this situation, but at the very least createJSON should throw a warning if any values in term.frequency are zero.
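In the meantime, here is a rough sketch of a workaround on the user side. It assumes the Infinite loglift comes from dividing a (smoothed, nonzero) topic-term probability by a term's overall proportion, which is zero when term.frequency is zero. The helper name drop_zero_freq_terms is hypothetical and not part of LDAvis or mallet, and the toy phi, vocab, and term.frequency values are made up for illustration.

```r
# Hypothetical helper (not part of LDAvis or mallet): warn about, then drop,
# terms whose corpus-wide frequency is zero.
drop_zero_freq_terms <- function(phi, vocab, term.frequency) {
  stopifnot(ncol(phi) == length(vocab),
            length(vocab) == length(term.frequency))
  zero <- term.frequency <= 0
  if (any(zero)) {
    warning(sum(zero), " term(s) have zero frequency and will be dropped: ",
            paste(head(vocab[zero], 5), collapse = ", "))
  }
  phi <- phi[, !zero, drop = FALSE]
  # Renormalize so each topic's term distribution sums to 1 again
  phi <- phi / rowSums(phi)
  list(phi = phi,
       vocab = vocab[!zero],
       term.frequency = term.frequency[!zero])
}

# Toy data: 2 topics, 3 terms; the third term never occurs in the corpus
phi <- matrix(c(0.5, 0.4, 0.1,
                0.2, 0.7, 0.1), nrow = 2, byrow = TRUE)
vocab <- c("film", "movie", "parillaud")
term.frequency <- c(10, 20, 0)

# Assumed source of the problem: a lift-style ratio divides by the term's
# overall proportion, so a zero frequency yields log(Inf) = Inf
log(phi[, 3] / (term.frequency[3] / sum(term.frequency)))

# suppressWarnings() only keeps this demo quiet; drop it to see the warning
cleaned <- suppressWarnings(drop_zero_freq_terms(phi, vocab, term.frequency))
```

With mallet, the inputs would presumably come from the trained model, e.g. phi from mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE), vocab from topic.model$getVocabulary(), and term.frequency from word.freqs$term.freq; the cleaned pieces can then be passed on to createJSON. Dropping the terms rather than adding a small constant keeps the frequencies honest, at the cost of renormalizing phi.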