Problem inferring topics on new docs using a saved model
Here is a dummied-up script that reproduces what seems to be a bug with inference in dfrtopics:
options(java.parameters="-Xmx6g")
library(dfrtopics)
library(dplyr)
#First create some dummy data for repeatability. Read in Moby Dick from Project Gutenberg.
#Since readLines breaks at the newline char, we'll treat each line as a new "text"
texts <- readLines("http://www.gutenberg.org/files/2701/2701-0.txt")
#Now remove those pesky blanks (logical indexing, so nothing is lost if there happen to be no blank lines)
texts <- texts[texts != ""]
#Grab 2000 random items for training and put into dataframe with proper colnames and some dummied id labels
training_docs <- data_frame(id = paste("Train", 1:2000, sep="_"), text = sample(texts, 2000))
#Now grab another 100 that we'll pretend are new documents for inference later on
inference_docs <- data_frame(id = paste("Test", 1:100, sep="_"), text = sample(texts, 100))
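#Side note on repeatability: the two sample() calls above are unseeded and drawn independently,
#so each run picks different lines, and the "new" docs may overlap with the training docs.
#A minimal sketch of one way to pin that down, assuming a fixed seed and a single shuffle are acceptable here.
#toy_lines, toy_train and toy_test are hypothetical names used only for illustration; nothing below depends on them.
set.seed(1966) #assumption: any fixed seed would do; 1966 just mirrors the model seed used below
toy_lines <- paste("line", 1:50) #hypothetical stand-in for texts
shuffled <- sample(toy_lines) #one shuffle, then split, so the two sets cannot overlap
toy_train <- shuffled[1:40]
toy_test <- shuffled[41:50]
length(intersect(toy_train, toy_test)) #0, the sets are disjoint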
#Make an instance list for the training docs (for the sake of this demo, no stoplist)
training_ilist <- make_instances(training_docs)
#Train a topic model
m <- train_model(training_ilist, n_topics=10, n_iters=100, seed=1966)
#Now write the model to disk so we can load it later. Also write out the instance list, we're going to need it.
write_mallet_model(m, "DEMO_MODEL", save_instances = TRUE)
#Before we can infer the topical makeup of new files, we need a compatible instance list (aka use-pipe-from in mallet)
#For some reason, load_mallet_model_directory does not load the instance file that we saved above with write_mallet_model. I'm not sure why.
#Interestingly, we can build an inferencer from the model before reloading it with load_mallet_model_directory, but not after. In other words, this works correctly:
inf <- inferencer(m)
inf
#But once we reload the model from file, like this
m <- load_mallet_model_directory("DEMO_MODEL") #DEMO_MODEL = local path
#We can't create an inferencer
inf <- inferencer(m)
inf # returns NULL
#Hmm, that's weird. Imagine that we quit R and want to come back another day and load the model and do some inference on some new files. It looks like we cannot do that.
#But maybe there is another route. I saved the instance list, so perhaps I can read it in and then use it with the compatible_instances(docs, instances) function
ilist <- read_instances("DEMO_MODEL/instances.mallet")
inference_ilist <- compatible_instances(inference_docs, ilist)
#Ok, so now we've got a model loaded from disk and a compatible instance list. I should be able to infer topics on new docs. . .
inferred_m <- infer_topics(m, inference_ilist) # Tada!
#But no. . . .
#Error in rJava::.jcall(m, "[D", "getSampledDistribution", inst, n_iterations, :
#  RcallMethod: invalid object parameter
#According to the help file, m can be either a topic inferencer object (from read_inferencer or inferencer) or a mallet_model object. m is of the latter type:
class(m)
#[1] "mallet_model"
#So why the error?
#Let's try another route. Rebuild the same model
m <- train_model(training_ilist, n_topics=10, n_iters=100, seed=1966)
m_inferencer <- inferencer(m)
#Save it to disk
write_inferencer(m_inferencer, "DEMO_MODEL/m_inferencer.mallet")
#Read the inferencer back from the file
inf <- read_inferencer("DEMO_MODEL/m_inferencer.mallet")
test <- infer_topics(inf, inference_ilist)
#Ugh. Same error again . . .
#Error in rJava::.jcall(m, "[D", "getSampledDistribution", inst, n_iterations, :
#  RcallMethod: invalid object parameter
#What now?