
Problem inferring topics on new docs using a saved model

Open mjockers opened this issue 8 years ago • 0 comments

Here is a dummied up script to test what seems to be a bug with inference in dfrtopics

options(java.parameters="-Xmx6g")
library(dfrtopics)
library(dplyr)

#First, create some dummy data for repeatability. Read in Moby Dick from Project Gutenberg. Since readLines() splits at the newline character, we'll treat each line as a new "text"

texts <- text_of_file <- readLines("http://www.gutenberg.org/files/2701/2701-0.txt")

#Now remove those pesky blanks

texts <- texts[-which(texts == "")]
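#A side note on dropping blanks: negative indexing with which() and a logical mask give the same result when blanks exist (as they do in this file), but they differ when nothing matches. A minimal base R sketch on a toy vector (the vector `x` is made up for illustration):

```r
# Toy vector with no empty strings at all (hypothetical data).
x <- c("Call me Ishmael.", "Some years ago")

# which() returns integer(0) here, and x[-integer(0)] selects nothing.
dropped_by_which <- x[-which(x == "")]

# A logical mask keeps every element when there is nothing to drop.
kept_by_mask <- x[x != ""]

length(dropped_by_which)
length(kept_by_mask)
```

#So the logical-mask form is the safer habit if the same script is later pointed at a file without blank lines.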

#Grab 2000 random items for training and put into dataframe with proper colnames and some dummied id labels

training_docs <- data_frame(id = paste("Train", 1:2000, sep="_"), text = sample(texts, 2000))

#Now grab another 100 that we'll pretend are new documents for inference later on

inference_docs <- data_frame(id = paste("Test", 1:100, sep="_"), text = sample(texts, 100))
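#One caveat for anyone rerunning this: the two sample() calls are unseeded, so each run draws different lines, and the two draws are independent, so a "new" document can also appear in the training set. A minimal sketch of a seeded, non-overlapping split, using a stand-in vector `pool` in place of `texts` and reusing 1966 as an arbitrary seed (both are assumptions for illustration):

```r
set.seed(1966)                          # arbitrary seed, assumed for this sketch
pool <- paste("line", seq_len(5000))    # stand-in for `texts`

# Draw 2100 distinct positions once, then split them 2000 / 100.
idx <- sample(length(pool), 2100)
train_text <- pool[idx[1:2000]]
test_text  <- pool[idx[2001:2100]]

# No position is shared between the two sets.
length(intersect(idx[1:2000], idx[2001:2100]))
```

#Note that this only guarantees distinct line positions; if the real text contains identical lines at different positions, the strings themselves could still repeat.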

#Make an instance list for the training docs (for the sake of this demo, no stoplist)

training_ilist <- make_instances(training_docs)

#Train a topic model

m <- train_model(training_ilist, n_topics=10, n_iters=100, seed=1966)

#Now write the model to disk so we can load it later. Also write out the instance list; we're going to need it.

write_mallet_model(m, "DEMO_MODEL", save_instances = TRUE)

#Before we can infer the topical makeup of new files, we need a compatible instance list (the equivalent of --use-pipe-from in MALLET)

#For some reason, load_mallet_model_directory does not load the instance file that we saved above as part of the write_mallet_model . . . I'm not sure why?

#Interestingly, we can build an inferencer from the model before reloading it with load_mallet_model_directory, but not after reloading it. In other words, this works correctly:

inf <- inferencer(m)
inf

#But once we reload the model from file, like this

m <- load_mallet_model_directory("DEMO_MODEL") #DEMO_MODEL = local path

#We can't create an inferencer
inf <- inferencer(m)
inf # returns NULL

#Hmm, that's weird. Imagine that we quit R and want to come back another day and load the model and do some inference on some new files. It looks like we cannot do that.

#But maybe there is another route. I saved the instance list, so perhaps I can read it in and then use it in conjunction with the compatible_instances(docs, instances) function

ilist <- read_instances("DEMO_MODEL/instances.mallet")
inference_ilist <- compatible_instances(inference_docs, ilist)

#Ok, so now we've got a model loaded from disk and a compatible instance list. I should be able to infer topics on new docs. . .

inferred_m <- infer_topics(m, inference_ilist) # Tada!

#But no. . . .

#Error in rJava::.jcall(m, "[D", "getSampledDistribution", inst, n_iterations, :
#  RcallMethod: invalid object parameter

#According to the help file, m can be either a topic inferencer object (from read_inferencer or inferencer) or a mallet_model object. m is of the latter type:

class(m)
#[1] "mallet_model"

#So why the error?
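#One way to narrow this down: the error text comes from rJava, and as far as I understand it tends to appear when the first argument to .jcall is not a usable Java reference. A small base R helper (the name `describe_ref` is hypothetical, not part of dfrtopics or rJava) can report what an object actually is before it gets passed along. It assumes rJava references carry the class "jobjRef":

```r
# Hypothetical helper: summarize what kind of object we are holding.
describe_ref <- function(x) {
  list(
    is_null     = is.null(x),               # e.g. what inferencer(m) gave after reloading
    classes     = class(x),                 # R-level class vector
    is_java_ref = inherits(x, "jobjRef")    # assumption: rJava references have class "jobjRef"
  )
}

# Example on a NULL, the value reported for inferencer(m) after reloading.
describe_ref(NULL)
```

#Calling describe_ref() on inferencer(m), on the result of read_inferencer(), and on inference_ilist in the failing session should show which of them is NULL or not a Java reference. This is only a diagnostic sketch, not a fix.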

#Let's try another route. Rebuild the same model and create an inferencer from it

m <- train_model(training_ilist, n_topics=10, n_iters=100, seed=1966)
m_inferencer <- inferencer(m)

#Save it to disk

write_inferencer(m_inferencer, "DEMO_MODEL/m_inferencer.mallet")

#Read the inferencer back from the file

inf <- read_inferencer("DEMO_MODEL/m_inferencer.mallet")
test <- infer_topics(inf, inference_ilist)

#Ugh. Same error again . . .
#Error in rJava::.jcall(m, "[D", "getSampledDistribution", inst, n_iterations, :
#  RcallMethod: invalid object parameter

#What now?

mjockers, Dec 04 '17 21:12