Yahoo_LDA
Batch vs. Streaming Classification - "Using the Model"
Hello,
I wasn't sure which was the best forum to post this issue/question to: the Yahoo group or here. Issues seem to get more activity than the group, so I've cross-posted: http://tech.groups.yahoo.com/group/y_lda/message/15
I'm a total newbie to LDA, so please forgive me if I don't quite formulate this question concisely.
The single-machine instructions for "Using the Model" (/Yahoo_LDA/docs/html/single__machine__usage.html#using_model) indicate that you can run in either batch OR streaming mode.
In batch mode, the output is several files: lda.docToTop.txt, lda.topToWor.txt, and lda.worToTop.txt.
lda.docToTop.txt is what I want: document-topic assignments, e.g.
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (65,0.138889) (54,0.111111) (9,0.0833333) (21,0.0833333) (27,0.0833333) (87,0.0833333) (29,0.0555556) (52,0.0555556) (56,0.0555556) (72,0.0555556)
However, streaming mode seems to return per-document word-topic assignments, similar to batch mode's lda.worToTop.txt, e.g.
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,87) (past,87) (months,72) (noticed,21) (guy,52) (surf,27) (magazine,87) (published,10) (finally,21) (run,21) (copyright,54) (surfboards,27) (rights,54) (reserved,54) (june,72) (launches,73) (improved,9) (site,54) (order,73) (custom,56) (surfboards,27) (online,52) (improvements,9) (top,9) (selling,6) (models,29) (middot,65) (rocket,44) (fish,56) (middot,65) (speed,65) (egg,95) (middot,65) (classic,29) (middot,65) (squash,55)
Can I make streaming mode return doc - topic assignments?
If not, can I easily compute the doc-topic assignments from the per-document word-topic output?
I would like to call the streaming mode from a Java process.
Please help. :)
Thanks! -John
I found the logic in the batch mode that reports doc-topic: void Unigram_Model_Training_Builder::create_output()
Basically, the doc-topic scores are computed from the word-topic assignments: each topic's score is the number of words in the document assigned to that topic divided by the total number of words in the document, i.e. topicCount / totalNumWordsInDoc
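For anyone who would rather not patch the C++, here is a rough Python sketch of the same ratio applied to a streaming-mode output line. It is not the Yahoo_LDA code: the function name doc_topic_scores, the regex, the top_n cutoff of 10, and the tie-breaking by ascending topic id are all my own assumptions, chosen only because they match the sample lines above.

```python
import re
from collections import Counter

# Matches one "(word,topic)" pair from a streaming-mode output line.
# The pattern is an assumption based on the sample output, not a documented format.
PAIR_RE = re.compile(r"\(([^,()]+),(\d+)\)")


def doc_topic_scores(line, top_n=10):
    """Return [(topic, score), ...] where score = topicCount / totalNumWordsInDoc."""
    topics = [int(topic) for _word, topic in PAIR_RE.findall(line)]
    total = len(topics)
    if total == 0:
        return []
    counts = Counter(topics)
    # Highest count first; ties broken by ascending topic id (assumed ordering).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(topic, count / total) for topic, count in ranked[:top_n]]


# Streaming-mode sample line from this post.
sample = (
    "www.sauritchsurfboards.com/ recreation/sports/aquatic_sports "
    "(watch,87) (past,87) (months,72) (noticed,21) (guy,52) (surf,27) "
    "(magazine,87) (published,10) (finally,21) (run,21) (copyright,54) "
    "(surfboards,27) (rights,54) (reserved,54) (june,72) (launches,73) "
    "(improved,9) (site,54) (order,73) (custom,56) (surfboards,27) "
    "(online,52) (improvements,9) (top,9) (selling,6) (models,29) "
    "(middot,65) (rocket,44) (fish,56) (middot,65) (speed,65) (egg,95) "
    "(middot,65) (classic,29) (middot,65) (squash,55)"
)

if __name__ == "__main__":
    scores = doc_topic_scores(sample)
    print(" ".join(f"({topic},{score:.6g})" for topic, score in scores))
```

On the sample line, topic 65 is assigned to 5 of the 36 words (5/36 ≈ 0.138889) and topic 54 to 4 of them (4/36 ≈ 0.111111), which agrees with the first two entries of the batch-mode lda.docToTop.txt line above.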
The logic responsible for returning results in the stream mode is void Unigram_Model_Streamer::write(void* token)
I added the logic from create_output() to the streamer::write() method and now it returns [doc-topic,score] [doc-topic,score] ... || (word,topic) (word,topic) ...
e.g. www.sauritchsurfboards.com/ recreation/sports/aquatic_sports www.sauritchsurfboards.com/ recreation/sports/aquatic_sports [3,0.0555556] [9,0.0555556] [12,0.0555556] [40,0.0555556] [78,0.0555556] [33,0.0555556] [58,0.0277778] [60,0.0277778] [65,0.0277778] [67,0.0277778] || (watch,7) (past,49) (months,73) (noticed,58) (guy,30) (surf,72) (magazine,44) (published,78) (finally,23) (run,9) (copyright,40) (surfboards,65) (rights,92) (reserved,42) (june,87) (launches,3) (improved,27) (site,29) (order,40) (custom,12) (surfboards,3) (online,69) (improvements,9) (top,57) (selling,60) (models,33) (middot,99) (rocket,78) (fish,16) (middot,35) (speed,97) (egg,26) (middot,12) (classic,67) (middot,10) (squash,33)
Does anyone see an error in this logic?
If anyone is interested, I can post the code.