Yahoo_LDA
streaming mode broken
I followed the single-machine setup instructions. When I run the streaming mode example and input the same string multiple times, I get a different topic categorization every time:
java Tokenizer | ../learntopics -teststream -dumpprefix=../ut_out/lda --topics=100 --dictionary=../ut_out/lda.dict.dump
W0720 12:28:20.250313 2803 Controller.cpp:115] ----------------------------------------------------------------------
W0720 12:28:20.250712 2803 Controller.cpp:117] Log files are being stored at /lda/ut_out/learnTopics.*
W0720 12:28:20.250731 2803 Controller.cpp:119] ----------------------------------------------------------------------
W0720 12:28:20.251055 2803 Controller.cpp:140] You have chosen single machine testing mode
W0720 12:28:20.251379 2803 Unigram_Model_Streaming_Builder.cpp:56] Initializing global dictionary from ../ut_out/lda.dict.dump
W0720 12:28:20.308131 2803 Unigram_Model_Streaming_Builder.cpp:59] Dictionary initialized and has 17208
W0720 12:28:20.308279 2803 Unigram_Model_Streaming_Builder.cpp:86] Estimating the words that will fit in 2048 MB
W0720 12:28:20.408761 2803 Unigram_Model_Streaming_Builder.cpp:91] 17208 will fit in 1.06012 MB of memory
W0720 12:28:20.408906 2803 Unigram_Model_Streaming_Builder.cpp:93] Initializing Local Dictionary from ../ut_out/lda.dict.dump with 17208 words.
W0720 12:28:20.491570 2803 Unigram_Model_Streaming_Builder.cpp:122] Local Dictionary Initialized. Size: 34416
W0720 12:28:20.494669 2803 Unigram_Model_Streamer.cpp:64] Initializing Word-Topic counts table from dump ../ut_out/lda.ttc.dump using 17208 words & 100 topics.
W0720 12:28:20.549022 2803 Unigram_Model_Streamer.cpp:88] Initialized Word-Topic counts table
W0720 12:28:20.549149 2803 Unigram_Model_Streamer.cpp:91] Initializing Alpha vector from dumpfile ../ut_out/lda.par.dump
W0720 12:28:20.549247 2803 Unigram_Model_Streamer.cpp:94] Alpha vector initialized
W0720 12:28:20.549309 2803 Unigram_Model_Streamer.cpp:97] Initializing Beta Parameter from specified Beta = 0.01
W0720 12:28:20.549383 2803 Unigram_Model_Streamer.cpp:101] Beta param initialized
W0720 12:28:20.557430 2803 Testing_Execution_Strategy.cpp:64] Starting Parallel testing Pipeline

www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,83) (past,86) (months,77) (noticed,15) (guy,93) (surf,35) (magazine,86) (published,92) (finally,49) (run,21) (copyright,62) (surfboards,27) (rights,90) (reserved,59) (june,63) (launches,26) (improved,40) (site,26) (order,72) (custom,36) (surfboards,11) (online,68) (improvements,67) (top,29) (selling,82) (models,30) (middot,62) (rocket,23) (fish,67) (middot,35) (speed,29) (egg,2) (middot,22) (classic,58) (middot,69) (squash,67)

www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,93) (past,56) (months,11) (noticed,42) (guy,29) (surf,73) (magazine,21) (published,19) (finally,84) (run,37) (copyright,98) (surfboards,24) (rights,15) (reserved,70) (june,13) (launches,26) (improved,91) (site,80) (order,56) (custom,73) (surfboards,62) (online,70) (improvements,96) (top,81) (selling,5) (models,25) (middot,84) (rocket,27) (fish,36) (middot,5) (speed,46) (egg,29) (middot,13) (classic,57) (middot,24) (squash,95)

www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,82) (past,45) (months,14) (noticed,67) (guy,34) (surf,64) (magazine,43) (published,50) (finally,87) (run,8) (copyright,76) (surfboards,78) (rights,88) (reserved,84) (june,3) (launches,51) (improved,54) (site,99) (order,32) (custom,60) (surfboards,76) (online,68) (improvements,39) (top,12) (selling,26) (models,86) (middot,94) (rocket,39) (fish,95) (middot,70) (speed,34) (egg,78) (middot,67) (classic,1) (middot,97) (squash,2)

www.sauritchsurfboards.com/ recreation/sports/aquatic_sports watch out jeremy sherwin is here over the past six months you may have noticed this guy in every surf magazine published jeremy is finally getting his run more.. copyright surfboards 2004 all rights reserved june 6 2004 new launches it s new and improved site you can now order custom surfboards online more improvements to come.. top selling models middot rocket fish middot speed egg middot classic middot squash
www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,17) (past,92) (months,52) (noticed,56) (guy,1) (surf,80) (magazine,86) (published,41) (finally,65) (run,89) (copyright,44) (surfboards,19) (rights,40) (reserved,29) (june,31) (launches,17) (improved,97) (site,71) (order,81) (custom,75) (surfboards,9) (online,27) (improvements,67) (top,56) (selling,97) (models,53) (middot,86) (rocket,65) (fish,6) (middot,83) (speed,19) (egg,24) (middot,28) (classic,71) (middot,32) (squash,29)
After digging a little further, I discovered this in Unigram_Model_Streamer::read(google::protobuf::Message& doc):
for (int i = 0; i < wdoc.body_size(); i++) {
    top = rand() % _num_topics;
    wdoc.add_topic_assignment(top);
}
Looks like it just assigns random topics to words. Is streaming mode just not implemented yet?
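To make the concern concrete, here is a minimal standalone sketch of that initialization pattern. It is not Yahoo_LDA code: `random_init` and `seeded_init` are hypothetical names, and `std::mt19937` is swapped in for `rand()` so the seed can be controlled explicitly. Drawing from one shared generator gives each copy of a document a different starting point, while reseeding per document makes the starting point reproducible.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for the quoted loop: draw one topic id per token,
// uniformly from [0, num_topics). Successive calls on the same generator
// continue its sequence, so identical documents get different draws.
std::vector<int> random_init(std::size_t num_tokens, int num_topics,
                             std::mt19937& rng) {
    assert(num_topics > 0);
    std::uniform_int_distribution<int> pick(0, num_topics - 1);
    std::vector<int> topics(num_tokens);
    for (int& t : topics) t = pick(rng);
    return topics;
}

// Assumption for illustration: reseed a fresh generator for every document,
// so the initial assignments depend only on the seed and the token count.
std::vector<int> seeded_init(std::size_t num_tokens, int num_topics,
                             unsigned seed) {
    std::mt19937 rng(seed);
    return random_init(num_tokens, num_topics, rng);
}
```

Calling `seeded_init(36, 100, 42u)` twice returns the same vector, whereas two `random_init` calls on one shared `std::mt19937` generally do not. Whether a reproducible starting point would be enough to make the streaming output stable depends on what happens to the assignments after this loop.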
How many iterations did you run to learn the model? Did you check whether the model looks fine? Streaming will work only if you have a well-trained model.
The random assignments are just initial assignments. They go through variational inference, so the final assignments won't be random. Streaming mode works fine for us.
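As a rough illustration of how random initial assignments can be refined against a fixed trained model, here is a toy sketch. It is an assumption-laden stand-in, not the project's implementation: it uses a Gibbs-style resampling sweep rather than the variational inference mentioned above, and `ToyModel`, `infer_topics`, and all parameters are hypothetical names. The word-topic counts stay fixed, as they would at test time, and only the document's own assignments change.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Toy fixed model: word_topic[w][k] is the training count of word w in
// topic k. All names here are illustrative, not Yahoo_LDA's.
struct ToyModel {
    std::vector<std::vector<double>> word_topic;
    double alpha;  // document-topic smoothing
    double beta;   // topic-word smoothing
};

std::vector<int> infer_topics(const ToyModel& m, const std::vector<int>& words,
                              int sweeps, unsigned seed) {
    const std::size_t V = m.word_topic.size();
    assert(V > 0);
    const std::size_t K = m.word_topic[0].size();
    assert(K > 0);
    std::mt19937 rng(seed);

    // Per-topic totals over the vocabulary; fixed while testing.
    std::vector<double> topic_total(K, 0.0);
    for (const auto& row : m.word_topic)
        for (std::size_t k = 0; k < K; ++k) topic_total[k] += row[k];

    // Random initial assignments, as in the quoted loop.
    std::uniform_int_distribution<int> pick(0, static_cast<int>(K) - 1);
    std::vector<int> z(words.size());
    std::vector<double> doc_topic(K, 0.0);
    for (std::size_t i = 0; i < words.size(); ++i) {
        z[i] = pick(rng);
        doc_topic[z[i]] += 1.0;
    }

    // Resample each token given the fixed model and the rest of the document.
    std::vector<double> weight(K);
    for (int s = 0; s < sweeps; ++s) {
        for (std::size_t i = 0; i < words.size(); ++i) {
            doc_topic[z[i]] -= 1.0;  // leave the current token out
            const auto& row = m.word_topic[words[i]];
            for (std::size_t k = 0; k < K; ++k) {
                weight[k] = (doc_topic[k] + m.alpha) * (row[k] + m.beta) /
                            (topic_total[k] + V * m.beta);
            }
            std::discrete_distribution<int> draw(weight.begin(), weight.end());
            z[i] = draw(rng);
            doc_topic[z[i]] += 1.0;
        }
    }
    return z;
}
```

With a sharply peaked toy model, for example two words that occur only in topic 0 and one that occurs only in topic 1, a document made of the first two words ends up assigned to topic 0 regardless of where it started. With a flat or poorly trained model the per-token draws remain noisy, and because the final step is still a random draw, individual assignments can differ between runs unless the seed is fixed.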
@shravanmn, the apparently random streaming results happen with the example test set the project provides (Yahoo_LDA/docs/html/single__machine__usage.html), trained with 500 iterations. Does the example not produce a good model with 500 iterations and 100 topics? Even learning topics with 1000 iterations doesn't appear to help on the sample set. Can you confirm? When I run it in batch mode, it consistently produces the same classifications. I will try to debug and understand how the two execution paths differ, but it would be great if anyone could provide some insight and save me some time.
Thank you!