warplda
Performance on the NYTimes dataset
Hi, WarpLDA is very cool! I have some trouble evaluating its performance and need your help.
I got a log_likelihood (per token) of -10.426584 for the NYTimes dataset after 100 iterations with mh=2. Since the dataset has 99,542,125 tokens in total (about 100M), the total log_likelihood should be about -1.04e+9. But this result is inconsistent with Fig. 5, row 1, where the log_likelihood at iteration 100 is larger than -1e+9.
BTW: can the code run in distributed mode?
Hi kyhhdm,
The log likelihood is reported per token, i.e., it is divided by the number of tokens.
This open-source version cannot run in distributed mode. We do have a (preliminary) distributed LDA codebase; please check https://github.com/thu-ml/BigTopicModel
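To make the per-token convention concrete, here is a minimal sketch using the numbers from the question above (the variable names are illustrative, not WarpLDA's):

```python
# Convert WarpLDA's reported per-token log-likelihood back to a total
# log-likelihood by multiplying by the token count.
per_token_ll = -10.426584   # value printed by WarpLDA (per token)
num_tokens = 99542125       # total tokens in the NYTimes dataset

total_ll = per_token_ll * num_tokens
print(f"total log-likelihood = {total_ll:.3e}")  # prints -1.038e+09
```

So the reported -10.426584 and a total of about -1.04e+9 are the same quantity in two different units, which is consistent with the figure.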
Hello, I am trying to run some comparisons of running time, log-likelihood, etc. against Petuum (PMLS), for example.
I have noticed WarpLDA can output both log-likelihood (per iteration) and perplexity (with the -perplexity flag), but there is no direct way to get word_loglikelihood, doc_loglikelihood, or total_loglikelihood without adding code. Quickly looking through the code, I found that the definition of perplexity used is perplexity = e^(-L/NE), where L is the total log-likelihood and NE is the total number of tokens.
Could you help me clarify these values? Should loglikelihood (per token) * number_of_tokens equal -ln(perplexity) * number_of_tokens? If not, is the log-likelihood (per token) in each iteration just an estimate, or is it the total combined log-likelihood? And is the log-likelihood used for perplexity only the model log-likelihood, or the full total log-likelihood?
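To make the relation I am asking about concrete, here is what I am assuming in code. If perplexity really is defined as e^(-L/NE), then per-token log-likelihood and -ln(perplexity) must be the same number (all names here are illustrative, not from WarpLDA's source):

```python
import math

# Assumed definition from reading the code:
#   perplexity = exp(-L / NE)
# where L is the total log-likelihood and NE the total token count.
L = -1.0379e9      # example total log-likelihood
NE = 99542125      # total number of tokens

perplexity = math.exp(-L / NE)
per_token_ll = L / NE

# Under this definition the two reported quantities coincide:
#   per_token_ll == -ln(perplexity)
assert abs(per_token_ll + math.log(perplexity)) < 1e-9
```

If that identity does not hold for WarpLDA's actual output, then the two numbers must be computed from different likelihoods (e.g. model-only vs. total), which is exactly what I would like confirmed.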
Many thanks!
Any updates on @mromaios's question? Could you please clarify these values and their relation: log-likelihood (per token) and perplexity? And how can we get the corresponding per-document log-likelihood in both the train and test cases?
Thanks in advance!