gensim Exploding Perplexity for big number of topics

Open snollygoster123123 opened this issue 6 years ago • 5 comments

I am training LDA on a set of ~17,500 documents. Up to 230 topics it works perfectly fine, but for anything above that the perplexity score explodes.

(Perplexity was calculated as 2 ** (-1.0 * lda_model.log_perplexity(corpus)), which results in 234599399490.052. Usually my perplexity is around 70-150.)
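For readers following along, here is a minimal sketch of the conversion described above, written in plain Python so it runs without gensim. The helper names `perplexity_from_bound` and `bound_from_perplexity` are made up for illustration; only `log_perplexity` (shown in a comment) comes from the post.

```python
import math

def perplexity_from_bound(per_word_bound):
    """Apply the formula from the post: perplexity = 2 ** (-bound)."""
    return 2.0 ** (-per_word_bound)

def bound_from_perplexity(perplexity):
    """Invert the formula to recover the per-word bound."""
    return -math.log2(perplexity)

# With a trained gensim model this would be (not run here):
# perplexity = perplexity_from_bound(lda_model.log_perplexity(corpus))

# A bound of -6.5 gives a perplexity inside the "usual" 70-150 range.
typical = perplexity_from_bound(-6.5)

# The exploded value from the post corresponds to a bound of roughly -37.8.
exploded_bound = bound_from_perplexity(234599399490.052)
```

Because the bound sits in the exponent, a drop from about -6.5 to about -37.8 per word is enough to move the perplexity from the low hundreds to the hundreds of billions.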

For the perplexity, I am not using a held-out set. It is calculated on the same corpus that was used for training. I can upload the LDA model + corpus in the next few minutes.
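If someone wants to compare against a held-out evaluation instead, a split could look roughly like the sketch below. It assumes the corpus fits in memory as a list of bag-of-words documents; `split_corpus`, `holdout_fraction` and `seed` are hypothetical names, not part of gensim.

```python
import random

def split_corpus(corpus, holdout_fraction=0.1, seed=0):
    """Shuffle the documents and return (train_docs, heldout_docs)."""
    docs = list(corpus)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(docs)
    n_holdout = max(1, int(len(docs) * holdout_fraction))
    return docs[n_holdout:], docs[:n_holdout]

# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [[(i, 1), (i + 1, 2)] for i in range(20)]
train_docs, heldout_docs = split_corpus(toy_corpus, holdout_fraction=0.1)

# The model would then be trained on train_docs and scored with
# lda_model.log_perplexity(heldout_docs) (not run here).
```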

Apr 11 '19 13:04 snollygoster123123

@snollygoster123123 I remember other users reporting similar issues, since we switched the LDA default precision from double (float64) to single (float32) in #1656.

Can you try this https://github.com/RaRe-Technologies/gensim/issues/217#issuecomment-435539481 and let me know if that helped in any way?

Also, make sure to include all the necessary info here from the issue template: software versions, steps to reproduce, etc.

Apr 11 '19 15:04 piskvorky

Just a note: I also got a very large perplexity value with gensim==3.7.1 (even bigger than @snollygoster123123's) when training on a very large corpus (13.5M documents, 850k-term dictionary, 0.018% density), but:

I checked the topics manually and they look fine.
Upstream models, based on LDA vectors, work fine too.
I used float64 for training (to avoid potential numerical overflow/underflow issues).
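For reference, a sketch of how float64 training can be requested through the `dtype` parameter of `LdaModel`. The wrapper name `train_lda_float64` is hypothetical, and, as the comment above notes, float64 alone did not prevent the large perplexity in that case.

```python
def train_lda_float64(corpus, id2word, num_topics):
    """Train an LdaModel in double precision instead of the float32 default."""
    # Imports are kept local so this sketch can be loaded without gensim installed.
    import numpy as np
    from gensim.models import LdaModel

    return LdaModel(
        corpus=corpus,
        id2word=id2word,
        num_topics=num_topics,
        dtype=np.float64,  # the default has been np.float32 since #1656
    )
```

Usage would be something like `lda_model = train_lda_float64(corpus, dictionary, 250)`, where `corpus` and `dictionary` are your own bag-of-words corpus and gensim `Dictionary`.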

Apr 16 '19 17:04 menshikh-iv

Ping @snollygoster123123 @menshikh-iv are you able to provide a reproducible example? We'll have a look.

Oct 08 '19 08:10 piskvorky

@piskvorky no, I can't (for NDA reasons), sorry. I guess you can try to reproduce it with any large corpus (with stats similar to those in my previous message).

Oct 08 '19 08:10 menshikh-iv

I'm having this problem with both sklearn and gensim.

Mar 15 '22 20:03 Alexander-philip-sage
