kenlm Computing perplexity on different sized text corpuses

Open tomassykora opened this issue 5 years ago • 1 comments

Hello, I'd like to compute perplexity on several text corpuses of different sizes, using an n-gram model built with kenlm. Some old issues suggest passing the --vocab_pad parameter with a large value in similar situations, but I'm not sure I understood that correctly or that it applies to my case.

Can I just build the model with lmplz using this option and then run query with that model on each text corpus, or is something else needed? Currently it seems that the bigger the corpus, the bigger the perplexity, which makes me wonder whether the normalization by corpus size is done correctly.
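To make the normalization question concrete, here is a minimal sketch of the textbook per-token perplexity formula, computed from log10 token probabilities. It is not taken from the kenlm source; the function name `perplexity` and the score lists are hypothetical, chosen only for illustration.

```python
def perplexity(log10_probs):
    """Per-token perplexity from a list of log10 token probabilities."""
    if not log10_probs:
        raise ValueError("need at least one token score")
    # Dividing by the token count is the corpus-size normalization.
    avg_log10 = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-avg_log10)


# Hypothetical scores: every token has probability 0.01 (log10 = -2).
small_corpus = [-2.0] * 100
large_corpus = [-2.0] * 10_000

print(perplexity(small_corpus))
print(perplexity(large_corpus))
```

Under this definition, two corpora with the same average token probability get the same perplexity regardless of length. So if perplexity grows with corpus size, one possible explanation is that the larger corpora are genuinely harder for the model (for example, more out-of-vocabulary words) rather than that the length normalization is missing.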

Oct 05 '20 14:10 tomassykora

Hi, I have the same issue, any help?

Mar 24 '22 13:03 rnajim
