kenlm
Computing perplexity on text corpora of different sizes
Hello, I'd like to compute perplexity on different text corpora using an n-gram model built with kenlm. I found in some old issues that the --vocab_pad parameter should be set to a large number in similar situations, but I'm not sure I understood that correctly or that it applies to my case.
Can I just build the n-gram model with lmplz using this option and then run query with that model on each text corpus, or does something else need to be done? Currently it seems that the bigger the corpus, the higher the perplexity, which makes me wonder whether the normalization by corpus size is done correctly.
Hi, I have the same issue, any help?
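For anyone checking the normalization by hand, here is a minimal sketch of token-normalized corpus perplexity. It assumes a scorer that returns the total log10 probability of a sentence including the end-of-sentence token; with the kenlm Python bindings that would presumably be `kenlm.Model(path).score(sentence)`, but the `uniform_log10_score` function below is only a hypothetical stand-in so the example runs without a model. Under these assumptions the perplexity does not depend on how many sentences the corpus has, so a perplexity that grows with corpus size more likely points to a difference in content (for example more out-of-vocabulary words) than to the normalization itself.

```python
import math
from typing import Callable, Iterable


def corpus_perplexity(sentences: Iterable[str],
                      log10_score: Callable[[str], float]) -> float:
    """Perplexity over a corpus, normalized by the total token count.

    Each sentence contributes len(words) + 1 tokens, the extra one being
    the end-of-sentence token that the scorer is assumed to include.
    """
    total_log10 = 0.0
    total_tokens = 0
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # skip empty lines
        total_log10 += log10_score(sentence)
        total_tokens += len(words) + 1
    if total_tokens == 0:
        raise ValueError("corpus contains no tokens")
    return 10.0 ** (-total_log10 / total_tokens)


def uniform_log10_score(sentence: str, vocab_size: int = 100) -> float:
    """Hypothetical stand-in scorer: every token has probability 1/vocab_size."""
    n_tokens = len(sentence.split()) + 1
    return -n_tokens * math.log10(vocab_size)


small_corpus = ["a b c"]
large_corpus = ["a b c"] * 1000

ppl_small = corpus_perplexity(small_corpus, uniform_log10_score)
ppl_large = corpus_perplexity(large_corpus, uniform_log10_score)
print(ppl_small, ppl_large)
```

Both corpora give the same value here because the summed log probability and the token count grow together; only the per-token average enters the result.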