
Why is the sentence count also added when calculating perplexity?

Open · x-ji opened this issue 6 years ago · 1 comment

I noticed that, in eval.py, the perplexity is calculated in the following way:

ppl = math.exp(-ll/(n_sentences + n_words - n_oovs))

However, in the book Speech and Language Processing (https://web.stanford.edu/~jurafsky/slp3/4.pdf), perplexity is defined as:

The perplexity of a language model on a test set is the inverse probability of the test set, normalized by the number of words.
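
In symbols, for a test set W = w_1 w_2 ... w_N, that definition reads:

\mathrm{PP}(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N}\log P(w_1 w_2 \dots w_N)\right)

i.e. the exponential of the negative log-likelihood divided by N, the number of words, and nothing else.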

I'm not sure why the sentence count is added in addition to the number of words.
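
To make my reading concrete, this is roughly the loop I would have expected (a sketch only; model.prob, the vocabulary argument, and the loop structure are my assumptions, not vpyp's actual code):

import math

def expected_ppl(model, vocabulary, test_sentences):
    ll, n_words, n_oovs = 0.0, 0, 0
    for sentence in test_sentences:
        for word in sentence:
            n_words += 1
            if word not in vocabulary:
                n_oovs += 1  # OOV token: skipped, contributes nothing to ll
                continue
            ll += math.log(model.prob(word))
    # normalize by the number of events that were actually scored
    return math.exp(-ll / (n_words - n_oovs))

whereas eval.py additionally adds n_sentences to that denominator.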

Also, I believe the normal treatment of OOVs is to map them all to an <UNK> token and train the n-gram model on that token as well. Is this just a simplification in vpyp?
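
By "the normal treatment" I mean a preprocessing pass along these lines (the function name and min_count threshold are just illustrative):

from collections import Counter

def replace_rare_words(sentences, min_count=2):
    # Count every token, then map rare ones to <UNK> so that the
    # n-gram model learns a probability for <UNK> itself.
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[word if counts[word] >= min_count else '<UNK>' for word in sentence]
            for sentence in sentences]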

Thank you.

x-ji avatar Jun 01 '18 18:06 x-ji

OK, I can understand the n_oovs part: normally one would already have substituted low-frequency words with <UNK> in the preprocessing step, and vpyp is not responsible for that. The n_sentences part, though, I still haven't grasped...
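
The only explanation I can come up with (purely a guess on my part; I have not checked vpyp's internals): if the model also assigns a probability to an end-of-sentence token </s> after the last word of every sentence, then ll covers one extra event per sentence, and the denominator counts scored events rather than words:

ppl = math.exp(-ll / ((n_words - n_oovs) + n_sentences))

which is just the formula from eval.py with the terms regrouped. Can anyone confirm whether that is what happens?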

x-ji avatar Jun 03 '18 20:06 x-ji