kneser-ney icon indicating copy to clipboard operation
kneser-ney copied to clipboard

kgrams_counts bug

Open wzy816 opened this issue 6 years ago • 1 comments

Maybe it is trivial and I am wrong.

From the paper I think the count of a k-gram "word" is its occurrence in the corpus data not in its higher-order gram types. If this is the case, https://github.com/smilli/kneser-ney/blob/2740fba0aff5a6fba655a511c19590cb2c95af13/kneser_ney.py#L58 should be changed to

new_order[suffix] += last_order[ngram]

But even this is troublesome. For example, ('?', '', '') in the last_order will add its suffix ('', '') to the new_order. But I think two pad symbol is not valid in a bigram model.

Therefore, I think a better way to do kgram count is to do each order independently and directly from corpus data. And in the class KneserNeyLM definition, using highest_order gram ngrams as arg and in the example.py usinggut_ngrams need to be revised as well.

wzy816 avatar May 03 '19 09:05 wzy816

It is the adjusted count or unique prefix count for low order here. I think +1 is correct here.

robinn37 avatar Jun 26 '19 09:06 robinn37