kgrams_counts bug
Maybe it is trivial and I am wrong.
From the paper I think the count of a k-gram "word" is its occurrence in the corpus data not in its higher-order gram types. If this is the case, https://github.com/smilli/kneser-ney/blob/2740fba0aff5a6fba655a511c19590cb2c95af13/kneser_ney.py#L58 should be changed to
new_order[suffix] += last_order[ngram]
But even this is troublesome. For example, ('?', '', '') in the last_order will add its suffix ('', '') to the new_order. But I think two pad symbol is not valid in a bigram model.
Therefore, I think a better way to do kgram count is to do each order independently and directly from corpus data. And in the class KneserNeyLM definition, using highest_order gram ngrams as arg and in the example.py usinggut_ngrams need to be revised as well.
It is the adjusted count or unique prefix count for low order here. I think +1 is correct here.