kneser-ney ValueError: math domain error

I replaced pad_symbol with left_pad_symbol and right_pad_symbol in the ngrams call, and added start_pad_symbol in KneserNeyLM, but there is still another error. It looks like the log function is being called with a negative value, but why would the value be negative?

code:

from nltk.corpus import gutenberg
from nltk.util import ngrams
from kneser_ney import KneserNeyLM

gut_ngrams = (
    ngram for sent in gutenberg.sents() for ngram in ngrams(sent, 3,
    pad_left=True, pad_right=True, right_pad_symbol='<s>', left_pad_symbol='<s>'))
lm = KneserNeyLM(3, gut_ngrams, start_pad_symbol='<s>', end_pad_symbol='<s>')
lm.score_sent(('This', 'is', 'a', 'sample', 'sentence', '.'))
lm.generate_sentence()

ValueError                                Traceback (most recent call last)
in ()
      2     ngram for sent in gutenberg.sents() for ngram in ngrams(sent, 3,
      3     pad_left=True, pad_right=True, right_pad_symbol='<s>',left_pad_symbol='<s>'))
----> 4 lm = KneserNeyLM(3, gut_ngrams,start_pad_symbol='<s>', end_pad_symbol='<s>')
      5 lm.score_sent(('This', 'is', 'a', 'sample', 'sentence', '.'))
      6 lm.generate_sentence()

in __init__(self, highest_order, ngrams, start_pad_symbol, end_pad_symbol)
     21         self.start_pad_symbol = start_pad_symbol
     22         self.end_pad_symbol = end_pad_symbol
---> 23         self.lm = self.train(ngrams)
     24
     25     def train(self, ngrams):

in train(self, ngrams)
     30         """
     31         kgram_counts = self._calc_adj_counts(Counter(ngrams))
---> 32         probs = self._calc_probs(kgram_counts)
     33         return probs
     34

in _calc_probs(self, orders)
     62         backoffs = []
     63         for order in orders[:-1]:
---> 64             backoff = self._calc_order_backoff_probs(order)
     65             backoffs.append(backoff)
     66         orders[-1] = self._calc_unigram_probs(orders[-1])

in _calc_order_backoff_probs(self, order)
     89         for key in order.keys():
     90             prefix = key[:-1]
---> 91             order[key] = math.log(order[key]/prefix_sums[prefix])
     92         for prefix in backoffs.keys():
     93             backoffs[prefix] = math.log(backoffs[prefix]/prefix_sums[prefix])

ValueError: math domain error
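
On the question of whether the value has to be negative: it does not. As a minimal sketch (the helper name try_log is made up for illustration), math.log rejects zero as well as negative inputs, so a count that has been reduced to exactly 0 is enough to trigger this error:

```python
import math

def try_log(x):
    """Return math.log(x), or a short description of the error it raises."""
    try:
        return math.log(x)
    except ValueError as exc:
        # Raised for any non-positive input, zero included.
        return f"ValueError: {exc}"

for value in (1.0, 0.0, -1.0):
    print(value, "->", try_log(value))
```

The exact wording of the error message depends on the Python version, so only the exception type is relied on here.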

Jan 11 '17 02:01 wateryouyou

I have the same problem. Has this problem been solved?

Nov 17 '17 15:11 zparcheta

This error can occur when no n-gram in the input appears exactly 3 times. When that happens, discount[2] = 2, so any n-gram with a count of 2 is discounted to zero probability, and the later math.log call fails.
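
To see where discount[2] = 2 comes from, here is a minimal sketch. It assumes that _calc_discounts follows the standard modified Kneser-Ney formulas, Y = n1 / (n1 + 2*n2) and D_i = i - (i + 1) * Y * n_(i+1) / n_i, where n_i is the number of n-grams seen exactly i times. The function name modified_kn_discounts and the toy counts are made up for illustration; this is not the library's code.

```python
from collections import Counter

def modified_kn_discounts(ngram_counts):
    """Return [0, D1, D2, D3+] using the standard modified Kneser-Ney formulas."""
    # n[i] = number of distinct n-grams that occur exactly i times
    n = Counter(c for c in ngram_counts.values() if c <= 4)
    y = n[1] / (n[1] + 2 * n[2])
    discounts = [0.0]  # index 0 unused, so discounts[i] applies to count i
    for i in range(1, 4):
        if n[i] == 0:
            # Assumption: no n-gram has this count, so no discount is needed.
            discounts.append(0.0)
        else:
            discounts.append(i - (i + 1) * y * n[i + 1] / n[i])
    return discounts

# Toy data: two n-grams seen once, one seen twice, none seen three times.
counts = Counter({
    ("a", "b", "c"): 1,
    ("a", "b", "d"): 1,
    ("b", "c", "d"): 2,
    ("c", "d", "e"): 4,
})
discounts = modified_kn_discounts(counts)
discounted = counts[("b", "c", "d")] - discounts[2]
print(discounts[2], discounted)  # discount for count 2, and the count left over
```

With n3 = 0 the second term of D2 vanishes, so D2 equals 2 and the n-gram seen twice is left with a discounted count of 0, which is what math.log then receives.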

There is already a check in the _calc_discounts function for cases where a discount goes negative; it probably needs another check for discounts that are too high.
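
One possible shape for such a check, as a rough sketch only: check_discounts is a hypothetical helper, not part of kneser_ney, and it assumes the discounts are stored as [0, D1, D2, D3+] so that discounts[i] applies to n-grams with count i.

```python
def check_discounts(discounts):
    """Raise ValueError if a discount is negative or would wipe out its count.

    Hypothetical helper: `discounts` is assumed to be [0, D1, D2, D3+].
    """
    for count, discount in enumerate(discounts[1:], start=1):
        if discount < 0:
            raise ValueError(
                f"Negative discount {discount} for count {count}; "
                "the dataset is probably too small.")
        if discount >= count:
            # count - discount would be <= 0, so math.log would fail later.
            raise ValueError(
                f"Discount {discount} for count {count} is too high; "
                "the dataset is probably too small.")

check_discounts([0, 0.5, 1.2, 1.8])    # passes silently
try:
    check_discounts([0, 0.5, 2.0, 0.0])
except ValueError as exc:
    print(exc)
```

Failing early like this would replace the opaque math domain error with a message that points at the size of the dataset.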

Ultimately, it's an issue of the input dataset being too small for this method.

Nov 18 '22 16:11 geugon