Unknown Values

Open ekravi opened this issue 9 years ago • 0 comments

Hi, I trained a very simple bigram model using the MakeKneserNeyArpaFromText class. The training data consisted of two strings, "hello world" and "hello bye". Scoring returned the following values:

"x hello"      -101.22185
"hello x"      NaN
"hello world"  -0.87942696

After debugging, I found that in the getLogProb method of the ArrayEncodedProbBackoffLm<W> class, when examining the bigram "x ", the condition on line 72 passes and the loop that follows produces NaN. It seems that unknown words aren't handled when they appear in the last n-gram.
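To illustrate the kind of failure described above, here is a minimal sketch of a bigram backoff lookup. It is not berkeleylm's code: the class `BackoffSketch`, its tables, and both methods are hypothetical, and the idea that a "missing entry" sentinel leaks into the sum is only an assumption about how NaN could arise. The tables hold only the entries that are readable in the ARPA file from this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not berkeleylm's implementation.
final class BackoffSketch {
    // log10 probabilities and backoff weights copied from the ARPA file in this issue.
    static final Map<String, Float> UNIGRAM_PROB = new HashMap<>();
    static final Map<String, Float> UNIGRAM_BACKOFF = new HashMap<>();
    static final Map<String, Float> BIGRAM_PROB = new HashMap<>();

    static {
        for (String w : new String[] {"world", "hello", "bye"}) {
            UNIGRAM_PROB.put(w, -0.698970f);
            UNIGRAM_BACKOFF.put(w, -0.176091f);
        }
        BIGRAM_PROB.put("hello bye", -0.522879f);
        BIGRAM_PROB.put("hello world", -0.522879f);
    }

    // Naive lookup: a missing unigram is reported as NaN and never checked,
    // so the NaN propagates through the addition.
    static float naiveLogProb(String w1, String w2) {
        Float bigram = BIGRAM_PROB.get(w1 + " " + w2);
        if (bigram != null) {
            return bigram;
        }
        float backoff = UNIGRAM_BACKOFF.getOrDefault(w1, 0.0f);
        float unigram = UNIGRAM_PROB.getOrDefault(w2, Float.NaN);
        return backoff + unigram;
    }

    // Guarded lookup: an unknown last word gets an explicit floor value
    // chosen by the caller (an assumed policy, not the library's).
    static float guardedLogProb(String w1, String w2, float unknownLogProb) {
        Float bigram = BIGRAM_PROB.get(w1 + " " + w2);
        if (bigram != null) {
            return bigram;
        }
        float backoff = UNIGRAM_BACKOFF.getOrDefault(w1, 0.0f);
        Float unigram = UNIGRAM_PROB.get(w2);
        return backoff + (unigram != null ? unigram : unknownLogProb);
    }

    public static void main(String[] args) {
        System.out.println(naiveLogProb("hello", "world"));
        System.out.println(naiveLogProb("hello", "x"));
        System.out.println(guardedLogProb("hello", "x", -100.0f));
    }
}
```

The numbers will not match the sentence scores reported above, because this sketch scores a single bigram and ignores sentence boundary tokens. It only shows that an unknown word in the last position needs an explicit check, while an unknown word in the context position falls back cleanly to the unigram.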

Also, the ARPA file contains no specific entry for unknown tokens. The content of the file is below.

\data\
ngram 1=4
ngram 2=5

\1-grams:
-0.698970 world -0.176091
-99.000000 <s> -0.477121
-0.698970 hello -0.176091
-0.698970 bye -0.176091

\2-grams:
-0.221849 world </s>
-0.134699 <s> hello
-0.522879 hello bye
-0.522879 hello world
-0.221849 bye </s>

\end\
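A quick way to confirm that a file has no unknown-token entry is to list the words in its `\1-grams:` section. The sketch below is a hypothetical helper, not part of berkeleylm. It assumes the conventional ARPA layout of one entry per line, and it assumes `<unk>` as the name of the unknown token, which this issue does not state. The `<s>` name on the -99 line of the sample is likewise an assumption based on ARPA convention, and the bigram section is trimmed.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of berkeleylm.
final class ArpaUnigramCheck {
    // Sample mirroring the file in this issue, one entry per line.
    static final String SAMPLE = String.join("\n",
        "\\data\\",
        "ngram 1=4",
        "ngram 2=5",
        "",
        "\\1-grams:",
        "-0.698970 world -0.176091",
        "-99.000000 <s> -0.477121", // token name assumed
        "-0.698970 hello -0.176091",
        "-0.698970 bye -0.176091",
        "",
        "\\2-grams:",
        "-0.522879 hello bye",
        "-0.522879 hello world",
        "",
        "\\end\\");

    // Collects the words listed in the \1-grams: section.
    static Set<String> unigramWords(String arpaText) {
        Set<String> words = new LinkedHashSet<>();
        boolean inUnigrams = false;
        for (String raw : arpaText.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("\\")) {
                // Section markers such as \data\, \1-grams:, \2-grams:, \end\
                inUnigrams = line.equals("\\1-grams:");
                continue;
            }
            if (!inUnigrams) {
                continue;
            }
            // Entry layout: log10 probability, word, optional backoff weight.
            String[] fields = line.split("\\s+");
            if (fields.length >= 2) {
                words.add(fields[1]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> words = unigramWords(SAMPLE);
        System.out.println("unigrams: " + words);
        System.out.println("has <unk>: " + words.contains("<unk>"));
    }
}
```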

ekravi, Aug 10 '15 10:08