berkeleylm
Unknown Values
Hi, I trained a very simple bigram model using the MakeKneserNeyArpaFromText class. The training data consisted of two strings, "hello world" and "hello bye". Scoring a few inputs returned the following:
"x hello": -101.22185
"hello x": NaN
"hello world": -0.87942696
After debugging, I found that in the getLogProb method of the class ArrayEncodedProbBackoffLm<W>, when examining the bigram "x ", the condition on line 72 passes and the loop that follows produces a NaN value. It seems that unknown words aren't handled when they appear in the last position of an n-gram.
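For illustration, here is a minimal Python sketch of the behaviour I would expect: a bigram backoff lookup that falls back to a floor score when the last word is unknown, so the result is never NaN. This is not berkeleylm's code. The table layout, the function name `bigram_log_prob`, and the `OOV_LOG_PROB` floor are assumptions made up for this example.

```python
import math

# Hypothetical in-memory tables (log10), loosely mirroring ARPA entries.
# Unigram values are (log_prob, backoff_weight); bigram values are log_prob.
UNIGRAMS = {
    "hello": (-0.698970, -0.176091),
    "world": (-0.698970, -0.176091),
    "bye": (-0.698970, -0.176091),
}
BIGRAMS = {
    ("hello", "world"): -0.522879,
    ("hello", "bye"): -0.522879,
}
# Assumed floor for out-of-vocabulary words; not a berkeleylm constant.
OOV_LOG_PROB = -100.0


def bigram_log_prob(context, word):
    """Backoff score for `word` given `context`; always a finite number."""
    if (context, word) in BIGRAMS:
        return BIGRAMS[(context, word)]
    # Back off: use the context's backoff weight (0.0 if the context is unknown).
    backoff = UNIGRAMS.get(context, (0.0, 0.0))[1]
    # Guard: an unknown last word gets the floor instead of an undefined value.
    unigram = UNIGRAMS.get(word, (OOV_LOG_PROB, 0.0))[0]
    return backoff + unigram


for pair in [("x", "hello"), ("hello", "x"), ("hello", "world")]:
    score = bigram_log_prob(*pair)
    print(pair, score, math.isnan(score))
```

With a guard like this, an unknown word gets a very low but finite score in either position.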
Also, the ARPA file contains no entry for unknown tokens. Its content is below.
\data
ngram 1=4
ngram 2=5
\1-grams:
-0.698970 world -0.176091
-99.000000 <s> -0.477121
-0.698970 hello -0.176091
-0.698970 bye -0.176091
\2-grams:
-0.221849 world </s>
-0.134699 <s> hello
-0.522879 hello bye
-0.522879 hello world
-0.221849 bye </s>
\end\
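As a sanity check, the reported score for "hello world" appears to be the sum of three bigram entries from the file. This assumes the -0.134699 entry is `<s> hello` and the -0.221849 entry for world is `world </s>`, which is an assumption about the sentence markers in the pasted file. The sketch below only reproduces that arithmetic in Python.

```python
# Bigram log10 probabilities copied from the ARPA listing above.
# The <s> and </s> readings are an assumption about the pasted file.
log_p_hello_given_start = -0.134699  # assumed "<s> hello"
log_p_world_given_hello = -0.522879  # "hello world"
log_p_end_given_world = -0.221849    # assumed "world </s>"

# A sentence score in a bigram model is the sum of its bigram log-probabilities.
sentence_score = (
    log_p_hello_given_start
    + log_p_world_given_hello
    + log_p_end_given_world
)
print(round(sentence_score, 6))
```

The sum agrees with the reported -0.87942696 to within rounding, so the known-word path seems to work as expected. Only the unknown-word case produces NaN.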