berkeleylm

Can I feed this library raw counts instead of text files, and have it compute the Kneser-Ney probabilities for me?

Open GoogleCodeExporter opened this issue 9 years ago • 1 comments

If I have a very large corpus and compute its n-gram counts in some 
distributed way, is there a way to give those raw counts to this code and have 
it build the model for me?

Original issue reported on code.google.com by [email protected] on 17 Jul 2013 at 7:27

GoogleCodeExporter, Jul 16 '15 16:07

The answer is "sort of". There is code in place to estimate Kneser-Ney 
probabilities from a corpus in Google n-gram format (see 
https://groups.google.com/forum/#!topic/berkeleylm-discuss/G6Ta2YTsAA0). 
However, it may have some bugs. Please try running it and see what happens. 
If it crashes, I'll have extra incentive to fix it. 
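
For the distributed-counting side, a rough sketch of how per-worker counts might be merged and written out as count lines. This assumes the Google n-gram convention of one n-gram per line, tokens separated by single spaces, then a tab and the count, sorted within each order. The helper names `merge_counts` and `format_lines` are made up for illustration and are not part of berkeleylm. The directory layout and file names the library expects are not covered here, so check the thread linked above before relying on this.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple

Ngram = Tuple[str, ...]


def merge_counts(shards: Iterable[Mapping[Ngram, int]]) -> Counter:
    """Sum the raw n-gram counts produced by each worker (hypothetical helper)."""
    total: Counter = Counter()
    for shard in shards:
        total.update(shard)  # Counter.update adds counts, it does not overwrite
    return total


def format_lines(counts: Mapping[Ngram, int], order: int) -> List[str]:
    """Return sorted 'w1 w2 ... wn<TAB>count' lines for n-grams of one order.

    Assumed line format: space-separated tokens, a tab, then the count.
    """
    rows = sorted((ng, c) for ng, c in counts.items() if len(ng) == order)
    return [" ".join(ng) + "\t" + str(c) for ng, c in rows]


if __name__ == "__main__":
    # Two toy shards standing in for the output of two workers.
    shards = [
        {("the",): 5, ("the", "cat"): 2},
        {("the", "cat"): 1, ("a", "dog"): 4},
    ]
    merged = merge_counts(shards)
    for line in format_lines(merged, order=2):
        print(line)
```

Each order would typically go to its own file or set of files. Only raw counts are written; the Kneser-Ney estimation itself is left to the library.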

Original comment by [email protected] on 17 Jul 2013 at 8:14

GoogleCodeExporter, Jul 16 '15 16:07