1-billion-word-language-modeling-benchmark
Formerly known as code.google.com/p/1-billion-word-language-modeling-benchmark
While using the preprocessed data from [http://www.statmt.org/lm-benchmark/](http://www.statmt.org/lm-benchmark/), I noticed that some of the training data is duplicated in the held-out (i.e. test) set. This is in addition to _train/news.en-00000-of-00100_ which appears...
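For anyone wanting to reproduce this check, here is a minimal sketch that counts training sentences appearing verbatim in the held-out data. The directory and file names are assumptions based on the usual layout of the extracted benchmark tarball; adjust them to your local copy:

```python
import glob

# Load every held-out sentence into a set (directory/file names below are
# assumptions based on the usual benchmark tarball layout; adjust locally).
heldout = set()
for path in glob.glob("heldout-monolingual.tokenized.shuffled/news.en.heldout-*"):
    with open(path, encoding="utf-8") as f:
        heldout.update(line.rstrip("\n") for line in f)

# Count training sentences that also appear verbatim in the held-out set.
dups = 0
for path in glob.glob("training-monolingual.tokenized.shuffled/news.en-*"):
    with open(path, encoding="utf-8") as f:
        dups += sum(1 for line in f if line.rstrip("\n") in heldout)

print(f"training sentences also present in held-out data: {dups}")
```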
I thought I'd point out that the page linking to this GitHub repo ([http://www.statmt.org/lm-benchmark/](http://www.statmt.org/lm-benchmark/)) contains a dead link (to the bash and Perl scripts) linking...
Hi, I have a question about the "prepare data" script. I downloaded the news.20XX.en.shuffled data from 2007 to 2011, but it does not yield the 2.9B words as stated...
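To verify the word count yourself, a minimal sketch follows that totals whitespace-delimited tokens across the yearly files. The news.YYYY.en.shuffled file names follow the statmt.org download naming and are assumptions here; point the paths at wherever you saved the files:

```python
# Total whitespace-delimited tokens in the raw shuffled news data.
# File names are assumptions based on the statmt.org download naming;
# adjust the paths to your local copies.
total = 0
for year in range(2007, 2012):  # 2007..2011 inclusive
    with open(f"news.{year}.en.shuffled", encoding="utf-8") as f:
        for line in f:
            total += len(line.split())

print(f"total words: {total:,}")  # compare against the quoted ~2.9B
```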