1-billion-word-language-modeling-benchmark
question on the corpus size / script
Hi, I have a question about the "prepare data" script. I downloaded the news.20XX.en.shuffled data from 2007 to 2011, and it does not yield the 2.9B words stated in the paper and the README page; it is far less. Does this mean I need to download more monolingual data from WMT11? If so, that step is not included in the script.
The reason I am asking is that I am trying to do the same thing for 2008-2015, and I come up with 2.8B words, and 2.6B words after dedup/sorting.
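To be concrete, here is roughly how I am counting: a minimal sketch only, with a placeholder file pattern, not the benchmark's actual prepare-data script.

```python
# Minimal sketch (placeholder file pattern, not the benchmark's actual script):
# whitespace tokens before and after sentence-level dedup. A real run over
# ~3B words would dedup on disk (e.g. sort -u) rather than hold a set in RAM.
import glob
import hashlib

seen = set()
total_words = 0   # token count before dedup
kept_words = 0    # token count after dropping duplicate lines

for path in sorted(glob.glob("news.*.en.shuffled")):
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            n = len(line.split())
            total_words += n
            # compact fingerprint instead of storing the full line
            fp = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
            if fp not in seen:
                seen.add(fp)
                kept_words += n

print(f"before dedup: {total_words} words")
print(f"after dedup:  {kept_words} words")
```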
Also, for the Interpolated KN 5-gram, was it just SRILM being used? Plain, or with make-big-lm?
Thanks.
Hi Vince (?),
On Mon, Oct 3, 2016 at 2:14 AM, vince62s [email protected] wrote:
> Hi, I have a question about the "prepare data" script. I downloaded the news.20XX.en.shuffled data from 2007 to 2011, and it does not yield the 2.9B words stated in the paper and the README page; it is far less. Does this mean I need to download more monolingual data from WMT11? If so, that step is not included in the script.
IIRC, we also provide the pre-processed data, courtesy of Tony Robinson.
It's been a while; my memory is no better than what I wrote at: https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/README.corpus_generation
Did you follow that and fail to reproduce the benchmark data distributed at http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz ?
> The reason I am asking is that I am trying to do the same thing for 2008-2015, and I come up with 2.8B words, and 2.6B words after dedup/sorting.
> Also, for the Interpolated KN 5-gram, was it just SRILM being used? Plain, or with make-big-lm?
Our own implementation of Interpolated KN; see the description/references/results in http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36472.pdf, in particular Table 4, which compares our back-off n-gram LM implementation against SRILM.
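For reference, the textbook form of the interpolated KN estimate we are talking about (the paper spells out the exact formulation and discounting we used):

\[
P_{\text{KN}}(w \mid h) = \frac{\max\bigl(c(h,w) - D,\, 0\bigr)}{\sum_{w'} c(h,w')} + \lambda(h)\, P_{\text{KN}}(w \mid h'),
\qquad
\lambda(h) = \frac{D \cdot N_{1+}(h\,\cdot)}{\sum_{w'} c(h,w')},
\]

where h' is the history h with its earliest word dropped, D is the discount, and N_{1+}(h ·) is the number of distinct word types observed after h, so that lambda(h) redistributes exactly the discounted mass and the model normalizes.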
-Ciprian
Thanks, Ciprian, especially for the last paper; interesting indeed.
As far as the source data are concerned, I figured it out through the Moses mailing list. I think that later on (in the WMT15 release) the news shuffle data were already deduped.
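A quick way to check whether a given release is already deduped (a rough sketch; the file name is a placeholder):

```python
# Rough check: if the total line count equals the unique line count, the
# release is already deduplicated (placeholder file name).
import hashlib

total, seen = 0, set()
with open("news.2015.en.shuffled", "rb") as f:
    for line in f:
        total += 1
        seen.add(hashlib.blake2b(line, digest_size=8).digest())

print(f"{total} lines, {len(seen)} unique -> already deduped: {total == len(seen)}")
```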
Anyway, I'll try a run on newest data.
Thanks.