1-billion-word-language-modeling-benchmark
question on the corpus size / script
Hi, I have a question about the "prepare data" script. I downloaded the news.20XX.en.shuffled data from 2007 to 2011, and it does not yield the 2.9B words stated in the paper and the README page; it is far less. Does this mean I need to download more monolingual data from WMT11? If so, that step is not included in the script.
The reason I am asking is that I am trying to do the same thing for 2008-2015, and I come up with 2.8B words, and 2.6B words after dedup/sorting.
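To be concrete, here is roughly how I am counting: a minimal sketch only, with a placeholder file pattern, not the benchmark's actual prepare-data script.

```python
# Minimal sketch (placeholder file pattern, not the benchmark's actual script):
# whitespace tokens before and after sentence-level dedup. A real run over
# ~3B words would dedup on disk (e.g. sort -u) rather than hold a set in RAM.
import glob
import hashlib

seen = set()
total_words = 0   # token count before dedup
kept_words = 0    # token count after dropping duplicate lines

for path in sorted(glob.glob("news.*.en.shuffled")):
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            n = len(line.split())
            total_words += n
            # compact fingerprint instead of storing the full line
            fp = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
            if fp not in seen:
                seen.add(fp)
                kept_words += n

print(f"before dedup: {total_words} words")
print(f"after dedup:  {kept_words} words")
```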
Also, for the Interpolated KN 5-gram, was it just SRILM being used? Plain, or with make-big-lm?
Thanks.
Hi Vince (?),
On Mon, Oct 3, 2016 at 2:14 AM, vince62s [email protected] wrote:
> Hi, I have a question about the "prepare data" script. I downloaded the news.20XX.en.shuffled data from 2007 to 2011, and it does not yield the 2.9B words stated in the paper and the README page; it is far less. Does this mean I need to download more monolingual data from WMT11? If so, that step is not included in the script.
IIRC, we also provide the pre-processed data, courtesy of Tony Robinson.
It's been a while; my memory is no better than what I wrote at: https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/README.corpus_generation
Did you follow that and fail to reproduce the benchmark data distributed at http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz ?
> The reason I am asking is that I am trying to do the same thing for 2008-2015, and I come up with 2.8B words, and 2.6B words after dedup/sorting.
> Also, for the Interpolated KN 5-gram, was it just SRILM being used? Plain, or with make-big-lm?
Our own implementation of Interpolated KN; see the description/references/results in http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36472.pdf, in particular Table 4, which compares our back-off n-gram LM implementation against SRILM.
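For reference, the textbook form of the interpolated KN estimate we are talking about (the paper spells out the exact formulation and discounting we used):

\[
P_{\text{KN}}(w \mid h) = \frac{\max\bigl(c(h,w) - D,\, 0\bigr)}{\sum_{w'} c(h,w')} + \lambda(h)\, P_{\text{KN}}(w \mid h'),
\qquad
\lambda(h) = \frac{D \cdot N_{1+}(h\,\cdot)}{\sum_{w'} c(h,w')},
\]

where h' is the history h with its earliest word dropped, D is the discount, and N_{1+}(h ·) is the number of distinct word types observed after h, so that lambda(h) redistributes exactly the discounted mass and the model normalizes.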
-Ciprian
Thanks, Ciprian, especially for the last paper; interesting indeed.
As far as the source data are concerned, I figured it out through the Moses mailing list. I think that later on (in the WMT15 release) the news shuffle data were already deduped.
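A quick way to check whether a given release is already deduped (a rough sketch; the file name is a placeholder):

```python
# Rough check: if the total line count equals the unique line count, the
# release is already deduplicated (placeholder file name).
import hashlib

total, seen = 0, set()
with open("news.2015.en.shuffled", "rb") as f:
    for line in f:
        total += 1
        seen.add(hashlib.blake2b(line, digest_size=8).digest())

print(f"{total} lines, {len(seen)} unique -> already deduped: {total == len(seen)}")
```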
Anyway, I'll try a run on newest data.
Thanks.