1-billion-word-language-modeling-benchmark
Formerly known as code.google.com/p/1-billion-word-language-modeling-benchmark
While using the preprocessed data from [http://www.statmt.org/lm-benchmark/](http://www.statmt.org/lm-benchmark/), I noticed that some of the training data is duplicated in the held-out (i.e. test) set. This is in addition to _train/news.en-00000-of-00100_ which appears...
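For anyone wanting to reproduce this check, here is a minimal sketch that counts training sentences appearing verbatim in the held-out data. The directory and file names are assumptions based on the usual layout of the extracted benchmark tarball; adjust them to your local copy:

```python
import glob

# Load every held-out sentence into a set (directory/file names below are
# assumptions based on the usual benchmark tarball layout; adjust locally).
heldout = set()
for path in glob.glob("heldout-monolingual.tokenized.shuffled/news.en.heldout-*"):
    with open(path, encoding="utf-8") as f:
        heldout.update(line.rstrip("\n") for line in f)

# Count training sentences that also appear verbatim in the held-out set.
dups = 0
for path in glob.glob("training-monolingual.tokenized.shuffled/news.en-*"):
    with open(path, encoding="utf-8") as f:
        dups += sum(1 for line in f if line.rstrip("\n") in heldout)

print(f"training sentences also present in held-out data: {dups}")
```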
I thought I'd point out that the page linking to this GitHub repo ([http://www.statmt.org/lm-benchmark/](http://www.statmt.org/lm-benchmark/)) contains a dead link (to the bash and Perl scripts) linking...
Hi, I have a question about the "prepare data" script. I downloaded the news.20XX.en.shuffled data from 2007 to 2011, but it does not yield the 2.9B words as stated...
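To verify the word count yourself, a minimal sketch follows that totals whitespace-delimited tokens across the yearly files. The news.YYYY.en.shuffled file names follow the statmt.org download naming and are assumptions here; point the paths at wherever you saved the files:

```python
# Total whitespace-delimited tokens in the raw shuffled news data.
# File names are assumptions based on the statmt.org download naming;
# adjust the paths to your local copies.
total = 0
for year in range(2007, 2012):  # 2007..2011 inclusive
    with open(f"news.{year}.en.shuffled", encoding="utf-8") as f:
        for line in f:
            total += len(line.split())

print(f"total words: {total:,}")  # compare against the quoted ~2.9B
```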