1-billion-word-language-modeling-benchmark

Some Training Data Duplicated in Heldout Data

Open bjascob opened this issue 6 years ago • 10 comments

While using the preprocessed data from http://www.statmt.org/lm-benchmark/ I noticed that some of the training data was duplicated in the heldout (a.k.a. test) data. This is in addition to train/news.en-00000-of-00100, which appears to be a complete copy of all the heldout data.

Using a simple Python script that puts the sentences into a dict, I see 303,465 unique heldout sentences and 3,223 duplicates of sentences in the training directory. Attached is a file, bw_duplicates.txt, with the duplicates. You can easily verify them by grepping for them in the training directory.
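
For reference, a minimal Python sketch of this kind of check, assuming the benchmark's standard directory layout (the actual script behind the numbers above is not attached, so details may differ):

# Build the set of training sentences from shards 00001..00099.
# Shard news.en-00000-of-00100 is skipped here, since (as noted above)
# it mirrors the heldout data.
train_sentences = set()
for i in range(1, 100):
    path = f"training-monolingual.tokenized.shuffled/news.en-{i:05d}-of-00100"
    with open(path, encoding="utf-8") as f:
        train_sentences.update(line.rstrip("\n") for line in f)

# Collect the unique heldout sentences across all 50 shards and intersect.
heldout_sentences = set()
for i in range(50):
    path = f"heldout-monolingual.tokenized.shuffled/news.en.heldout-{i:05d}-of-00050"
    with open(path, encoding="utf-8") as f:
        heldout_sentences.update(line.rstrip("\n") for line in f)

duplicates = heldout_sentences & train_sentences
print(f"{len(heldout_sentences)} unique heldout sentences, "
      f"{len(duplicates)} also present in the training shards")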

Is this a known issue? My concern is that many people use this data for benchmarking language models, and roughly 1% of the test sentences also appear in the training data. That's probably not going to change the results much, but it isn't desirable either.

bjascob avatar Sep 10 '18 18:09 bjascob

Hi,

The training data is: 1-billion-word-language-modeling-benchmark/training-monolingual.tokenized.shuffled/news.en-000??-of-00100

As you will notice, the fileglob expansion is missing the news.en-00000-of-00100 file which is used as held-out data: 1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100

Since that is a bit large, we sharded it 50-way, giving us 50 smaller sets for evaluation, parameter tuning, etc. The test set on which we reported results in the paper is: 1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050
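
In other words, a sketch of the intended split (file paths as in the released tarball; note that training shard 00000 is deliberately absent from the list):

# Training data: shards 00001..00099 only.
train_files = [
    f"training-monolingual.tokenized.shuffled/news.en-{i:05d}-of-00100"
    for i in range(1, 100)
]

# Held-out data: news.en-00000-of-00100, released separately and split 50-way;
# the paper's test set is the first of those 50 shards.
test_file = "heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050"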

You can find more details on all this in the README files at: https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark

Hope this answers your questions, -Ciprian

ciprian-chelba avatar Sep 10 '18 19:09 ciprian-chelba

Yes, that was my understanding.
I'm saying that sentences in the attached bw_duplicates.txt file show up in both training-monolingual.tokenized.shuffled/news.en-00000-of-00100 and training-monolingual.tokenized.shuffled/news.en-000xx-of-00100 (where xx is 01 to 99). For instance, the first duplicate sentence in the list, "Bush is remembered by many Haitians -- ", shows up verbatim in training/news.en-00000-of-00100 and training/news.en-00056-of-00100. (Note that shards 59 and 75 also contain nearly the same sentence, but they differ by the "--".)

bjascob avatar Sep 10 '18 22:09 bjascob

Sorry, I thought your concern was that somehow the entire held-out set is also part of the training set.

Well, as I mention at https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/scripts/split-input-data.perl#L17, after tokenization I ran:

$ sort -u --parallel=10 news.20XX.en.shuffled.tokenized --output=news.20XX.en.shuffled.tokenized.sorted

to get the input data.

A sample command line for running this:

./scripts/split-input-data.perl \
  --output_file_base="$PWD/training-monolingual.tokenized.shuffled.perl/news.en" \
  --num_shards=100 \
  --input_file=./training-monolingual.tokenized/news.20XX.en.shuffled.tokenized.sorted

So the problem is in the Unix sort then?! Hard to believe. Perhaps what starts as different UTF-8 character sequences at sort time gets later normalized to the same sequence?! Not sure where the duplicates could come from...

This was originally done in MapReduce internally, but that data could not be released for legal reasons; I could only release code. As I explained at point 3 in README.corpus_generation (https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/README.corpus_generation), we went through a few iterations to make sure the results I was getting on my machine were the same as the ones that Tony got on his; I guess some bugs survived. :)

How much of the test set 1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050 overlaps with the training data?

Generally speaking, it was pointed out to me by users of this data that some level of overlap between training and test is to be expected in practice, especially at the short sentence end. So in that sense de-duping the data is not ideal either... But we started from a far worse situation.

ciprian-chelba avatar Sep 10 '18 23:09 ciprian-chelba

Nevermind, here is probably the reason:

Reading through https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/blob/master/scripts/get-data.sh#L40, I see that the unique sort of the data was done before running punctuation normalization and tokenization, which explains the origin of the duplicates.
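
As a toy illustration of that mechanism (the normalize() helper below is a made-up stand-in, not the benchmark's normalize-punctuation.perl; it only mimics one of its effects, mapping a Unicode dash to ASCII):

# Two raw lines that differ only in how the dash is encoded: `sort -u` on the
# raw text keeps both, because the byte sequences differ.
raw = [
    "The committee met on Tuesday \u2013 officials said .",  # en dash in the raw crawl
    "The committee met on Tuesday -- officials said .",       # ASCII dashes elsewhere
]
assert len(set(raw)) == 2

# Hypothetical stand-in for the punctuation-normalization step.
def normalize(line: str) -> str:
    return line.replace("\u2013", "--").replace("\u2014", "--")

# After normalization/tokenization the two lines are identical, so if one copy
# lands in a training shard and the other in the held-out portion, the released
# data ends up with a train/test duplicate.
assert len({normalize(line) for line in raw}) == 1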

ciprian-chelba avatar Sep 10 '18 23:09 ciprian-chelba

P.p.s. Looking at the history of get-data.sh, it was the last commit (https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark/commit/3780330c599223e8637ef306f7d14208f2faab66?diff=split) that moved the sorting of the data before the normalization/tokenization. The commit description explains the decision:

Sort the date before doing the perl pre-processing.

This will make sure the training/held-out partitioning of the data is the same irrespective of how Perl handles Unicode peculiarities in the raw text.

So it seems that (other than typing "date" instead of "data") we could not have done it better. I am now relieved. :-)

Thanks for pointing this out! It would be great to know what percentage of the test set sentences are observed as such in the training data.

ciprian-chelba avatar Sep 10 '18 23:09 ciprian-chelba

For all 50 heldout shards, I see 303,465 unique sentences and 3,223 duplicates of sentences in the training directory, so roughly 1%.

bjascob avatar Sep 10 '18 23:09 bjascob

Would it be hard to get the exact number for 1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100 only?

I could do it too, but if you have the script handy and are willing to re-run it...

ciprian-chelba avatar Sep 10 '18 23:09 ciprian-chelba

Wrong copy/paste, sorry:

I meant: 1-billion-word-language-modeling-benchmark/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050

ciprian-chelba avatar Sep 10 '18 23:09 ciprian-chelba

For news.en.heldout-00000-of-00050 there were 6,005 unique sentences and 70 that were duplicates of training data. The duplicate sentences are listed in the following file: bw_dup_shard0.txt.
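
For reference, the single-shard figure could be reproduced with a variant of the earlier sketch (same path assumptions; not the actual script used here):

test_path = "heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050"

# Rebuild the training-sentence set from shards 00001..00099, as before.
train = set()
for i in range(1, 100):
    with open(f"training-monolingual.tokenized.shuffled/news.en-{i:05d}-of-00100",
              encoding="utf-8") as f:
        train.update(line.rstrip("\n") for line in f)

with open(test_path, encoding="utf-8") as f:
    test = {line.rstrip("\n") for line in f}

overlap = test & train
print(f"{len(test)} unique test sentences; {len(overlap)} "
      f"({len(overlap) / len(test):.1%}) also appear in a training shard")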

bjascob avatar Sep 11 '18 00:09 bjascob

Thanks!

ciprian-chelba avatar Sep 11 '18 00:09 ciprian-chelba