Added WMT News Crawl Dataset for language modeling

Open anmolsjoshi opened this issue 5 years ago • 26 comments

Fixes #645

  • Added WMT News Crawl dataset for language modeling

anmolsjoshi avatar Feb 03 '20 06:02 anmolsjoshi

@zhangguanheng66 I added the 2011 dataset per your instructions, but I was unable to find the validation and test sets. What are your thoughts on this?

anmolsjoshi avatar Feb 03 '20 06:02 anmolsjoshi

@zhangguanheng66 I noticed an issue in writing this PR.

zip and tar files are handled differently.

Assuming both .zip and .tar files are stored in .data, the filenames returned by extract_archive differ.

.zip archives return paths such as "en-us/train.txt"

.tar archives return paths such as ".data/en-us/train.txt"

This causes an inconsistency in the _setup_datasets function, since root is joined to the returned filenames.

Is this by design? In my opinion, it should be corrected.

If you feel there is a mistake in the download function, you could open a separate issue and/or PR to fix it. It's better to do it separately for a good record.

anmolsjoshi avatar Feb 04 '20 02:02 anmolsjoshi
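
For context, a minimal sketch of how this inconsistency bites when root is joined to the returned names (the file layout here is illustrative, taken from the examples above):

import os

root = '.data'

# Paths as returned by extract_archive in the two cases described above
zip_names = ['en-us/train.txt']           # .zip: no root prefix
tar_names = ['.data/en-us/train.txt']     # .tar: root prefix already included

print(os.path.join(root, zip_names[0]))   # .data/en-us/train.txt  (as intended)
print(os.path.join(root, tar_names[0]))   # .data/.data/en-us/train.txt  (root doubled)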

@zhangguanheng66 as a note, this PR downloads the files correctly and sets up the dataset just fine. However, creating the dataset takes a very long time: the News Crawl data has over 2 million rows, so tokenizing the entire corpus is slow.

anmolsjoshi avatar Feb 04 '20 02:02 anmolsjoshi
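
For a sense of where the time goes, a rough sketch of the tokenization loop (the tokenizer choice and the extracted file name are assumptions for illustration, not necessarily what this PR does):

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')

# Tokenizing every line of a multi-million-line corpus dominates the setup time.
with open('.data/news.2011.en.shuffled', encoding='utf-8') as f:  # hypothetical extracted file name
    tokens = [tokenizer(line) for line in f]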

Thanks for the contribution. I understand; it's a huge dataset.

How do you feel about the new abstraction, compared with the old pattern? Any feedback?

zhangguanheng66 avatar Feb 04 '20 15:02 zhangguanheng66

@zhangguanheng66 this PR should be good to go. Let me know if you have any comments!

anmolsjoshi avatar Feb 05 '20 05:02 anmolsjoshi

@zhangguanheng66 any thoughts on overloading the __iter__ method for language modeling?

anmolsjoshi avatar Feb 05 '20 15:02 anmolsjoshi

Ideally, the __iter__ method should be handled by DataLoader rather than torchtext. We want to eventually retire that legacy code, including batch and split.

zhangguanheng66 avatar Feb 05 '20 16:02 zhangguanheng66
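
A minimal sketch of that direction, with batching handled by torch.utils.data.DataLoader rather than an overloaded __iter__ (the stand-in data and collate function are assumptions for illustration):

import torch
from torch.utils.data import DataLoader

# Stand-in for a LanguageModelingDataset whose items are tensors of token ids.
train_data = [torch.randint(0, 1000, (50,)) for _ in range(8)]

def collate(batch):
    # Concatenate the examples of a batch into one contiguous stream of ids.
    return torch.cat(batch)

loader = DataLoader(train_data, batch_size=2, shuffle=False, collate_fn=collate)
for ids in loader:
    print(ids.shape)  # torch.Size([100]) per batch; batchify further as needed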

The main reason for the changes is the behavior shown in the code below.

I can revert all my code to the original and include an if/else in the _setup_datasets function for non-zip files.

import os
from torchtext.utils import extract_archive, download_from_url

os.mkdir('.data')
download_from_url('http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz')
# '.data/validation.tar.gz'

download_from_url('https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip')
# '.data/wikitext-2-v1.zip'

extract_archive('.data/validation.tar.gz')
# ['.data/val.de', '.data/val.en']

extract_archive('.data/wikitext-2-v1.zip')
# ['wikitext-2/', 'wikitext-2/wiki.test.tokens', 'wikitext-2/wiki.valid.tokens', 'wikitext-2/wiki.train.tokens']

@cpuhrsch @zhangguanheng66

anmolsjoshi avatar Feb 05 '20 18:02 anmolsjoshi

@anmolsjoshi - I think it makes sense to fix this separately and before merging this.

cpuhrsch avatar Feb 05 '20 21:02 cpuhrsch

@zhangguanheng66 @cpuhrsch - I'll push a branch up later today fixing this tar/zip issue. And we can move forward with the WMT dataset after.

Thanks for your review!

anmolsjoshi avatar Feb 05 '20 21:02 anmolsjoshi

@zhangguanheng66 @cpuhrsch - quick question before I proceed: which of the filenames returned for zip vs. tar is correct, i.e. should the root folder be prepended to the path?

I think the tar behavior (root prepended) might be better, as it is consistent with the download_from_url function.

anmolsjoshi avatar Feb 05 '20 22:02 anmolsjoshi

@anmolsjoshi - I think we should always return the full path, because it gives more flexibility and the user knows exactly what's going on. So, if prepending root gives us that, that's what we should do.

cpuhrsch avatar Feb 06 '20 16:02 cpuhrsch
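
A rough sketch of what "always return the full path" could look like on the zip side (an assumption about the direction of the fix, not the merged implementation):

import os
import zipfile

def extract_zip_full_paths(from_path, to_path='.data'):
    # Extract a .zip archive and return full paths for every member,
    # mirroring what download_from_url and the .tar branch already return.
    with zipfile.ZipFile(from_path, 'r') as zf:
        zf.extractall(to_path)
        return [os.path.join(to_path, name) for name in zf.namelist()]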

~Closing in favor of #700~

anmolsjoshi avatar Feb 24 '20 21:02 anmolsjoshi

@zhangguanheng66 @cpuhrsch I have incorporated changes requested in an earlier review and made some additional changes. Here is a summary:

  • Removed the code related to the extract_archive fix and moved it to #692
  • Added a check on data_select for WMT News Crawl; it can only be 'train', since validation and test sets are not provided (a sketch of the check follows below)
  • Added tests that load the dataset and check that errors are raised for an invalid data_select
  • Corrected spelling errors, mainly changing tupel to tuple

anmolsjoshi avatar Feb 25 '20 02:02 anmolsjoshi
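
Roughly, the data_select guard could look like the sketch below (function and argument names are assumptions based on the discussion, not the exact code in this PR):

def _check_data_select(data_select):
    # Illustrative version of the guard described above; names are assumptions.
    if isinstance(data_select, str):
        data_select = (data_select,)
    if not set(data_select).issubset({'train'}):
        raise ValueError("WMT News Crawl only provides a 'train' split, "
                         "got {}".format(data_select))
    return data_select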

It seems that there is an even larger Wikitext-like dataset, e.g. https://dl.fbaipublicfiles.com/fairseq/data/wikipedia.en_filtered.gz. Any thoughts?

zhangguanheng66 avatar Feb 25 '20 20:02 zhangguanheng66

Reading from the website, 2009 is the largest dataset.

  • From Europarl (403MB) md5 sha1
  • From the News Commentary corpus (41MB) md5 sha1
  • From the News Crawl corpus (2007 only) (1.1 GB) md5
  • From the News Crawl corpus (2008 only) (3.4 GB) md5
  • From the News Crawl corpus (2009 only) (3.7 GB) md5
  • From the News Crawl corpus (2010 only) (1.4 GB) md5
  • From the News Crawl corpus (2011 only) (229 MB) md5 sha1

The reason I picked 2011 was your comment on #645.

anmolsjoshi avatar Feb 25 '20 20:02 anmolsjoshi

I think it could be interesting to include at least one more, larger, Crawl corpus so users have the option to try both.

zhangguanheng66 avatar Feb 25 '20 20:02 zhangguanheng66

Is the idea to have multiple functions for different years' datasets, or to provide an argument for the year?

anmolsjoshi avatar Feb 25 '20 20:02 anmolsjoshi

Correct me if I'm wrong @anmolsjoshi @cpuhrsch, but I don't think there are significant differences between years. We could just provide different Crawl corpora of different sizes. Users may want to train their models with different datasets depending on their memory.

zhangguanheng66 avatar Feb 25 '20 22:02 zhangguanheng66

Should I update the current dataset to 2009?

Which other datasets would you want to provide?

anmolsjoshi avatar Feb 25 '20 22:02 anmolsjoshi

Maybe we could add one more argument (as you did for language) so users can explicitly choose the one they like. And in the docs, we clearly mark the number of tokens and the memory size for the corresponding dataset.

I'm using the 2010 English one, 1.4 GB in total, for my BERT model. The pipeline you built is flexible enough to switch between years/languages. Thanks a lot.

zhangguanheng66 avatar Feb 25 '20 23:02 zhangguanheng66

@zhangguanheng66 thanks for the comments.

I've added an option for users to pass the year, and a table in the docstring with details about the News Crawl datasets by year.

Let me know what you think.

What are your thoughts on adding a check that the provided year and language are valid?

anmolsjoshi avatar Feb 26 '20 18:02 anmolsjoshi
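
Such a validity check might look like the sketch below (the supported years come from the listing quoted earlier; the language set and names are assumptions):

# Supported values are illustrative; only the years from the News Crawl listing
# above are included, and the language set is an assumption.
_SUPPORTED_YEARS = {2007, 2008, 2009, 2010, 2011}
_SUPPORTED_LANGUAGES = {'en'}

def _check_year_language(year, language):
    if year not in _SUPPORTED_YEARS:
        raise ValueError('year must be one of {}, got {}'.format(sorted(_SUPPORTED_YEARS), year))
    if language not in _SUPPORTED_LANGUAGES:
        raise ValueError('language must be one of {}, got {}'.format(sorted(_SUPPORTED_LANGUAGES), language))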

@zhangguanheng66 I saw the discussion in #691 and #690 and the code in #696. Is there value in decoupling vocab and LanguageModelingDataset as well?

anmolsjoshi avatar Feb 27 '20 04:02 anmolsjoshi

@anmolsjoshi We want to decouple the vocab object from the dataset but are not yet sure about the design. I will work through some cases and pull you guys in for a look.

zhangguanheng66 avatar Feb 27 '20 16:02 zhangguanheng66

Thanks! Let me know if any other changes are needed on this PR!

anmolsjoshi avatar Feb 27 '20 16:02 anmolsjoshi

Hi @anmolsjoshi!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Oct 30 '20 17:10 facebook-github-bot