Added WMT News Crawl Dataset for language modeling
Fixes #645
- Added WMT News Crawl dataset for language modeling
@zhangguanheng66 I added the 2011 dataset per your instructions, but I was unable to find the validation and test sets. What are your thoughts on this?
@zhangguanheng66 I noticed an issue while writing this PR: zip and tar files are handled differently.
Assuming both .zip and .tar archives are stored in .data, the filenames returned by extract_archive differ:
- .zip output, e.g. "en-us/train.txt"
- .tar output, e.g. ".data/en-us/train.txt"
This causes an inconsistency in the _setup_datasets function, because the root is joined onto the returned filenames.
Is this by design? In my opinion, it should be corrected.
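To make that concrete, here is a minimal sketch of the effect; the paths and the join below are illustrative of the behaviour described above, not the actual torchtext code:

import os

root = '.data'
zip_name = 'en-us/train.txt'         # filename as returned for a .zip archive
tar_name = '.data/en-us/train.txt'   # filename as returned for a .tar archive

# _setup_datasets joins root onto whatever extract_archive returns,
# so the tar case ends up with the root prefix duplicated.
print(os.path.join(root, zip_name))  # .data/en-us/train.txt
print(os.path.join(root, tar_name))  # .data/.data/en-us/train.txt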
If you feel there is a mistake in the download function, you could open a separate issue and/or PR to fix it. It's better to do it separately for record-keeping purposes.
@zhangguanheng66 as a note, this PR downloads the files correctly and sets up the dataset just fine. However, creating the dataset takes a very long time: the News Crawl dataset has over 2 million rows, so tokenizing the entire corpus is slow.
Thanks for the contribution. I understand. It's a huge dataset.
How do you feel about the new abstraction, compared with the old pattern? Any feedback?
@zhangguanheng66 this PR should be good to go. Let me know if you have any comments!
@zhangguanheng66 any thoughts on overloading the __iter__ method for language modeling?
Ideally, the __iter__ method should be handled by DataLoader, rather than by torchtext. We want to eventually retire that legacy code, including batch and split.
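For context, a minimal sketch of the pattern being described, assuming a plain map-style dataset of token-id tensors; the toy dataset and collate function here are illustrative, not the torchtext API:

import torch
from torch.utils.data import DataLoader

# A toy dataset: each item is a 1-D tensor of token ids.
data = [torch.randint(0, 1000, (35,)) for _ in range(8)]

# Batching and iteration are delegated to DataLoader instead of a custom
# __iter__/batch/split implementation inside the dataset class.
loader = DataLoader(data, batch_size=4, shuffle=True,
                    collate_fn=lambda batch: torch.stack(batch))

for batch in loader:
    print(batch.shape)  # torch.Size([4, 35])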
The main reason for these changes is the code below.
I can revert all my code to the original and include an if/else in the _setup_datasets function for non-zip files.
import os
from torchtext.utils import extract_archive, download_from_url

os.mkdir('.data')

download_from_url('http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz')
# '.data/validation.tar.gz'
download_from_url('https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip')
# '.data/wikitext-2-v1.zip'

# tar: extracted filenames come back with the root ('.data/') prepended
extract_archive('.data/validation.tar.gz')
# ['.data/val.de', '.data/val.en']

# zip: extracted filenames come back without the root
extract_archive('.data/wikitext-2-v1.zip')
# ['wikitext-2/', 'wikitext-2/wiki.test.tokens', 'wikitext-2/wiki.valid.tokens', 'wikitext-2/wiki.train.tokens']
@cpuhrsch @zhangguanheng66
@anmolsjoshi - I think it makes sense to fix this separately and before merging this.
@zhangguanheng66 @cpuhrsch - I'll push a branch up later today fixing this tar/zip issue. And we can move forward with the WMT dataset after.
Thanks for your review!
@zhangguanheng66 @cpuhrsch - quick question before I proceed: are the filenames returned from zip or tar correct, i.e. should the root folder be prepended to the path?
I think the filenames from tar (root prepended) might be better, as they are consistent with the download_from_url function.
@anmolsjoshi - I think we should always return the full path, because it gives more flexibility and the user knows exactly what's going on. So, if prepending root gives us that, that's what we should do.
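As an illustration of that direction, a hypothetical helper (not the actual patch in #692) that normalizes extracted filenames so both archive types come back with the root prefix:

import os

def normalize_extracted_paths(filenames, root='.data'):
    # Ensure every extracted filename carries the root prefix,
    # regardless of whether it came from a .zip or a .tar archive.
    normalized = []
    for name in filenames:
        if not name.startswith(root + os.sep) and name != root:
            name = os.path.join(root, name)
        normalized.append(name)
    return normalized

print(normalize_extracted_paths(['wikitext-2/wiki.train.tokens']))
# ['.data/wikitext-2/wiki.train.tokens']
print(normalize_extracted_paths(['.data/val.de', '.data/val.en']))
# ['.data/val.de', '.data/val.en']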
~Closing in favor of #700~
@zhangguanheng66 @cpuhrsch I have incorporated changes requested in an earlier review and made some additional changes. Here is a summary:
- Removed any code related to extract_archive fix and moved it to #692
- Added a check on data_select for WMT News Crawl; it can only be 'train' since the other splits are not provided (see the sketch after this list)
- Added tests to load the dataset and check if errors are raised for incorrect data_select
- Made corrections to spelling errors, mainly correcting tupel to tuple
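A minimal sketch of the data_select check mentioned above; the helper name and error message are illustrative, not the exact code in this PR:

def _check_data_select(data_select):
    # WMT News Crawl only ships a training split, so anything other
    # than 'train' should fail fast with a clear error.
    if isinstance(data_select, str):
        data_select = (data_select,)
    if not set(data_select).issubset({'train'}):
        raise ValueError(
            "WMT News Crawl only provides a 'train' split, got {}".format(data_select))

_check_data_select('train')               # OK
# _check_data_select(('train', 'valid'))  # would raise ValueError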
It seems that there is an even larger Wikitext dataset like this one: https://dl.fbaipublicfiles.com/fairseq/data/wikipedia.en_filtered.gz Any thoughts?
Reading from the website, 2009 is the largest dataset.
- From Europarl (403MB) md5 sha1
- From the News Commentary corpus (41MB) md5 sha1
- From the News Crawl corpus (2007 only) (1.1 GB) md5
- From the News Crawl corpus (2008 only) (3.4 GB) md5
- From the News Crawl corpus (2009 only) (3.7 GB) md5
- From the News Crawl corpus (2010 only) (1.4 GB) md5
- From the News Crawl corpus (2011 only) (229 MB) md5 sha1
The reason I picked 2011 was due to your comment on #645
I think it could be interesting to include at least one more larger Crawl corpus so users have the option to try both.
Is the idea to have multiple functions for different years' datasets or provide an argument for the year?
Correct me if I'm wrong @anmolsjoshi @cpuhrsch, but I don't think there are significant differences between years. We could just provide different Crawl corpora of different sizes. Users may want to train their models on different datasets depending on their memory.
Should I update the current dataset to 2009?
Which other datasets would you want to provide?
Maybe we could add one more argument (as you did for language) so users can explicitly choose the one they like. And in the docs, we clearly mark the number of tokens and the memory size for the corresponding dataset.
I'm using the 2010 English one, 1.4 GB in total, for my BERT model. The pipeline you built is flexible enough to switch between years/languages. Thanks a lot.
@zhangguanheng66 thanks for the comments.
I've added an option for users to pass the year, and a table in the docstring with details about the News Crawl datasets by year.
Let me know what you think.
What are your thoughts on adding a check whether the provided year and language are valid?
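For discussion, a hedged sketch of what such a check could look like; the supported years and languages below are placeholders, not the actual values used in this PR:

_SUPPORTED_YEARS = {2007, 2008, 2009, 2010, 2011}      # placeholder set
_SUPPORTED_LANGUAGES = {'en', 'de', 'fr', 'es', 'cs'}  # placeholder set

def _check_year_and_language(year, language):
    # Fail early with a readable message instead of a broken download URL.
    if year not in _SUPPORTED_YEARS:
        raise ValueError('year must be one of {}, got {}'.format(
            sorted(_SUPPORTED_YEARS), year))
    if language not in _SUPPORTED_LANGUAGES:
        raise ValueError('language must be one of {}, got {}'.format(
            sorted(_SUPPORTED_LANGUAGES), language))

_check_year_and_language(2011, 'en')    # OK
# _check_year_and_language(2012, 'en')  # would raise ValueError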
@zhangguanheng66 I saw the discussion in #691 and #690 and the code in #696 - is there value in decoupling vocab and LanguageModelingDataset as well?
@anmolsjoshi We want to decouple the vocab object from the dataset but are not yet sure about the design. I will work on some cases and pull you guys in for a look.
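Purely as a sketch of one possible direction, assuming a simple Counter-based vocab built outside the dataset and handed in for numericalization (this is not a proposed torchtext API):

from collections import Counter

def build_vocab(token_lists, specials=('<unk>', '<pad>')):
    # Count tokens across the corpus and assign ids, specials first.
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counter.most_common()]
    return {tok: idx for idx, tok in enumerate(itos)}

corpus = [['the', 'news', 'crawl'], ['the', 'corpus']]
vocab = build_vocab(corpus)

# The dataset would then just numericalize with a vocab it is handed,
# rather than building the vocab itself.
numericalized = [[vocab.get(tok, vocab['<unk>']) for tok in tokens]
                 for tokens in corpus]
print(numericalized)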
Thanks! Let me know if any other changes are needed on this PR!
Hi @anmolsjoshi!
Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.
You currently have a record in our system, but we do not have a signature on file.
In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!