Added BrWac dataset
Thank you for your contribution!
Please read https://www.tensorflow.org/datasets/contribute#pr_checklist to make sure your PR follows the guidelines.
Add Dataset
- Dataset Name: BrWac
- Issue Reference: 3642
- `dataset_info.json` Gist: https://gist.github.com/marcospiau/58e9b3528288b2fbc419dd86ec1f18b7#file-brwac_dataset_info-json
Description
The Brazilian Portuguese Web as Corpus is a large corpus constructed following the Wacky framework, which was made public for research purposes. The current corpus version, released in January 2017, is composed of 3.53 million documents and 2.68 billion tokens. In order to use this dataset, you must request access by filling out the form at the official homepage. Please note that this resource is available solely for academic research purposes, and you agree not to use it for any commercial applications.
Title and text fields are preprocessed using the ftfy (Speer, 2019) Python library. P.S.: The description is extracted from the official homepage.
Checklist
- [X] Address all TODOs
- [X] Add alphabetized import to subdirectory's `__init__.py`
- [ ] Run `download_and_prepare` successfully
- [ ] Add checksums file
- [X] Properly cite in `BibTeX` format
- [X] Add passing test(s)
- [X] Add test data
- [X] If using additional dependencies (e.g. `scipy`), use lazy imports (if applicable)
- [ ] Add data generation script (if applicable)
- [X] Lint code
Regarding the unchecked boxes in the checklist:
- This dataset uses a manually downloaded file, and I was not able to generate the checksums file, even though I ran `tfds build` with the `--register_checksums` flag.
- I didn't run `download_and_prepare`, but I successfully built the dataset from the command line using `tfds build brwac --register_checksums --manual_dir=<DIRECTORY_WITH_MANUAL_DOWNLOAD>`.
Hello @marcospiau , and thank you for your contribution!
Datasets with manually downloaded files do not need checksum files.
We would be happy to merge your PR -- could you please resolve the conflict in the `setup.py` file before that?
This typically requires running:

```shell
# On your feature branch
git fetch origin master
git rebase origin/master
```
Hi guys, thanks for reviewing the code!
I've resolved the conflicts in `setup.py`; please let me know if there is anything else I can help with.
Best, Marcos
Hi @ccl-core , could you please take a look and confirm everything is OK?
Best, Marcos
Hello @marcospiau , thank you for the heads-up!
I requested the manual dataset at the given homepage but I received an email that their server is down.
Did you also encounter a similar problem? As soon as I can access the data to put in the manual_dir I'll finish the testing and I'll be able to complete the onboarding process.
Hi @ccl-core,
I tested the form request just now and had the same problem. I will contact the dataset maintainers and get back to you as soon as I have an answer.
Best, Marcos
Hi @ccl-core, I spoke to them, the servers are being migrated and should be OK by the end of next week. I'll let you know as soon as the manual download is working again. Best, Marcos
Thank you very much, @marcospiau !
Hi, @ccl-core. It took a little longer than expected, but the new links are already working. Could you please check if everything is OK now? P.S.: The links are different from the previous ones, so the form needs to be filled out again.
Best, Marcos
Thank you @marcospiau , I'm having a look!
I was wondering whether the dependency on ftfy is really necessary here? Or would there be a workaround?
Hi, @ccl-core. Thanks for the review! This dependency is included because the raw text contains many errors due to mojibake. One could write code to replicate what this dependency does, but the final code would probably be very similar to ftfy; besides, the few existing large language models pretrained in Portuguese use BrWac with ftfy preprocessing, so I think it's a good idea to use it as default preprocessing. What do you think?
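For context, here is a minimal sketch of the kind of repair ftfy performs. The real preprocessing uses `ftfy.fix_text`, which detects and reverses many encoding mix-ups automatically; the helper below (a hypothetical name, not part of ftfy) only handles the single most common case, UTF-8 bytes misread as cp1252:

```python
def fix_cp1252_mojibake(text: str) -> str:
    """Reverse UTF-8 text that was wrongly decoded as cp1252."""
    try:
        # Re-encode the garbled characters back to their raw bytes,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this flavor of mojibake; leave the text unchanged.
        return text

garbled = "donâ€™t"  # UTF-8 for "don't" (curly apostrophe) misread as cp1252
print(fix_cp1252_mojibake(garbled))  # -> don't (with U+2019)
```

This illustrates why replicating the dependency by hand is unattractive: a robust version would have to guess among several encodings and handle partial damage, which is essentially what ftfy already does.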
Dear @marcospiau , I understand.
By the way, it seems like the `checksums.tsv` file is still missing? See `tensorflow_datasets/text/bool_q/checksums.tsv` as an example.
You can register the new checksums with `tfds build --register_checksums`.
Thank you!
Hi @ccl-core! The link for downloading this dataset is available once a form is filled out, so a manual download is required. Is it possible to generate checksums for manually downloaded files?
Hi @marcospiau ! Yes, it is possible :)
See e.g. the kaggle_wit dataset as an example: https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/vision_language/wit_kaggle
Hi @ccl-core! I tried the command you provided, but only got an empty `checksums.tsv` file. Also, the example file you pointed to (kaggle_wit) is itself empty. Can I create a `checksums.tsv` file manually?
I don't know whether the instructions were different when I first submitted my PR, but at the time I was informed that checksum files are not required for manually downloaded datasets:
> Hello @marcospiau , and thank you for your contribution!
> Datasets with manually downloaded files do not need checksum files.
> We would be happy to merge your PR -- could you please resolve the conflict in the `setup.py` file before that? This typically requires running:
> ```shell
> # On your feature branch
> git fetch origin master
> git rebase origin/master
> ```
Hi @ccl-core , just a heads-up! Can we proceed with the onboarding process?
Hi @ccl-core ! Just a heads-up! Can we proceed?