Added BrWac dataset

Open marcospiau opened this issue 3 years ago • 9 comments

Thank you for your contribution!

Please read https://www.tensorflow.org/datasets/contribute#pr_checklist to make sure your PR follows the guidelines.

Add Dataset

  • Dataset Name: BrWac
  • Issue Reference: 3642
  • dataset_info.json Gist: https://gist.github.com/marcospiau/58e9b3528288b2fbc419dd86ec1f18b7#file-brwac_dataset_info-json

Description

The Brazilian Portuguese Web as Corpus is a large corpus constructed following the Wacky framework, which was made public for research purposes. The current corpus version, released in January 2017, is composed of 3.53 million documents and 2.68 billion tokens. In order to use this dataset, you must request access by filling out the form on the official homepage. Please note that this resource is available solely for academic research purposes, and you agree not to use it for any commercial applications.

Title and text fields are preprocessed using the ftfy (Speer, 2019) Python library. PS: the description is extracted from the official homepage.

Checklist

  • [X] Address all TODO's
  • [X] Add alphabetized import to subdirectory's __init__.py
  • [ ] Run download_and_prepare successfully
  • [ ] Add checksums file
  • [X] Properly cite in BibTeX format
  • [X] Add passing test(s)
  • [X] Add test data
  • [X] If using additional dependencies (e.g. scipy), use lazy_imports (if applicable)
  • [ ] Add data generation script (if applicable)
  • [X] Lint code

marcospiau avatar Apr 12 '22 03:04 marcospiau

Regarding unchecked boxes in the checklist:

  • This dataset uses a manually downloaded file, and I was not able to generate the checksums file, even though I ran tfds build with the --register_checksums flag.
  • I didn't run download_and_prepare, but I built the dataset successfully from the command line using tfds build brwac --register_checksums --manual_dir=<DIRECTORY_WITH_MANUAL_DOWNLOAD>

marcospiau avatar Apr 12 '22 13:04 marcospiau

Hello @marcospiau , and thank you for your contribution!

Datasets with manually downloaded files do not need checksum files.

We would be happy to merge your PR -- could you please resolve the conflict in the setup.py file before that? This typically requires running:

# On your feature branch
git fetch origin master
git rebase origin/master

ccl-core avatar May 11 '22 08:05 ccl-core

Hi guys, thanks for reviewing the code!

I've resolved the conflicts in setup.py. Please let me know if there is anything else I can help with.

Best, Marcos

marcospiau avatar May 19 '22 19:05 marcospiau

Hi @ccl-core , could you please take a look and confirm everything is OK?

Best, Marcos

marcospiau avatar May 30 '22 18:05 marcospiau

Hello @marcospiau , thank you for the heads-up!

I requested the manual dataset at the given homepage, but I received an email saying that their server is down. Did you also encounter a similar problem? As soon as I can access the data to put in the manual_dir, I'll finish the testing and complete the onboarding process.

ccl-core avatar Jun 29 '22 14:06 ccl-core

Hi @ccl-core,

I tested the form request just now and had the same problem. I will contact the dataset maintainers and get back to you as soon as I have an answer.

Best, Marcos

marcospiau avatar Jun 29 '22 16:06 marcospiau

Hi @ccl-core, I spoke to them: the servers are being migrated and should be OK by the end of next week. I'll let you know as soon as the manual download is working again. Best, Marcos

marcospiau avatar Jul 05 '22 01:07 marcospiau

Thank you very much, @marcospiau !

ccl-core avatar Jul 06 '22 08:07 ccl-core

Hi, @ccl-core. It took a little longer than expected, but the new links are already working. Could you please check if everything is OK now? PS.: the links are different from the previous ones, so the form needs to be filled out again.

Best, Marcos

marcospiau avatar Sep 14 '22 22:09 marcospiau

Thank you @marcospiau , I'm having a look!

I was wondering whether the dependency on ftfy is really necessary here? Or would there be a workaround?

ccl-core avatar Sep 29 '22 14:09 ccl-core

Hi, @ccl-core. Thanks for the review! This dependency is included because the raw text contains many errors due to mojibake. One could write code to replicate what this dependency does, but the final code would probably be very similar to ftfy; besides, the few existing large language models pretrained in Portuguese use BrWac with ftfy preprocessing, so I think it's a good idea to use it as default preprocessing. What do you think?
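For readers unfamiliar with mojibake, here is a minimal standard-library sketch of the failure mode, assuming the common case where UTF-8 bytes were wrongly decoded as Latin-1. It is not ftfy and does not reproduce ftfy's behavior; the helper names `make_mojibake` and `naive_repair` are hypothetical and handle only this single case, whereas ftfy covers many more.

```python
def make_mojibake(text: str) -> str:
    """Simulate the corruption: UTF-8 bytes wrongly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def naive_repair(text: str) -> str:
    """Undo that one specific mistake (hypothetical helper, not ftfy).

    If the text cannot be round-tripped this way, it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "coração"
garbled = make_mojibake(original)   # what broken web text can look like
repaired = naive_repair(garbled)    # back to the original string

print(garbled)
print(repaired)
```

A one-off round trip like this breaks down on mixed or multiply-encoded text, which is the kind of situation a dedicated library is meant to handle.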

marcospiau avatar Sep 29 '22 15:09 marcospiau

Dear @marcospiau , I understand.

By the way, it seems the checksums.tsv file is still missing. See tensorflow_datasets/text/bool_q/checksums.tsv as an example.

You can register the new checksums with tfds build --register_checksums

Thank you!

ccl-core avatar Sep 30 '22 12:09 ccl-core

Hi @ccl-core! The link for downloading this dataset is available once a form is filled out, so a manual download is required. Is it possible to generate checksums for manually downloaded files?
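Independently of what tfds build --register_checksums records, the size and SHA-256 digest of a manually downloaded file can be computed with the Python standard library alone. The sketch below is an illustration under that assumption; `file_fingerprint` is a hypothetical helper, not a TFDS API, and its output is not claimed to match the checksums.tsv layout.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Tuple


def file_fingerprint(path: Path, chunk_size: int = 1 << 20) -> Tuple[int, str]:
    """Return (size in bytes, SHA-256 hex digest), reading the file in chunks."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as f:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


# Demo on a small temporary file standing in for the manual download.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.bin"
    demo_path.write_bytes(b"hello")
    demo_size, demo_sha256 = file_fingerprint(demo_path)

print(demo_size, demo_sha256)
```

Comparing these two values across machines is a simple way to confirm that two people are working from the same manually downloaded file.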

marcospiau avatar Sep 30 '22 12:09 marcospiau

Hi @marcospiau ! Yes, it is possible :)

See e.g. the kaggle_wit dataset as an example: https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/vision_language/wit_kaggle

ccl-core avatar Sep 30 '22 13:09 ccl-core

Hi @ccl-core! I tried using the command provided, but only got an empty checksums.tsv file. Also, the example file you provided is an empty file (kaggle_wit). Can I manually create a checksums.tsv file?

I don't know if the instructions were different when I first submitted my PR, but I was informed that checksum files are not required for manually downloaded datasets.

> Hello @marcospiau , and thank you for your contribution!
>
> Datasets with manually downloaded files do not need checksum files.
>
> We would be happy to merge your PR -- could you please resolve the conflict in the setup.py file before that? This typically requires running:
>
> # On your feature branch
> git fetch origin master
> git rebase origin/master

marcospiau avatar Sep 30 '22 16:09 marcospiau

Hi @ccl-core , just a heads-up! Can we proceed with the onboarding process?

marcospiau avatar Oct 17 '22 03:10 marcospiau

Hi @ccl-core ! Just a heads up! Can we proceed?

marcospiau avatar Nov 29 '22 02:11 marcospiau