Code used for sourcing and cleaning the BigScience ROOTS corpus

Results 9 data-preparation issues

We extend our gratitude to the authors of this repository! Your documentation and code have greatly benefited the community. We have used this repo in building the data processing pipeline...

Thanks for your helpful codebase! I am a bit confused about `stop words filtering`. The released code removes a document if its stop-word ratio is below a certain cutoff. https://github.com/bigscience-workshop/data-preparation/blob/9d0588419073cc5bf0fb92b58f37f2a1016572c3/preprocessing/training/01b_oscar_cleaning_and_filtering/filtering.py#L590...
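
For readers unfamiliar with the filter discussed in this issue, here is a minimal sketch of the general idea: a document with too few stop words is treated as unlikely to be natural-language text and is dropped. This is not the repository's code. The function names, the whitespace tokenization, the tiny stop-word set, and the choice to keep documents exactly at the cutoff are all assumptions made for illustration.

```python
def stop_word_ratio(text, stop_words):
    """Fraction of whitespace-separated tokens that are stop words (hypothetical helper)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in stop_words) / len(words)


def keep_document(text, stop_words, cutoff):
    """Keep the document only if its stop-word ratio reaches the cutoff.

    Assumption: a ratio equal to the cutoff is kept; only lower ratios are removed.
    """
    return stop_word_ratio(text, stop_words) >= cutoff


# Illustrative stop-word set and cutoff, not values taken from the repository.
STOP_WORDS = {"the", "a", "of", "is", "and"}
CUTOFF = 0.3

natural = "the cat is on the mat"
spammy = "buy cheap pills online now"
```

With these illustrative values, `keep_document(natural, STOP_WORDS, CUTOFF)` keeps the first text, while the keyword-only second text has no stop words and is removed.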

Hi! Kudos to the author for an end-to-end pipeline for cleaning and filtering a large corpus. I was working with [main_filtering.py](https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/main_filtering.py) and was trying to change the parameter values in...

When I run get_list_of_datasets.py, I get "Number of datasets 0". Does this result mean that there is no data on Hugging Face under this author? How can I get the data? Thanks for...

Thanks for your amazing codebase! I find that the link to the [Deduplication Report](https://chenghaomou.github.io/1%20Projects/BigScience/SubProjects/Deduplication%20report) in `preprocessing/training/01b_oscar_cleaning_and_filtering/deduplicate/README.md` is not accessible. Could you please update it?

I was looking at this codebase and encountered this bit: https://github.com/bigscience-workshop/data-preparation/tree/main/sourcing/code_dataset#code-dataset-sourcing ``` The query to create the dataset can be found in query.sql. After creation the dataset was preprocessed with...

Which version of simhash is used in the project, and why is the output of the simhash.find_all() method always an empty list?

I recently tried to download the English part of ROOTS. According to the paper, there are 484,953,009,124 bytes of English data. However, after downloading all ROOTS-related datasets on [huggingface](https://huggingface.co/datasets?language=language:en&sort=downloads&search=bigscience-data%2F) by...

We need to write a set of READMEs describing how we used each tool.