data-preparation issues

You are welcome to try SailCraft, a data cleaning tool built upon this repository

We extend our gratitude to the authors of this repository! Your documentation and code have greatly benefited the community. We have used this repo in building the data processing pipeline...

longxudou

Why stopwords_min_cutoff rather than stopwords_max_cutoff?

Thanks for your helpful codebase! I am a bit confused about `stop words filtering`. The released code removes a document if its stop-word ratio is below a certain cutoff. https://github.com/bigscience-workshop/data-preparation/blob/9d0588419073cc5bf0fb92b58f37f2a1016572c3/preprocessing/training/01b_oscar_cleaning_and_filtering/filtering.py#L590...
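To make the question concrete, here is a minimal sketch of a minimum-cutoff stop-word filter. It is not the repository's implementation: the function names, the whitespace tokenization, and the tiny stop-word set are assumptions for illustration only. One plausible reading of a minimum (rather than maximum) cutoff is that natural prose contains many stop words, so a very low ratio may indicate keyword lists or other non-prose text.

```python
def stopword_ratio(text, stopwords):
    """Return the fraction of whitespace-separated tokens that are stop words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in stopwords for word in words) / len(words)


def keep_document(text, stopwords, stopwords_min_cutoff):
    """Keep a document only if its stop-word ratio reaches the minimum cutoff."""
    return stopword_ratio(text, stopwords) >= stopwords_min_cutoff


# Hypothetical stop-word set and cutoff, chosen only for this example.
STOPWORDS = {"the", "is", "on", "a", "of"}
CUTOFF = 0.3

prose = "the cat is on the mat"                  # 4 of 6 tokens are stop words
keywords = "buy cheap watches online discount"   # no stop words at all

print(keep_document(prose, STOPWORDS, CUTOFF))     # kept
print(keep_document(keywords, STOPWORDS, CUTOFF))  # removed
```

Under this sketch, raising the cutoff removes more documents, and a cutoff of 0 keeps everything.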

longxudou

Changing parameter values to extremes in parameters_filtering.py doesn't change the number of documents being removed

Hi! Kudos to the author for an end-to-end pipeline for cleaning and filtering a large corpus. I was working with [main_filtering.py](https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/main_filtering.py) and was trying to change the parameter values in...

dk-github-acc

author = "bigscience-catalogue-lm-data": no datasets found on Hugging Face for this author

When I run get_list_of_datasets.py, I get "Number of datasets 0". This result means that there is no data on Hugging Face under this author. How can I get the data? Thanks for...

belle9217

Can't find the Deduplication Report

Thanks for your amazing codebase! I find that the link to the [Deduplication Report](https://chenghaomou.github.io/1%20Projects/BigScience/SubProjects/Deduplication%20report) in `preprocessing/training/01b_oscar_cleaning_and_filtering/deduplicate/README.md` is not accessible. Could you please update it?

longxudou

Extending this codebase

I was looking at this codebase and encountered this bit: https://github.com/bigscience-workshop/data-preparation/tree/main/sourcing/code_dataset#code-dataset-sourcing ``` The query to create the dataset can be found in query.sql. After creation the dataset was preprocessed with...

chris-ha458

the version of simhash

Which version of simhash is used in the project, and why is the output of the simhash.find_all() method always an empty list?

wang9702

Mismatch of the Available Data Quantity on Huggingface

I recently tried to download the English part of Roots. According to the paper, there are 484,953,009,124 bytes of English data. However, after downloading all roots-related datasets on [huggingface](https://huggingface.co/datasets?language=language:en&sort=downloads&search=bigscience-data%2F) by...

cll-mtk