
Scripts for dedup and filter Common Crawl?

Open · shangw-nvidia opened this issue 3 years ago · 1 comment

Hi,

I noticed that the download URL for the CommonCrawlDataset is http://eaidata.bmk.sh/data/pile_cc_filtered_deduped.jsonl.zst. In other words, is this CC dataset already deduplicated and filtered? However, the README of https://github.com/leogao2/commoncrawl_downloader doesn't seem to include the scripts for deduplication and filtering. I'm wondering where I can find out exactly how deduplication and filtering for Pile CC were done?
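(For context, and not taken from the repo: the Pile CC pipeline's actual dedup method isn't shown here, but document-level deduplication over a `.jsonl` dump is often done by hashing each document's text and keeping only the first occurrence. A minimal illustrative sketch, assuming each JSON line has a `"text"` field, exact-match dedup only:)

```python
import hashlib
import json

def dedup_jsonl_lines(lines):
    """Yield only the first occurrence of each document, keyed by a
    SHA-256 digest of the 'text' field (exact duplicates only -- real
    pipelines typically also do fuzzy dedup, e.g. MinHash)."""
    seen = set()
    for line in lines:
        doc = json.loads(line)
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield line

# Tiny usage example with a hypothetical three-document corpus:
docs = [
    json.dumps({"text": "hello world"}),
    json.dumps({"text": "hello world"}),     # exact duplicate, dropped
    json.dumps({"text": "something else"}),
]
unique = list(dedup_jsonl_lines(docs))
print(len(unique))  # 2
```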

Thanks!

shangw-nvidia avatar Feb 24 '22 20:02 shangw-nvidia

Additional question: it seems that the_pile/pile.py only downloads and interleaves the data from the various data sources. processing_scripts contains many processing scripts, but how do we know which script is supposed to be run on which data source, and how are those scripts supposed to be run?
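(Again for context, not the repo's actual implementation: "interleaving" multiple sources usually means sampling each next document from a source chosen with probability proportional to a per-source weight. A hedged sketch with hypothetical dataset names and weights:)

```python
import random

def interleave(datasets, weights, n, seed=0):
    """Draw up to n documents, picking a source for each draw with
    probability proportional to its weight; exhausted sources are
    dropped from the pool."""
    rng = random.Random(seed)
    iters = {name: iter(ds) for name, ds in datasets.items()}
    out = []
    while len(out) < n and iters:
        active = list(iters)
        pick = rng.choices(active, weights=[weights[a] for a in active], k=1)[0]
        try:
            out.append(next(iters[pick]))
        except StopIteration:
            del iters[pick]  # source exhausted; stop sampling from it
    return out

# Hypothetical two-source mix, CC weighted 2x over wiki:
mix = interleave({"cc": ["c1", "c2"], "wiki": ["w1"]},
                 {"cc": 2.0, "wiki": 1.0}, n=3)
print(mix)
```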

shangw-nvidia avatar Feb 24 '22 20:02 shangw-nvidia