the-pile
Scripts for deduplicating and filtering Common Crawl?
Hi,
I notice that the download URL for the CommonCrawlDataset is http://eaidata.bmk.sh/data/pile_cc_filtered_deduped.jsonl.zst, which suggests that this CC dataset is already deduplicated and filtered. However, the README of https://github.com/leogao2/commoncrawl_downloader does not seem to include the scripts for deduplication and filtering. Where can I find out exactly how deduplication and filtering for Pile CC were done?
Thanks!
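For reference, this is the kind of step I mean by "deduplication" — a minimal exact-match pass over a JSONL stream, using only the stdlib. This is purely illustrative; the actual Pile CC pipeline's (presumably fuzzier) method is exactly what I'm asking about.

```python
import hashlib
import json

def dedup_jsonl(lines):
    """Drop documents whose 'text' field is an exact duplicate.

    Illustrative sketch only: the real Pile CC pipeline likely does
    near-duplicate (fuzzy) matching, not exact hashing.
    """
    seen = set()
    for line in lines:
        doc = json.loads(line)
        digest = hashlib.sha256(doc["text"].encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = [
    '{"text": "hello world"}',
    '{"text": "hello world"}',
    '{"text": "another document"}',
]
print([d["text"] for d in dedup_jsonl(docs)])
# ['hello world', 'another document']
```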
Additional question: it seems that the_pile/pile.py
only downloads and interleaves the data from the various data sources. processing_scripts
contains many processing scripts, but how do we know which script is supposed to be run on which data source, and how are those scripts supposed to be run?
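By "interleave" I mean something like the following sketch: drawing each document from a randomly chosen source, weighted per source. This is my own minimal illustration, not pile.py's actual sampling logic, which may differ.

```python
import random

def interleave(sources, weights, n, seed=0):
    """Sample n documents, each drawn from a weighted random source.

    A hypothetical sketch of dataset interleaving; the function name,
    signature, and sampling scheme are assumptions, not pile.py's API.
    """
    rng = random.Random(seed)
    iters = [iter(s) for s in sources]
    weights = list(weights)  # copy so the caller's list isn't mutated
    out = []
    while len(out) < n and iters:
        i = rng.choices(range(len(iters)), weights=weights)[0]
        try:
            out.append(next(iters[i]))
        except StopIteration:
            # Source exhausted: drop it and its weight, keep sampling.
            del iters[i]
            del weights[i]
    return out

# Two toy "datasets" interleaved with equal weight.
mixed = interleave([[1, 2], [10, 20]], [1, 1], 4)
print(mixed)  # all four items, in a seed-dependent order
```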