cc_net

Tools to download and cleanup Common Crawl data

Results 27 cc_net issues

Running `python3 -m cc_net --config config/test_segment.json` finally prints: Regrouped test_data3/mined_by_lang/2019-09/en_head_0000.json.gz (1 / 3) Regrouped test_data3/mined_by_lang/2019-09/en_tail_0000.json.gz (2 / 3) Regrouped test_data3/mined_by_lang/2019-09/en_middle_0000.json.gz (3 / 3) but the JSON files are not cleaned-up documents, they are:...

I want to crawl the latest 2023-06 snapshot data, how do I configure my stats.json? I notice that the json file has two tags, size and checksum. How do I...

In the paper, it is stated that CCNet conducted the study with the "common crawl snapshot in February 2019" dataset. I want to use the Common Crawl data snapshots collected...

While studying data pipelines, I found CCNet. CCNet is very intriguing to me. I'm going to use CCNet to create a better data pipeline for Korean datasets. I have a...

Traceback (most recent call last): File "E:\odoo 16\odoo source code\16.0\PyPDF2\_utils.py", line 53, in from typing import TypeAlias # type: ignore[attr-defined] ImportError: cannot import name 'TypeAlias' from 'typing' (C:\Users\HP\AppData\Local\Programs\Python\Python39\lib\typing.py) During handling...

Hi, when I run "python -m cc_net", I get this error: Submitting _hashes_shard in a job array (1600 jobs) sbatch: error: Batch job submission failed: Invalid job array specification subprocess.CalledProcessError: Command...

Hi there, I encountered a 403 error while trying to download ccnet data using this pipeline. Wondering if this is because of the network settings on my side or is there...

When running the full pipeline with the newest dumps (e.g. 2020-34), there seems to be an issue with the header file format. It only seems to occur on texts with...

Hello, I noticed that the hash files I've produced from the dump of January 21 (and several other months in 2020) are much smaller (x100) than hashes from the dump of...

The dev branch fails to start the `_mine_shard` stage: it times out and raises the following exception even with `parallelism=1`: ``` python3 -m cc_net --config reproduce --dump 2019-09 --task_parallelism 1 Will run cc_net.mine.main with...