
Inquiries about Korean datasets utilized in the CCNet pipeline

Open · hyunmokky opened this issue 1 year ago · 1 comment

While studying data pipelines, I came across CCNet and find it very intriguing. I'm planning to use CCNet to build a better data pipeline for Korean datasets, and I have a few questions. The paper states that the study was conducted on the "Feb. 2019 snapshot of Common Crawl"; how much Korean data does that snapshot contain? Also, the dataset sizes in Table 6 are given after preprocessing. Is that preprocessing only deduplication? Finally, I'm curious about the size of the Korean dataset before preprocessing. If you could share these numbers, it would be a great help to my research with CCNet.

hyunmokky · Mar 08 '23 04:03

Hi, the processing pipeline is mostly: CommonCrawl -> deduplication at the paragraph level -> language detection -> optional LM-based filtering. In the case of Korean, we trained an LM on Korean Wikipedia and kept only the top 30% of text according to this LM. The final amount of "clean" Korean data is reported in the paper.
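
For illustration, here is a minimal sketch of what that last filtering step looks like: score each document with a KenLM model trained on Korean Wikipedia and keep only documents whose perplexity falls below a cutoff. The model/tokenizer paths and the `CUTOFF` value below are placeholders, not the actual cc_net artifacts or published thresholds.

```python
import kenlm                 # pip install kenlm
import sentencepiece as spm  # pip install sentencepiece

# Placeholder paths: a KenLM model and SentencePiece tokenizer trained on
# Korean Wikipedia (cc_net provides scripts to download its real models).
lm = kenlm.Model("data/lm_sp/ko.arpa.bin")
sp = spm.SentencePieceProcessor(model_file="data/lm_sp/ko.sp.model")

def doc_perplexity(text: str) -> float:
    """Perplexity of a document under the Wikipedia LM (lower = more Wikipedia-like)."""
    log10_prob, n_tokens = 0.0, 0
    for line in text.split("\n"):
        if not line.strip():
            continue
        pieces = sp.encode(line, out_type=str)
        log10_prob += lm.score(" ".join(pieces))  # log10 p(line), incl. BOS/EOS
        n_tokens += len(pieces) + 1               # +1 for the end-of-sentence token
    return 10.0 ** (-log10_prob / max(n_tokens, 1))

# Illustrative cutoff only: in practice the threshold is chosen per language
# so that roughly the best-scoring 30% of documents pass.
CUTOFF = 1500.0

def keep(document_text: str) -> bool:
    return doc_perplexity(document_text) < CUTOFF
```

Note that in the actual pipeline the perplexity thresholds define head/middle/tail buckets rather than a single keep/drop decision; keeping the "head" bucket corresponds to the top ~30% mentioned above.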

gwenzek · Mar 08 '23 08:03