cc_net
Inquiries about the Korean data used in the CCNet pipeline
While studying data pipelines, I came across CCNet, and I find it very intriguing. I plan to use CCNet to build a better data pipeline for Korean datasets, and I have a few questions. The paper states that CCNet was evaluated on the "Feb. 2019 snapshot of Common Crawl". How much Korean data is in that snapshot? The dataset sizes in Table 6 of the paper are given after preprocessing; does that preprocessing consist only of deduplication? I am also curious about the size of the Korean dataset before preprocessing. If you could share it, it would be of great help to my research using CCNet.
Hi, the processing pipeline is mostly: Common Crawl -> deduplication at the paragraph level -> language detection -> optional LM-based filtering. In the case of Korean, we trained an LM on Korean Wikipedia and kept only the top 30% of text according to that LM. The final amount of "clean" Korean data is reported in the paper.
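For illustration, here is a minimal sketch of those three stages applied to a single paragraph. It assumes fastText for language identification and KenLM for perplexity scoring; the model file names (`lid.176.bin`, `ko.wikipedia.arpa.bin`) and the calibration of the perplexity cutoff are hypothetical placeholders, not the exact artifacts or procedure used by cc_net.

```python
# Sketch of the cc_net-style filtering stages: paragraph-level dedup,
# language ID, and LM-based quality filtering. Model paths are assumptions.
import hashlib

import fasttext  # language-ID model, e.g. the public lid.176.bin
import kenlm     # n-gram LM scorer

lid = fasttext.load_model("lid.176.bin")     # fastText language-ID model
lm = kenlm.Model("ko.wikipedia.arpa.bin")    # hypothetical Korean Wikipedia LM

seen_hashes = set()

def keep_paragraph(paragraph: str, perplexity_cutoff: float) -> bool:
    # 1) Deduplication at the paragraph level via a hash of normalized text.
    h = hashlib.sha1(paragraph.strip().lower().encode("utf-8")).digest()
    if h in seen_hashes:
        return False
    seen_hashes.add(h)

    # 2) Language detection: keep only paragraphs identified as Korean.
    #    (fastText predict expects single-line input, hence the replace.)
    labels, _ = lid.predict(paragraph.replace("\n", " "))
    if labels[0] != "__label__ko":
        return False

    # 3) LM-based filtering: keep text whose perplexity under the
    #    Wikipedia LM is below a cutoff. To keep "the top 30%", the
    #    cutoff must be calibrated on the corpus beforehand.
    return lm.perplexity(paragraph) <= perplexity_cutoff
```

In practice the cutoff would be chosen by scoring a sample of the corpus first and taking the 30th percentile of perplexities, so that exactly the best-scoring 30% of text survives.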