olm-datasets

CC data Language Splits

Open KeremTurgutlu opened this issue 1 year ago • 3 comments

Thanks a lot for putting this repo together and providing the fresh CC dumps on HF. I was looking for dataset splits for other languages but couldn't find a way to get them. Are the olm/olm-CC-MAIN-* datasets monolingual by chance?

KeremTurgutlu avatar Mar 02 '23 06:03 KeremTurgutlu

They have only processed and uploaded the en text from the WET files processed by the get_text_dataset_from_wet_downloads.py script. That script uses the fastText language identification model to identify each text's language and saves the text extracted from the WET files into per-language directories.
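The per-language split described above can be sketched roughly as follows. This is a minimal illustration, not the code from get_text_dataset_from_wet_downloads.py: the function name `split_by_language`, the `data.jsonl` file name, and the `identify` callable are all hypothetical. The only fastText detail assumed is that its language labels look like `__label__en` and that it expects single-line input.

```python
import json
from collections import defaultdict
from pathlib import Path

LABEL_PREFIX = "__label__"  # fastText language labels look like "__label__en"


def split_by_language(texts, identify, out_dir):
    """Group texts by predicted language and write one directory per language.

    `identify` is a hypothetical callable (text -> label string). With fastText
    it would wrap the language identification model's prediction; here it is
    injected so the sketch runs without downloading a model.
    Returns a dict mapping language id to the number of texts written.
    """
    out_dir = Path(out_dir)
    buckets = defaultdict(list)
    for text in texts:
        # fastText predicts on a single line, so newlines are flattened first.
        label = identify(text.replace("\n", " "))
        if label.startswith(LABEL_PREFIX):
            label = label[len(LABEL_PREFIX):]
        buckets[label].append(text)

    counts = {}
    for lang, items in buckets.items():
        lang_dir = out_dir / lang
        lang_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical output layout: <out_dir>/<lang>/data.jsonl
        with open(lang_dir / "data.jsonl", "w", encoding="utf-8") as f:
            for item in items:
                f.write(json.dumps({"text": item}, ensure_ascii=False) + "\n")
        counts[lang] = len(items)
    return counts
```

With a layout like this, every detected language ends up in its own directory, so publishing only en would be a choice made in a later step rather than in the split itself.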

spate141 avatar Mar 02 '23 14:03 spate141

Thanks for the reply. I was just surprised not to see other languages, since they are already processed in the code as you mentioned. I couldn't find any language-specific filter in this code. Also, it looks like all language IDs are uploaded here: https://github.com/huggingface/olm-datasets/blob/535c2c9250539cf3277d74e2ff664ba98c1ca033/pipeline_scripts/common_crawl/get_text_dataset_from_wet_downloads.py#L96. Maybe they later decided to manually upload only en?

It actually gets filtered in the Bloom filter stage, where there is a lang id arg. So I assume the earlier uploads of all languages (in the first stage) were not public.

@TristanThrush would it be possible to make other language splits public if they are readily available? If not would it be possible in future snapshot jobs? Thanks✌️

KeremTurgutlu avatar Mar 02 '23 18:03 KeremTurgutlu

Yes, there are no filters in that script. It generates files for the other languages (120+) in their respective directories. I'm not sure why they were not made available. Maybe because you would still need to apply the Bloom filter to them, which is resource-intensive.
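To make the cost of that later stage concrete, here is a minimal sketch of a Bloom filter pass that takes a language id. It is not the repo's implementation. It assumes the stage keeps one language and drops texts it has already seen, and the names `BloomFilter`, `filter_language`, and the `lang`/`text` record fields are hypothetical.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def filter_language(records, lang_id):
    """Keep records of one language, skipping texts the filter has seen."""
    seen = BloomFilter()
    kept = []
    for rec in records:
        if rec["lang"] != lang_id:
            continue  # other languages are dropped at this stage
        if rec["text"] in seen:
            continue  # probably a duplicate (or a rare false positive)
        seen.add(rec["text"])
        kept.append(rec)
    return kept
```

Because every text has to be hashed and checked, running a pass like this for each of the 120+ languages over a full snapshot would add up quickly, which fits the guess above about why only en was published.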

spate141 avatar Mar 03 '23 15:03 spate141