olm-datasets
CC data Language Splits
Thanks a lot for putting this repo together and providing the fresh CC dumps on HF. I was looking for a way to find dataset splits for other languages but couldn't find one. Are the olm/olm-CC-MAIN-* datasets monolingual, by chance?
They have only processed and uploaded the en text extracted from the WET files by the get_text_dataset_from_wet_downloads.py script. That script uses the fastText language identification model to detect each text's language and saves the text extracted from the WET files into per-language directories.
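The routing logic described above can be sketched roughly as follows. This is a simplified illustration, not the actual pipeline code: the real script loads a fastText model (such as lid.176.bin) and calls its predict method, but here language identification is stubbed with a toy word-list check so the example runs without the model file. All names (`identify_language`, `route_records`, the output layout) are hypothetical.

```python
import os

def identify_language(text: str) -> str:
    # Toy stand-in for fastText language ID: the real pipeline would call
    # something like model.predict(text) on a loaded lid.176.bin model.
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return "en"
    if words & {"der", "und", "ist"}:
        return "de"
    return "unknown"

def route_records(records, out_dir="wet_text_by_lang"):
    """Append each record's text to a file under its detected language's
    directory, mirroring the per-language layout described above."""
    for text in records:
        lang = identify_language(text)
        lang_dir = os.path.join(out_dir, lang)
        os.makedirs(lang_dir, exist_ok=True)
        path = os.path.join(lang_dir, "part-0000.txt")
        with open(path, "a", encoding="utf-8") as f:
            # One document per line, newlines flattened.
            f.write(text.replace("\n", " ") + "\n")
```

With this layout, an English-only release simply means uploading only the `en/` directory.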
Thanks for the reply. I was just surprised not to see other languages, since they are already processed in the code as you mentioned. I couldn't find any language-specific filter in this code. Also, it looks like all language IDs are uploaded here: https://github.com/huggingface/olm-datasets/blob/535c2c9250539cf3277d74e2ff664ba98c1ca033/pipeline_scripts/common_crawl/get_text_dataset_from_wet_downloads.py#L96. Maybe they later decided to manually upload only en?
It actually gets filtered in the bloom filter stage, where there is a lang id arg. So I assume the earlier uploads of all languages (from the first stage) were never made public.
@TristanThrush would it be possible to make the other language splits public if they are readily available? If not, would it be possible in future snapshot jobs? Thanks ✌️
Yes, there are no filters in that script. Files for the other languages (120+) are generated by it into their respective directories. I'm not sure why they were not made available; maybe because they still require applying the bloom filter, which is resource-intensive.