RedPajama-Data

Other language data

Open Dzg0309 opened this issue 1 year ago • 4 comments

Thank you very much for your work in providing such rich data to the open source community. I was wondering if there are any plans to release data in other languages, such as Chinese? I think many people also need Chinese data.

Dzg0309 avatar Dec 22 '23 00:12 Dzg0309

Hi @Dzg0309 -- currently we don't have plans to release data in other languages. However, if you want to create such a dataset (e.g. in Chinese), you can use the CCNet pipeline and the scripts in this repo to compute quality signals and deduplicate the corpus. Note that in other languages you will likely have to adapt the quality signals.
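To illustrate what adapting the quality signals could look like, here is a minimal sketch. It is not the code from this repo or from CCNet. The function names (`quality_signals`, `content_hash`, `deduplicate`) and the two signals are hypothetical, chosen only to show that whitespace-based word counts do not carry over to Chinese and need character-based replacements. The deduplication shown is a simple exact-match hash, assumed here for brevity.

```python
import hashlib
import unicodedata


def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def quality_signals(text: str) -> dict:
    """Compute two toy, character-based signals (hypothetical names).

    Chinese text has no spaces between words, so these signals count
    characters instead of whitespace-separated tokens.
    """
    chars = [c for c in text if not c.isspace()]
    n_chars = len(chars)
    n_cjk = sum(1 for c in chars if is_cjk(c))
    return {
        "char_count": n_chars,
        "cjk_fraction": n_cjk / n_chars if n_chars else 0.0,
    }


def content_hash(text: str) -> str:
    """Hash a document after NFKC normalization and whitespace removal.

    Dropping all whitespace is an assumption that suits Chinese text;
    it would merge words in space-delimited languages.
    """
    normalized = unicodedata.normalize("NFKC", text)
    normalized = "".join(normalized.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each exactly duplicated document."""
    seen = set()
    kept = []
    for doc in docs:
        h = content_hash(doc)
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept
```

For example, `deduplicate(["你好 世界", "你好世界", "hello"])` keeps the first and third documents, since the first two differ only in whitespace. A real pipeline would likely also need near-duplicate detection and language-specific thresholds on top of this.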

mauriceweber avatar Jan 05 '24 08:01 mauriceweber

Thank you very much for your reply. It is very difficult for us to filter Chinese data out of the original large-scale CommonCrawl because we cannot handle such large CC dumps. Is there a way to obtain data that is already split by language, such as raw Chinese data? With that, we could process it and generate Chinese data using CCNet and the library you provided.

Dzg0309 avatar Jan 09 '24 09:01 Dzg0309

@mauriceweber I am a faculty member at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. I am about to kick off a project to apply these workflows to prepare an Arabic language subset, with the goal of contributing it to the next version of this dataset. Would there be interest in collaborating on this project? We have the technical skills and plenty of compute, so what we really need is general guidance if we get stuck.

@Dzg0309 depending on how many resources we need to prepare the Arabic data, we may also be able to prepare the data for other languages.

davidrpugh avatar May 15 '24 09:05 davidrpugh

Hi @davidrpugh, awesome to hear that! I'm happy to provide any guidance you need and am open to collaborating on this! :)

mauriceweber avatar May 16 '24 11:05 mauriceweber