RedPajama-Data

Languages

Open firqaaa opened this issue 2 years ago • 4 comments

Can you mention which languages are covered in this dataset? According to arXiv:2302.13971v1, LLaMA only covers these languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. Is it possible to add some low-resource languages, such as Indonesian? Thanks.

firqaaa avatar Apr 19 '23 05:04 firqaaa

Same question.

kugwzk avatar Apr 19 '23 07:04 kugwzk

That's correct, we cover the same set of languages, and these come from the Wikipedia slice of the dataset. We will add support for more languages in the future (including low-resource ones).

mauriceweber avatar Apr 19 '23 10:04 mauriceweber

Does the dataset currently contain Chinese resources?

hnnw avatar Apr 20 '23 03:04 hnnw

No, currently the dataset contains only the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We are planning to add support for more languages in the future.

mauriceweber avatar Apr 20 '23 06:04 mauriceweber
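
For readers who want to check a language code against the list quoted in this thread, here is a minimal sketch. The names `COVERED_LANGUAGES` and `is_covered` are hypothetical and are not part of the RedPajama-Data repository. The set only mirrors the codes posted above, so it may be out of date if more languages have been added since.

```python
# Two-letter language codes as listed in this thread (an assumption:
# copied from the comments above, not read from the dataset itself).
COVERED_LANGUAGES = frozenset({
    "bg", "ca", "cs", "da", "de", "en", "es", "fr", "hr", "hu",
    "it", "nl", "pl", "pt", "ro", "ru", "sl", "sr", "sv", "uk",
})


def is_covered(lang_code: str) -> bool:
    """Return True if lang_code is in the list quoted in the thread.

    The comparison ignores case and surrounding whitespace.
    """
    return lang_code.strip().lower() in COVERED_LANGUAGES


# Check the codes asked about in the thread: Indonesian ("id") and Chinese ("zh").
for code in ("en", "id", "zh"):
    print(code, is_covered(code))
```

Under these assumptions, `id` and `zh` both come back as not covered, which matches the answers given in the thread.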