Add a Benchmark for Asian Languages
Linguistic Families and Proposed Languages:
East Asian Languages
- [x] Chinese (Mandarin) - cmn
- [x] Cantonese - yue (#370)
- [x] Japanese - jpn
- [x] Korean - kor
- [ ] Mongolian - mon
South Asian Languages
- Indic Languages:
  - [x] Hindi - hin
  - [x] Bengali - ben
  - [x] Punjabi - pan
  - [x] Marathi - mar
  - [x] Gujarati - guj
  - [x] Urdu - urd
  - [x] Nepali - nep
  - [x] Sinhala - sin
  - [x] Tamil - tam
  - [x] Telugu - tel
  - [x] Kannada - kan
  - [x] Malayalam - mal
- Dravidian Languages:
  - Included above (Tamil, Telugu, Kannada, Malayalam)
Southeast Asian Languages
- Austronesian Languages:
  - [x] Indonesian - ind
  - [x] Filipino - fil (#472)
  - [ ] Malay - msa
  - [x] Javanese - jav
- Tai-Kadai Languages:
  - [x] Thai - tha
  - [x] Lao - lao
- Austroasiatic Languages:
  - [x] Vietnamese - vie (see #364)
  - [x] Khmer - khm
  - [ ] Burmese - mya
Central Asian Languages
- Turkic Languages:
  - [x] Kazakh - kaz
  - [ ] Uzbek - uzb
  - [ ] Turkmen - tkm
  - [x] Kyrgyz - kir
  - [x] Uighur - uig
West Asian (Middle Eastern) Languages
- Semitic Languages:
  - [x] Arabic - ara
  - [x] Hebrew - heb
- Iranian Languages:
  - [x] Persian - fas
  - [x] Kurdish - kur
  - [ ] Pashto - pus
  - [x] Dari - prs
Note that this list does not claim to be comprehensive; feel free to add to it.
I will take a stab at a Bengali benchmark together with a colleague of mine 👍
Wonderful, @rasdani! Feel free to create an issue for this as well, so that others can see that you are working on it.
I created PRs for Indonesian languages (10+ additions from 2 corpora) and African languages. Once they are approved, I can add the languages to the list.