lighteval [EVAL] Add TUMLU benchmark

Hello! We just released TUMLU, a benchmark for Turkic languages. Would it make sense for me to add it to lighteval?

Evaluation short description

Why is this evaluation interesting? First native-language MMLU benchmark for low-resource Turkic languages.
How is it used in the community? Just released; it consists of multiple-choice high-school exam questions.

Evaluation metadata

Provide all available

Paper url: https://arxiv.org/abs/2502.11020
Github url: https://github.com/ceferisbarov/TUMLU
Dataset url:

Feb 19 '25 15:02 gaydmi

cc @hynky1999, this could interest you, I feel!

Feb 19 '25 16:02 clefourrier

Is the dataset already on Hugging Face?

Feb 19 '25 16:02 clefourrier

@clefourrier Not really (only in gated repos), but everything is already on GitHub.

Feb 19 '25 16:02 gaydmi

Gated sounds fine, can you share the path?

Feb 19 '25 16:02 clefourrier

Hi, I think it would be a very nice addition. We already have TurkishMMLU (which I think is also part of your dataset, right?)

See https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/multilingual/tasks.py#L2133

To add it, we would need the following:

Have translation literals for the languages you want to add (https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/multilingual/tasks.py#L2133)
Add the dataset to the Hub
Replace TurkishMMLU with your dataset

Do you think you could do that? cc @gaydmi

Feb 21 '25 15:02 hynky1999

@gaydmi Thank you for bringing this up!

@hynky1999 I have a question. Our dataset can be split into subsets in three ways: (a) make each language a subset, (b) make each subject a subset, (c) make each language-subject combination a subset. Which one would you suggest? I could not find any similar examples in the repo.

Feb 23 '25 08:02 ceferisbarov

@hynky1999 Hi, yes, working on it! @ceferisbarov I personally think option (c) is the best, so we could just add new languages with their tasks, like here: https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/multilingual/tasks.py#L2617

Feb 24 '25 22:02 gaydmi

I would say ideally use one subset per language and then add a column to identify the actual task subset (the subject). You can then use the hf_filter arg on the task to select it.
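A minimal sketch of that layout, assuming each language subset carries a `subject` column (a hypothetical name, not confirmed for the TUMLU dataset). It builds a row predicate of the form `dict -> bool`, which is presumably the kind of callable `hf_filter` expects; check the task config definition in lighteval before relying on that. The toy rows below stand in for one language subset and are made up for illustration.

```python
from typing import Callable


def make_subject_filter(subject: str, column: str = "subject") -> Callable[[dict], bool]:
    """Return a predicate that keeps only rows whose `column` equals `subject`.

    `column` defaults to "subject", which is an assumed column name.
    """

    def keep(row: dict) -> bool:
        # Rows missing the column are dropped rather than raising KeyError.
        return row.get(column) == subject

    return keep


# Toy rows standing in for a single language subset (illustrative data only).
rows = [
    {"question": "q1", "subject": "history"},
    {"question": "q2", "subject": "physics"},
    {"question": "q3", "subject": "history"},
]

# One predicate per subject; each could back its own task for the same language subset.
history_filter = make_subject_filter("history")
history_only = [row for row in rows if history_filter(row)]
print([row["question"] for row in history_only])
```

With this shape, adding a new subject means creating one more predicate rather than one more dataset subset, so the number of subsets on the Hub stays equal to the number of languages.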

Feb 24 '25 22:02 hynky1999

Both options sound good to me. I have added the dataset to Hugging Face:

https://huggingface.co/datasets/jafarisbarov/TUMLU-mini

@gaydmi let me know if I can help in any other way.

Feb 25 '25 20:02 ceferisbarov

Awesome, cc @gaydmi happy to review the PR once ready

Feb 26 '25 12:02 hynky1999