Arabic support for MBZUAI-arabic-mmlu
Add Support for ArabicMMLU Evaluation Task
Overview
This PR introduces a new evaluation task for Arabic-language LLMs using the ArabicMMLU dataset, as detailed in the paper "ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic". The dataset provides a comprehensive benchmark for evaluating LLM performance on a wide range of tasks in Arabic.
Related Work
- ArabicMMLU Dataset: MBZUAI/ArabicMMLU on Hugging Face
- Paper: ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
Notes
This contribution aligns with the ongoing efforts to expand lighteval's support for diverse languages and tasks. Feedback and suggestions are welcome!
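For reviewers, here is a minimal sketch of the kind of prompt function such a task needs; the column names ("Question", "Option 1"…"Option 5", "Answer Key") are assumptions about the ArabicMMLU schema, not verified field names from this PR:

```python
# Minimal sketch, not the exact code in this PR. Column names are
# assumptions about the ArabicMMLU schema.
from lighteval.tasks.requests import Doc

def arabic_mmlu_prompt(line: dict, task_name: str = None) -> Doc:
    # Collect the non-empty options for this row (questions may have 2-5).
    choices = [line[f"Option {i}"] for i in range(1, 6) if line.get(f"Option {i}")]
    # Map the letter answer key (A-E) to an index into `choices`.
    gold_index = "ABCDE".index(line["Answer Key"])
    return Doc(
        task_name=task_name,
        query=f"سؤال: {line['Question']}\nإجابة:",
        choices=choices,
        gold_index=gold_index,
    )
```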
Hi, thanks for your PR! FYI, we have a small backlog of PRs to go through so we might take about a week to address it
In the meantime, please make sure that the styling is correct :)
@NathanHB Could be worth waiting for #214 (last PR of the above series) before editing this one to fit the new format
Hi! I think once you update the PR to the new format for metrics, prompts, and functions, we'll be good to go! Also tagging @alielfilali01, since he was the author of the original arabic_evals file (these tasks are behind the Arabic LLM leaderboard), to get his opinion too.
Thanks @clefourrier for the tag, and thanks dear @bakrianoo for your valuable contribution 🤗
I have one remark related to the comment I left above:
hf_subset="default" will load the default subset, which is also the test subset used for both the eval and the few shots!
The solution for me would be to never use few shots in this benchmark, OR to make a custom version of this dataset with dedicated test and val subsets in it!?
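To make the second option concrete, a minimal sketch of the task config assuming the dataset gains a dedicated validation split; the split names, metric choice, and `arabic_mmlu_prompt` function below are illustrative, not taken from this PR:

```python
# Minimal sketch, assuming MBZUAI/ArabicMMLU gets a dedicated "dev" split.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig

arabic_mmlu_task = LightevalTaskConfig(
    name="arabic_mmlu",
    suite=["community"],
    prompt_function=arabic_mmlu_prompt,  # hypothetical prompt builder
    hf_repo="MBZUAI/ArabicMMLU",
    hf_subset="default",
    evaluation_splits=["test"],
    few_shots_split="dev",  # few-shot examples come from here, never from "test"
    few_shots_select="sequential",
    metric=[Metrics.loglikelihood_acc_norm],
)
```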
I agree with the comment - if you can set up your dataset to have a different split for few shots, it will avoid context contamination. You also need to run the linters to pass the code quality checks.
Thank you @alielfilali01 for your comment. I am wondering if creating a new dataset would violate the license of the original dataset!
I need to confirm with the dataset authors, or we can follow the zero-shot suggestion.
What do you think?
@bakrianoo, if it's not Apache 2.0, for example, then let's open a discussion in the repo and see if the authors would help with creating the dataset themselves. Please feel free to do it, and if there's no response then I can reach out directly to one or two of the authors... What do you think?
Sure. I will start the discussion there. Thank you @alielfilali01 for your interest.
Hi! Feel free to tell us when this is updated!
I think it's solved by this PR: https://github.com/huggingface/lighteval/pull/338. See mmlu_ara_mcf.
If you want to use native Arabic letters as option anchors, use `formulation=MCFFormulation("NativeLetters")` as the formulation for the task.
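For concreteness, a sketch of how that could look; the import paths and adapter keys are my best guess at the #338 API, not verified against it:

```python
# Sketch only: import paths and adapter keys are assumptions about the
# multilingual-templates API introduced in PR #338.
from lighteval.tasks.templates.multichoice import get_mcq_prompt_function
from lighteval.tasks.templates.utils.formulation import MCFFormulation
from lighteval.utils.language import Language

# Build a prompt function whose options are anchored with native Arabic
# letters instead of A/B/C/D.
mmlu_ara_mcf_prompt = get_mcq_prompt_function(
    Language.ARABIC,
    lambda line: {
        "question": line["question"],
        "choices": line["choices"],
        "gold_idx": line["answer_index"],
    },
    formulation=MCFFormulation("NativeLetters"),
)
```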
cc @clefourrier
@hynky1999 can you please mention the exact task name from the PR you just mentioned? The MMLU here is different from the other MMLUs, and I couldn't find it in the list from PR #338. Also, this PR is stale now, so feel free to close it, since I'm planning to merge this version of MMLU in an upcoming PR, maybe next week, alongside other new Arabic tasks
Sorry, just saw you mentioned "mmlu_ara_mcf", my bad. I see so many commits in the PR 😅 I will need to open it from my laptop. I'll get back to you on it by tomorrow at the latest, so I can plan to remove it from my PR if all is good. Thanks, man, for taking care of it
So many commits because of how we were merging the PRs hhhhh
What tasks are you planning to add, btw?
Some new benchmarks we got from colleagues and partners, whom we convinced to make them public 😁
Hey @hynky1999, sorry I couldn't get back to you yesterday. Well, I saw the Arabic MMLU task and it seems good. I'm just not sure about the instruction, if you can provide more details on that. Also, I saw the hf_repo is yazeed7/ArabicMMLU whereas the official release is MBZUAI/ArabicMMLU, if you can change that please.
In my upcoming PR I will be adding 3 Arabic MMLU datasets, as well as this one, as part of the community suite; then we can run both implementations and see if the implementation actually affects the evals (which it shouldn't, but I want to try it anyway 😅)
For clarity, here are the upcoming MMLU datasets:
- arabic_mmlu_mt: machine translated
- arabic_mmlu_ht: our in-house human-translated version
- arabic_mmmlu: OpenAI's human-translated version
- arabic_mmlu: the one we are discussing here (MBZUAI/ArabicMMLU)
PS : "mmlu_okapi_ar" is already part of the community suite
Hey @clefourrier, I've spoken with @bakrianoo, so please feel free to close this PR. @bakrianoo, maybe you can confirm here
Since @alielfilali01 is working on including this in another PR, I am going to close it. Thank you all for your support.
@alielfilali01 Sure, I will switch it. The reason I used the other repo is that previously the official one was either missing a dedicated few-shot split or it was annoying to access the subsets separately (I don't remember exactly which). Now it looks good 👍
Re instructions:
The design of the templates (which is what all the multilingual evals in that file use) is heavily based on the OLMES paper. Secondly, since it's a bit hard to create a global instruction for all multilingual tasks, I decided not to use instructions for any task. We ran several ablations with this setting and we didn't find it to be a reason why models can't solve the tasks.
As said in the OLMES paper, the question/answer format is sufficient for guiding the model on what to do.
In theory we could have something like an instruction registry and create a generic instruction for each task (rough sketch below).
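Purely as an illustration of that idea (nothing like this exists in lighteval today), such a registry could be a per-task, per-language lookup that falls back to the current no-instruction behaviour:

```python
# Illustrative sketch only -- no such registry exists in lighteval.
INSTRUCTION_REGISTRY: dict[str, dict[str, str]] = {
    # task name -> language code -> generic instruction
    "arabic_mmlu": {
        "ar": "اختر الإجابة الصحيحة للسؤال التالي.",
        "en": "Choose the correct answer to the following question.",
    },
}

def get_instruction(task_name: str, language: str) -> str:
    # An empty string reproduces today's behaviour: no instruction at all.
    return INSTRUCTION_REGISTRY.get(task_name, {}).get(language, "")
```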
Re other MMLUs:
- The OpenAI MMLUs should be added by this PR: https://github.com/huggingface/lighteval/pull/339
- arabic_mmlu_mt (is this the Okapi one, or different?)
- arabic_mmlu_ht (I don't see the point in adding the above if you have an in-house translation)
Last note, do you think you could use the multilingual templates?
@hynky1999 Actually, I added the in-house translated MMLU about a week before OpenAI released MMMLU, and you can imagine how much effort it took to convince the team internally to release it 😅. Also, I thought it would be helpful to test how translation quality impacts model performance, and also to just leave it to the community to decide which one they want to use...
Note: mmlu_mt is different from mmlu_okapi_ar. The first was translated using a translation engine, while Okapi used GPT-3.5 (I guess).