Arabic support for MBZUAI-arabic-mmlu
Add Support for ArabicMMLU Evaluation Task
Overview
This PR introduces a new evaluation task for Arabic-language LLMs using the ArabicMMLU dataset, as detailed in the paper "ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic". The dataset provides a comprehensive benchmark for evaluating LLM performance on a wide range of tasks in Arabic.
Related Work
- ArabicMMLU Dataset: MBZUAI/ArabicMMLU on Hugging Face
- Paper: ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
Notes
This contribution aligns with the ongoing efforts to expand lighteval's support for diverse languages and tasks. Feedback and suggestions are welcome!
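For reviewers, here is a minimal sketch of the kind of prompt function such a task needs; the column names ("Question", "Option 1"…"Option 5", "Answer Key") are assumptions about the ArabicMMLU schema, not verified field names from this PR:

```python
# Minimal sketch, not the exact code in this PR. Column names are
# assumptions about the ArabicMMLU schema.
from lighteval.tasks.requests import Doc

def arabic_mmlu_prompt(line: dict, task_name: str = None) -> Doc:
    # Collect the non-empty options for this row (questions may have 2-5).
    choices = [line[f"Option {i}"] for i in range(1, 6) if line.get(f"Option {i}")]
    # Map the letter answer key (A-E) to an index into `choices`.
    gold_index = "ABCDE".index(line["Answer Key"])
    return Doc(
        task_name=task_name,
        query=f"سؤال: {line['Question']}\nإجابة:",
        choices=choices,
        gold_index=gold_index,
    )
```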
Hi, thanks for your PR! FYI, we have a small backlog of PRs to go through so we might take about a week to address it
In the meantime, please make sure that the styling is correct :)
@NathanHB Could be worth waiting for #214 (last PR of the above series) before editing this one to fit the new format
Hi! I think once you update the PR to the new format for metrics, prompts, and functions, we'll be good to go! Also tagging @alielfilali01, since he was the author of the original arabic_evals file (these tasks are behind the Arabic LLM leaderboard), to get his opinion too.
Thanks @clefourrier for the tag, and thanks dear @bakrianoo for your valuable contribution 🤗
I have one remark related to the comment I left above:
hf_subset="default" will load the default subset, which is also the test subset used for both the eval and the few shots!
The solution for me would be to never use few shots in this benchmark, OR to make a custom version of this dataset with dedicated test and val subsets in it!?
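To make the second option concrete, a minimal sketch of the task config assuming the dataset gains a dedicated validation split; the split names, metric choice, and `arabic_mmlu_prompt` function below are illustrative, not taken from this PR:

```python
# Minimal sketch, assuming MBZUAI/ArabicMMLU gets a dedicated "dev" split.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig

arabic_mmlu_task = LightevalTaskConfig(
    name="arabic_mmlu",
    suite=["community"],
    prompt_function=arabic_mmlu_prompt,  # hypothetical prompt builder
    hf_repo="MBZUAI/ArabicMMLU",
    hf_subset="default",
    evaluation_splits=["test"],
    few_shots_split="dev",  # few-shot examples come from here, never from "test"
    few_shots_select="sequential",
    metric=[Metrics.loglikelihood_acc_norm],
)
```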
I agree with the comment - if you can set up your dataset to have a different split for few shots, it will avoid context contamination. You also need to run the linters to pass the code quality checks.
Thank you @alielfilali01 for your comment. I am wondering if creating a new dataset would violate the license of the original dataset!
I need to confirm with the dataset authors, or we can follow the zero-shot suggestion.
What do you think?
@bakrianoo, if it's not Apache 2.0, for example, then let's open a discussion in the repo and see if the authors would help with creating the dataset themselves. Please feel free to do it, and if there's no response then I can reach out directly to one or two of the authors... What do you think?
Sure. I will start the discussion there. Thank you @alielfilali01 for your interest.
Hi! Feel free to tell us when this is updated!
I think it's solved by this PR: https://github.com/huggingface/lighteval/pull/338. See mmlu_ara_mcf.
If you want to use native Arabic letters as option anchors, use `formulation=MCFFormulation("NativeLetters")` as the formulation for the task.
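For concreteness, a sketch of how that could look; the import paths and adapter keys are my best guess at the #338 API, not verified against it:

```python
# Sketch only: import paths and adapter keys are assumptions about the
# multilingual-templates API introduced in PR #338.
from lighteval.tasks.templates.multichoice import get_mcq_prompt_function
from lighteval.tasks.templates.utils.formulation import MCFFormulation
from lighteval.utils.language import Language

# Build a prompt function whose options are anchored with native Arabic
# letters instead of A/B/C/D.
mmlu_ara_mcf_prompt = get_mcq_prompt_function(
    Language.ARABIC,
    lambda line: {
        "question": line["question"],
        "choices": line["choices"],
        "gold_idx": line["answer_index"],
    },
    formulation=MCFFormulation("NativeLetters"),
)
```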
cc @clefourrier
@hynky1999 can you please mention the exact task name from the PR you just mentioned? The MMLU here is different from the other MMLUs, and I couldn't find it in the list from PR #338. Also, this PR is stale now, so feel free to close it, since I'm planning to merge this version of MMLU in an upcoming PR, maybe next week, alongside other new Arabic tasks
Sorry, just saw you mentioned "mmlu_ara_mcf", my bad. I see so many commits in the PR 😅 I will need to open it from my laptop. I'll get back to you on it by tomorrow at the latest, so I can plan to remove it from my PR if all is good. Thanks, man, for taking care of it
So many commits because of how we were merging the PRs hhhhh
What tasks are you planning to add, btw?
Some new benchmarks we got from colleagues and partners, whom we convinced to make them public 😁
Hey @hynky1999, sorry I couldn't get back to you yesterday. Well, I saw the Arabic MMLU task and it seems good. I'm just not sure about the instruction, if you can provide more details on that. Also, I saw the hf_repo is yazeed7/ArabicMMLU whereas the official release is MBZUAI/ArabicMMLU, if you can change that please.
In my upcoming PR I will be adding 3 Arabic MMLU datasets, as well as this one, as part of the community suite; then we can run both implementations and see if the implementation actually affects the evals (which it shouldn't, but I want to try it anyway 😅)
For clarity, here are the upcoming MMLU datasets:
- arabic_mmlu_mt: machine translated
- arabic_mmlu_ht: our in-house human-translated version
- arabic_mmmlu: OpenAI's human-translated version
- arabic_mmlu: the one we are discussing here (MBZUAI/ArabicMMLU)
PS : "mmlu_okapi_ar" is already part of the community suite
Hey @clefourrier, I've spoken with @bakrianoo, so please feel free to close this PR. @bakrianoo, maybe you can confirm here
Since @alielfilali01 is working on including this in another PR, I am going to close it. Thank you all for your support.
@alielfilali01 Sure, I will switch it. The reason I used the other repo is that previously the official one was either missing a dedicated few-shot split or it was annoying to access the subsets separately (I don't remember exactly which). Now it looks good 👍
Re instructions:
The design of the templates (which is what all the multilingual evals in that file use) is heavily based on the OLMES paper. Secondly, since it's a bit hard to create a global instruction for all multilingual tasks, I decided not to use instructions for any task. We ran several ablations with this setting and we didn't find it to be a reason why models can't solve the tasks.
As said in the OLMES paper, the question/answer format is sufficient for guiding the model on what to do.
In theory we could have something like an instruction registry and create a generic instruction for each task (rough sketch below).
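Purely as an illustration of that idea (nothing like this exists in lighteval today), such a registry could be a per-task, per-language lookup that falls back to the current no-instruction behaviour:

```python
# Illustrative sketch only -- no such registry exists in lighteval.
INSTRUCTION_REGISTRY: dict[str, dict[str, str]] = {
    # task name -> language code -> generic instruction
    "arabic_mmlu": {
        "ar": "اختر الإجابة الصحيحة للسؤال التالي.",
        "en": "Choose the correct answer to the following question.",
    },
}

def get_instruction(task_name: str, language: str) -> str:
    # An empty string reproduces today's behaviour: no instruction at all.
    return INSTRUCTION_REGISTRY.get(task_name, {}).get(language, "")
```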
Re other MMLUs:
- The OpenAI MMLUs should be added by this PR: https://github.com/huggingface/lighteval/pull/339
- arabic_mmlu_mt (is this the Okapi one, or different?)
- arabic_mmlu_ht (I don't see the point in adding the above if you have an in-house translation)
Last note, do you think you could use the multilingual templates?
@hynky1999 Actually, I added the in-house translated MMLU about a week before OpenAI released MMMLU, and you can imagine how much effort it took to convince the team internally to release it 😅. Also, I thought it would be helpful to test how translation quality impacts model performance, and also to just leave it to the community to decide which one they want to use...
Note: mmlu_mt is different from mmlu_okapi_ar. The first was translated using a translation engine, while Okapi used GPT-3.5 (I guess).