lm-evaluation-harness
Add TMLU Benchmark Dataset
This PR adds support for the TMLU ("Measuring Taiwanese Mandarin Language Understanding" by Chen et al.) benchmark dataset.
Summary
- Adds a new dataset `tmlu` with 2,981 multiple-choice questions across 37 subjects
- Uses the TMLU dataset hosted on Hugging Face
- Supports evaluating Taiwanese Mandarin language understanding using log-likelihood multiple-choice scoring (a sketch of this scoring follows the list)
- Includes tasks for each TMLU subject, e.g. `tmlu_geography`, `tmlu_physics`, etc.
- Enables reproducing results from the Open Taiwan LLM leaderboard
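For context on the scoring bullet above, here is a minimal sketch of log-likelihood multiple-choice scoring: the model scores each answer choice as a continuation of the question prompt, and the prediction is the choice with the highest total log-probability. The model name, prompt template, and example question below are illustrative placeholders, not the prompt this PR actually uses; the harness itself batches these requests and handles tokenization more carefully.

```python
# Minimal sketch of log-likelihood multiple-choice scoring (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def choice_loglikelihood(context: str, continuation: str) -> float:
    """Sum of log-probs the model assigns to `continuation` given `context`.

    Simplification: tokenizing context and context+continuation separately can
    mismatch at the token boundary; the harness handles this more carefully.
    """
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # Each continuation token is predicted by the logits one position earlier.
    return sum(
        log_probs[0, pos - 1, full_ids[0, pos]].item()
        for pos in range(ctx_len, full_ids.shape[1])
    )

question = "台灣最高的山是？"  # "What is the highest mountain in Taiwan?"
choices = {"A": "玉山", "B": "阿里山", "C": "合歡山", "D": "雪山"}
scores = {k: choice_loglikelihood(f"{question}\n答案：", v) for k, v in choices.items()}
prediction = max(scores, key=scores.get)  # accuracy = (prediction == gold label)
```

Once merged, running a subject task should look something like `lm_eval --model hf --model_args pretrained=<model> --tasks tmlu_geography`.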
Checklist
- [x] Referenced the original TMLU paper
- [x] Checked the TMLU reference implementation
- [x] Verified the Hugging Face dataset matches the data used in the TMLU paper (a spot-check sketch follows this checklist)
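On the last checklist item, a spot check of the hosted copy can be as simple as loading each subject and comparing sizes and fields against the paper. The repo id and config name below are assumptions for illustration, not identifiers verified against this PR:

```python
# Hypothetical spot check of the Hugging Face copy of TMLU. The repo id
# "miulab/tmlu" and the config name "geography" are assumptions; substitute
# whatever identifiers the task configs in this PR actually reference.
from datasets import load_dataset

subset = load_dataset("miulab/tmlu", "geography", split="test")
print(len(subset))          # compare with the per-subject count in the paper
print(subset.column_names)  # e.g. the question, choices, and answer fields
```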
Please let me know if you have any other suggestions or feedback on this PR!
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.
✅ lintangsutawika
✅ adamlin120
❌ Yen-Ting Adam, Lin
Yen-Ting Adam, Lin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.
@adamlin120 made some adjustments to this PR here: https://github.com/adamlin120/lm-evaluation-harness/pull/1
@adamlin120 just need your help to run `pre-commit run --all-files` and it should be good!