lm-evaluation-harness
mela
Add the ACL 2024 benchmark MELA (Multilingual Evaluation of Linguistic Acceptability)
@Geralt-Targaryen Thanks for the contribution! Can you see about reproducing some of the scores reported in Table 3 to validate the implementation is working correctly?
Yes, here are some models' results from our original implementation and the lm-evaluation-harness implementation:
| Model | Shots | Original (reported in the paper) | lm-evaluation-harness |
|---|---|---|---|
| BLOOMZ 7B | 0 | 5.85 | 5.99±0.85 |
| BLOOMZ 7B | 2 | 4.31 | 4.11±0.87 |
| mT0 13B | 0 | 6.62 | 7.72±0.88 |
| mT0 13B | 2 | 7.70 | 5.82±0.75 |
| mTk 13B | 0 | 2.24 | 3.16±1.01 |
| mTk 13B | 2 | 12.05 | 12.26±0.98 |
As we explained in the paper, linguistic acceptability is a task with large performance variation. Fluctuations resulting from the selection of in-context examples, floating-point precision, and prompt formatting are expected. One slight difference between the two implementations is that our original version used two newlines after the task description, whereas the eval harness appears to collapse multiple newlines after the task description into a single one.
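To make the formatting difference concrete, here is a minimal sketch; the task description and example below are hypothetical placeholders, not the exact MELA prompt text:

```python
# Minimal illustration of the prompt-formatting difference.
# The description and example strings are placeholders, not the actual MELA prompt.
description = "Determine whether the following sentence is linguistically acceptable."
example = "Sentence: The cat sat on the mat.\nAnswer: acceptable"

# Our original implementation: two newlines after the task description.
prompt_original = description + "\n\n" + example

# lm-evaluation-harness: the separator effectively becomes a single newline.
prompt_harness = description + "\n" + example

print(repr(prompt_original))
print(repr(prompt_harness))
```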
@StellaAthena @lintangsutawika @haileyschoelkopf Hi all, I'm one of the authors of MELA. We have reproduced some of the results in Table 3 of our paper (see previous comment). Could you let us know if there is anything else we need to do on our end? We (and some of our collaborators) are eager to evaluate our multilingual models using lm-evaluation-harness.
As we mentioned, MELA is a multilingual version of CoLA covering 10 languages: en, zh, ru, it, de, fr, es, ja, ar, is. The paper has been accepted to this year's ACL, and we hope the community can use the benchmark in their LLM evaluations. Thanks in advance!
Our paper is here: https://aclanthology.org/2024.acl-long.146/
Our data is here: https://github.com/sjtu-compling/MELA
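For anyone who wants to try it once the PR is merged, here is a rough sketch of running the evaluation through the Python API. The task name `mela` and the model below are assumptions for illustration; check the task names actually registered by this PR (there may be per-language variants instead of a single group).

```python
# Sketch only: assumes this PR registers a task (or task group) named "mela";
# the real task names may differ (e.g., per-language variants).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend
    model_args="pretrained=bigscience/bloomz-7b1",   # example model, swap as needed
    tasks=["mela"],
    num_fewshot=2,                                   # 0 or 2 shots, as in the table above
    batch_size=8,
)
print(results["results"])
```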
~~@Geralt-Targaryen made a PR to your PR to update a few things. https://github.com/Geralt-Targaryen/lm-evaluation-harness/pull/1~~
nvm.