OLMo
[Inquiry about MMLU Performance]
❓ The question
Hello OLMo team,
I have a question regarding the MMLU performance, which is currently at 28 and appears quite low. Could there be an issue with the evaluation code?
I think the main reason for the apparently low performance on MMLU is that the convention for MMLU is to formulate it as a true multiple-choice question, with the answer choices presented, so the model only has to answer A, B, etc. It turns out many base LMs (not instruction-tuned) struggle with this format (e.g., see this paper).
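For concreteness, here is a sketch of what that multiple-choice format looks like (the question and exact prompt template are made up for illustration; real harnesses vary in their templates and few-shot setup):

```python
# Sketch of an MMLU-style multiple-choice prompt (illustrative template only).
question = "What is the capital of France?"
choices = ["Berlin", "Madrid", "Paris", "Rome"]

prompt = question + "\n"
for letter, choice in zip("ABCD", choices):
    prompt += f"{letter}. {choice}\n"
prompt += "Answer:"

# The model is scored on whether its next token is the correct letter ("C" here).
print(prompt)
```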
In contrast, the other tasks, such as ARC, are conventionally evaluated using the (rather unnatural, but effective for base LMs) "ranked classification" formulation, which taps more directly into the base LM's capability to complete text.
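In ranked classification, instead of asking for a letter, you score each full answer string as a continuation of the question and pick the one the model assigns the highest likelihood. A minimal sketch of that scoring, assuming Hugging Face `transformers` (`gpt2` here is only a placeholder for a base LM, and the question is made up):

```python
# Sketch of "ranked classification" scoring: rank answer choices by the
# log-likelihood the model assigns them as continuations of the question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder base LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

context = "Question: What is the capital of France?\nAnswer:"
choices = [" Berlin", " Madrid", " Paris", " Rome"]

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probs the model assigns to `continuation` given `context`."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Token at position i is predicted from the logits at position i - 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_ids = full_ids[0, ctx_len:]
    positions = range(ctx_len - 1, full_ids.shape[1] - 1)
    return sum(logprobs[pos, tok].item() for pos, tok in zip(positions, cont_ids))

scores = {c: continuation_logprob(context, c) for c in choices}
prediction = max(scores, key=scores.get)  # highest-likelihood choice wins
print(scores, "->", prediction.strip())
```

Note that real harnesses often also length-normalize these scores so longer answer strings are not penalized; the sketch above uses the raw sum for simplicity.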
It turns out that some base models (such as Llama) can handle the ABCD format to some extent, but the score difference is likely more an artifact of the training data than a reflection of understanding of the underlying subject matter.
It appears that the question has been answered. Please reopen if still relevant.