OLMo
[Inquiry about MMLU Performance]
❓ The question
Hello OLMo team,
I have a question regarding the MMLU performance, which is currently at 28 and appears quite low. Could there be an issue with the evaluation code?
I think the main reason for the apparently low performance on MMLU is that the convention for MMLU is to formulate it as a true multiple-choice question, with the answer choices presented, so the model only has to answer A, B, etc. It turns out many base LMs (not instruction-tuned) struggle with this format (e.g., see this paper).
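For concreteness, here is a sketch of what that multiple-choice format looks like (the question and exact prompt template are made up for illustration; real harnesses vary in their templates and few-shot setup):

```python
# Sketch of an MMLU-style multiple-choice prompt (illustrative template only).
question = "What is the capital of France?"
choices = ["Berlin", "Madrid", "Paris", "Rome"]

prompt = question + "\n"
for letter, choice in zip("ABCD", choices):
    prompt += f"{letter}. {choice}\n"
prompt += "Answer:"

# The model is scored on whether its next token is the correct letter ("C" here).
print(prompt)
```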
In contrast, the other tasks, such as ARC, are conventionally evaluated using the (rather unnatural, but effective for base LMs) "ranked classification" formulation, which taps more directly into the base LM's capability to complete text.
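In ranked classification, instead of asking for a letter, you score each full answer string as a continuation of the question and pick the one the model assigns the highest likelihood. A minimal sketch of that scoring, assuming Hugging Face `transformers` (`gpt2` here is only a placeholder for a base LM, and the question is made up):

```python
# Sketch of "ranked classification" scoring: rank answer choices by the
# log-likelihood the model assigns them as continuations of the question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder base LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

context = "Question: What is the capital of France?\nAnswer:"
choices = [" Berlin", " Madrid", " Paris", " Rome"]

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probs the model assigns to `continuation` given `context`."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Token at position i is predicted from the logits at position i - 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_ids = full_ids[0, ctx_len:]
    positions = range(ctx_len - 1, full_ids.shape[1] - 1)
    return sum(logprobs[pos, tok].item() for pos, tok in zip(positions, cont_ids))

scores = {c: continuation_logprob(context, c) for c in choices}
prediction = max(scores, key=scores.get)  # highest-likelihood choice wins
print(scores, "->", prediction.strip())
```

Note that real harnesses often also length-normalize these scores so longer answer strings are not penalized; the sketch above uses the raw sum for simplicity.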
It turns out that some base models (such as Llama) can handle the ABCD format to some extent, but the score difference is likely more an artifact of the training data than a reflection of understanding of the underlying subject matter.
It appears that the question has been answered. Please reopen if still relevant.