lm-evaluation-harness
Add AIME 2024 and LiveCodeBench to the gold standard evaluation harness
First, thank you for creating and maintaining lm-evaluation-harness, which has become the gold standard for open-source LLM benchmarking. Its comprehensive coverage and reliability make it invaluable to the AI community. I'd like to request adding support for two important tasks that would further strengthen this excellent platform: (1) AIME 2024, and (2) LiveCodeBench.
yes!
+1 Has anyone just created a custom task that they would like to share?
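Not an official task, but for anyone who wants to try this before a PR lands, here is a minimal sketch of a custom task config in the harness's YAML format. The dataset path (`HuggingFaceH4/aime_2024`) and the `problem`/`answer` column names are assumptions based on one public mirror of the competition problems; swap in whatever dataset and fields you actually use.

```yaml
# Hypothetical config, e.g. saved as my_tasks/aime_2024.yaml.
# Dataset path and column names are assumptions, not an official task.
task: aime_2024
dataset_path: HuggingFaceH4/aime_2024  # assumed HF dataset mirror
output_type: generate_until
test_split: train                      # this mirror ships a single split
doc_to_text: "Problem: {{problem}}\n\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  until:
    - "</s>"
  do_sample: false
  max_gen_toks: 1024
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```

With the file in a local directory, something like `lm_eval --model hf --model_args pretrained=<model> --tasks aime_2024 --include_path ./my_tasks` should pick it up. Note that exact match on raw generations will undercount correct answers, so you'd likely want a `filter_list` regex that extracts the final integer before trusting the scores.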
Hi! @Allen-labs @imagoodman-aa @radna0 I'll be happy to work on AIME and will submit a PR soon.
+1