
MATH-500

Open · seldereyy opened this issue 3 weeks ago · 1 comment

Add MATH-500 and RU-MATH500 Benchmarks

This PR adds evaluation support for MATH-500, a 500-problem subset of the MATH benchmark of competition-level mathematics problems, selected by OpenAI in the Let’s Verify Step by Step paper, and RU-MATH500, its Russian translation.

Both benchmarks evaluate models on multi-step mathematical reasoning and complex problem solving.
The Russian variant enables assessment of reasoning capabilities in multilingual and Russian-language models.
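For reference, once merged, the new tasks should be runnable through the harness's Python API like any other task. A minimal sketch is below, assuming the tasks are registered under the names `math500` and `ru_math500`; the actual names are set by this PR's task configs and may differ.

```python
# Minimal sketch of running the new benchmarks via the lm-evaluation-harness
# Python API. The task names "math500" and "ru_math500" are assumptions;
# check the task configs added in this PR for the registered names.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any HF causal LM checkpoint
    tasks=["math500", "ru_math500"],
    batch_size=8,
)

# Per-task metrics (e.g., exact-match accuracy) live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```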

seldereyy · Nov 01 '25 14:11

CLA assistant check
All committers have signed the CLA.

CLAassistant · Nov 01 '25 14:11