lm-evaluation-harness
Add MATH-500 and RU-MATH500 Benchmarks
This PR adds evaluation support for MATH-500, a benchmark of 500 competition-level math problems selected from the MATH dataset by OpenAI in the Let’s Verify Step by Step paper, and RU-MATH500, its Russian translation.
Both benchmarks evaluate models on multi-step mathematical reasoning and complex problem-solving ability.
The Russian variant enables assessment of reasoning capabilities for multilingual and Russian-language models.
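Once the tasks are merged, they can be run through the harness's standard interfaces. Below is a minimal sketch using the Python API; the task names `math500` and `ru_math500` and the model checkpoint are assumptions, so substitute whatever names the task YAMLs in this PR actually register.

```python
# Minimal sketch: evaluating a Hugging Face model on the new tasks via
# lm-eval's Python API. The task names "math500" and "ru_math500" are
# assumed; use the names registered by the task configs in this PR.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-1.5B-Instruct",  # example model
    tasks=["math500", "ru_math500"],
    num_fewshot=4,
    batch_size="auto",
)

# Print per-task metrics (e.g. exact-match accuracy) for a quick check.
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```

The equivalent CLI invocation would be along the lines of `lm_eval --model hf --model_args pretrained=... --tasks math500,ru_math500 --num_fewshot 4`, again assuming those task names.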