Benchmark for reasoning models
## Motivation
To evaluate reasoning models it makes sense to use difficult questions. This benchmark evaluates on the LIMO dataset. The Qwen 1.5B distill achieves 47% accuracy pass@1.
## Modifications
A script to benchmark on LIMO.
## Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [x] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
Please take a look @zhaochenyang20. I think before merging we should further refine the parsing of answers and maybe also report majority voting, as is commonly done for this kind of benchmark.
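For what I mean by majority voting, something roughly like this (a minimal sketch; the function and the answer format are illustrative, not part of the current script):

```python
from collections import Counter


def majority_vote(answers):
    """Self-consistency / majority voting: return the most frequent
    parsed answer across the sampled generations for one question.
    Entries that failed to parse (None) are ignored."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# e.g. 5 sampled responses for one question, one of which failed to parse
print(majority_vote(["204", "204", "113", None, "204"]))  # -> "204"
```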
@simveit What's the official score of this model by the LIMO team? Could we align with them?
@zhaochenyang20 I think for LIMO we don't have reference results from them. This is because they used this dataset for training, not evaluation. But maybe someone else did such a benchmark that I am not aware of. Maybe we can ask them if they did such an evaluation internally?
Also, we should adjust the script to follow DeepSeek more closely:
> For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.
I will closely follow this approach in the next update of this branch, which I intend to push in the next one or two days.
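Concretely, the request side would look roughly like this (a minimal sketch against the OpenAI-compatible endpoint of an SGLang server; the port, client setup, and helper name are illustrative, not the actual script):

```python
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")


def sample_responses(question: str, n_samples: int = 64) -> list[str]:
    """Sample n_samples completions per question with the DeepSeek-R1
    evaluation settings: temperature 0.6, top-p 0.95, up to 32,768
    generated tokens, and 64 samples per query to estimate pass@1."""
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        messages=[{"role": "user", "content": question}],
        temperature=0.6,
        top_p=0.95,
        n=n_samples,
        max_tokens=32768,
    )
    return [choice.message.content for choice in response.choices]
```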
@simveit Thanks. Look forward to it.
@zhaochenyang20 this PR adjusts the script to use the new evaluation protocol suggested in the DeepSeek repo:
> For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.
We also make it easy to evaluate on other datasets and use that to benchmark on AIME 2024. The result is somewhat surprising: we get 32.2% instead of the 28.9% reported in the repo.
I wonder if the discrepancy is due to:
- the suffix to the prompt, `\nPlease reason step by step, and put your final answer within \boxed{}.`, which is commonly used for the DeepSeek Math model and also recommended in the DeepSeek-R1 repo
- maybe the reported result is at temperature 0
- maybe something is wrong in the way I evaluate
I don't think you are wrong. I will ask pengfei for help and see what he thinks.
@simveit Hey. We are discussing LIMO here. What is the AIME result about?
@zhaochenyang20 32.2% instead of 28.9% was for AIME 2024. The 28.9% is from the DeepSeek-R1 repo for the Qwen 1.5B distill.
I will evaluate on LIMO later today. This will take more time than AIME because for LIMO we will eval on ~800*64 = 51,200 samples. For LIMO we don't have any reference, that's why I used AIME to see if the script gives a reasonable result.
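For reference, with 64 samples per question the pass@1 estimate is just the per-question sample accuracy averaged over questions (a minimal sketch; the correctness bookkeeping is illustrative):

```python
def estimate_pass_at_1(per_question_flags: list[list[bool]]) -> float:
    """per_question_flags[i][j] is True if sample j of question i was
    judged correct. pass@1 is the mean per-sample accuracy for each
    question, averaged over all questions."""
    per_question = [sum(flags) / len(flags) for flags in per_question_flags]
    return sum(per_question) / len(per_question)


# Toy example: 2 questions, 4 samples each -> (0.75 + 0.25) / 2 = 0.5
print(estimate_pass_at_1([[True, True, False, True],
                          [False, False, True, False]]))
```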
Wow. We are better than DeepSeek officially 😂 Love to see your PR on both AIME and LIMO @simveit
Hi @zhaochenyang20, today I ran the benchmark on the LIMO dataset, this time with 8 tries for each question. The accuracy was marginally higher than with one try (see the updated README for the result).
Maybe we can also rename the benchmark to `benchmark_reasoning` or something like that.
Generally (I believe) it is suitable for every question/answer dataset with an integer answer on which we want to benchmark DeepSeek. WDYT?
Next step:
- Study the parsing functions from the DeepSeek Math repo and see if we can use them to make the parsing more robust, and possibly also evaluate effectively on datasets with non-integer answers, for example $\frac{\pi}{2}$ (rough sketch of the direction below).
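Something in this direction (a simplified sketch with brace matching, not a copy of the DeepSeek Math parser):

```python
def extract_boxed_answer(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a response,
    with brace matching so answers like \\frac{\\pi}{2} stay intact."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces


print(extract_boxed_answer(r"... so the answer is \boxed{\frac{\pi}{2}}."))
# -> \frac{\pi}{2}
```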
@zhaochenyang20 I have now integrated the improved parsing and a benchmark for AIME 2025. I think this is close to merge-ready.
I don't understand. I used the router in a one-node setting:
`python3 -m sglang_router.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 30000 --dp-size 4`
Do you mean launching the runtime and the router separately?
Oh, sorry. I got it wrong. You are right.
We will merge it today!
great