
Benchmark for reasoning models

Open simveit opened this issue 10 months ago • 10 comments

Motivation

To evaluate reasoning models it makes sense to use difficult questions. This benchmark evaluates on the LIMO dataset. The Qwen 1.5B distill achieves 47% pass@1 accuracy.

Modifications

A script to benchmark on LIMO.

Checklist

simveit avatar Feb 12 '25 20:02 simveit

Please take a look @zhaochenyang20. I think before merging we should further refine the answer parsing, and maybe also report majority voting, as is commonly done for this kind of benchmark.
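For reference, a minimal sketch of the kind of majority voting I have in mind (the parsing step and the sample list here are placeholders, not code from this PR):

```python
from collections import Counter

def majority_vote(parsed_answers: list[str]) -> str:
    """Return the most common parsed answer among the sampled responses."""
    # Ties are broken in favor of the answer seen first.
    return Counter(parsed_answers).most_common(1)[0][0]

# Hypothetical example: 5 sampled answers for one question with gold answer "42".
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples) == "42")  # True -> correct under majority voting
```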

simveit avatar Feb 12 '25 20:02 simveit

@simveit What's the official score of this model from the LIMO team? Could we align with them?

zhaochenyang20 avatar Feb 13 '25 06:02 zhaochenyang20

@zhaochenyang20 I think for LIMO we don't have reference results from them, because they used this dataset for training, not evaluation. But maybe someone else did such a benchmark that I am not aware of. Maybe we can ask them whether they ran such an evaluation internally?

Also, we should adjust the script to follow DeepSeek more closely:

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.

I will closely follow this approach in the next update of this branch, which I intend to push in the next one or two days.
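As a rough illustration of those settings against a locally running SGLang server (the OpenAI-compatible endpoint, the port, the placeholder question, and the n=64 shortcut here are assumptions; the actual script may batch requests differently):

```python
import openai

# Assumes an SGLang server with an OpenAI-compatible endpoint on port 30000.
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

question = "What is 2 + 2?"  # placeholder question
suffix = "\nPlease reason step by step, and put your final answer within \\boxed{}."

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": question + suffix}],
    temperature=0.6,    # DeepSeek-recommended temperature
    top_p=0.95,         # DeepSeek-recommended top-p
    max_tokens=32768,   # maximum generation length
    n=64,               # 64 responses per query to estimate pass@1
)
answers = [choice.message.content for choice in resp.choices]
```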

simveit avatar Feb 13 '25 18:02 simveit

@simveit Thanks. Look forward to it.

zhaochenyang20 avatar Feb 14 '25 00:02 zhaochenyang20

@zhaochenyang20 this PR adjusts the script to use the new way of evaluating suggested in the DeepSeek repo:

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.

We also make it easy to evaluate on other datasets and use that to benchmark on AIME 2024. The result is somewhat surprising: we get 32.2% instead of the 28.9% reported in the repo.

I wonder if the discrepancy is due to:

  • the prompt suffix \nPlease reason step by step, and put your final answer within \boxed{}., which is commonly used for the DeepSeek Math models and is also recommended in the DeepSeek R1 repo
  • maybe the reported result is at temperature 0
  • maybe something is wrong in the way I evaluate

simveit avatar Feb 14 '25 20:02 simveit

I don't think you are wrong. I will ask Pengfei for help and see what he thinks.

zhaochenyang20 avatar Feb 15 '25 01:02 zhaochenyang20

@simveit Hey. We are discussing LIMO here. Where does AIME come in?

zhaochenyang20 avatar Feb 15 '25 01:02 zhaochenyang20

@zhaochenyang20 32.2% instead of 28.9% was for AIME 2024. The 28.9% is from the DeepSeek R1 repo for the Qwen 1.5B distill.

I will evaluate on LIMO later today. This will take more time than AIME because for LIMO we will evaluate on ~800*64=51200 samples. For LIMO we don't have any reference, which is why I used AIME to check that the script gives a reasonable result.
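For clarity, the pass@1 number is just the mean per-sample accuracy across questions, roughly like this (the per-question correctness lists are assumed to come from the parsed model outputs):

```python
def pass_at_1(correct_per_question: list[list[bool]]) -> float:
    """Estimate pass@1 as the mean fraction of correct samples per question."""
    per_question = [sum(c) / len(c) for c in correct_per_question]
    return sum(per_question) / len(per_question)

# Hypothetical example: 2 questions, 4 samples each -> (0.75 + 0.25) / 2 = 0.5
print(pass_at_1([[True, True, False, True], [False, False, True, False]]))
```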

simveit avatar Feb 15 '25 04:02 simveit

Wow. We are better than DeepSeek officially 😂 Love to see your PR on both AIME and LIMO @simveit

zhaochenyang20 avatar Feb 15 '25 18:02 zhaochenyang20

Hi @zhaochenyang20, today I ran the benchmark on the LIMO dataset, this time with 8 tries for each question; the accuracy was marginally higher than with one try (see the updated README for the result).

Maybe we can also rename the benchmark to benchmark_reasoning or something like that. Generally (I believe) it is suitable for any question/answer dataset with an integer answer on which we want to benchmark DeepSeek. WDYT?

Next step:

  • Study the parsing functions from the DeepSeek Math repo and see if we can use them to make the parsing more robust and possibly also evaluate effectively on datasets with non-integer answers, for example $\frac{\pi}{2}$.
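For example, a minimal sketch of a brace-matching \boxed{...} extractor that would also survive nested braces like $\frac{\pi}{2}$ (this is my own illustration, not the DeepSeek Math parsing code):

```python
def extract_last_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...}, matching nested braces."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out)

print(extract_last_boxed(r"... so the answer is \boxed{\frac{\pi}{2}}."))  # \frac{\pi}{2}
```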

simveit avatar Feb 15 '25 18:02 simveit

@zhaochenyang20 I have now integrated the improved parsing and a benchmark for AIME 2025. I think this is close to merge.

simveit avatar Feb 16 '25 18:02 simveit

I don't understand. I used the router in a one-node setting.

python3 -m sglang_router.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 30000 --dp-size 4

Do you mean launching the runtime and the router separately?

simveit avatar Feb 16 '25 18:02 simveit

Oh. Sorry. I took it wrong. You are right.

zhaochenyang20 avatar Feb 16 '25 19:02 zhaochenyang20

We will merge it today!

zhaochenyang20 avatar Feb 16 '25 19:02 zhaochenyang20

great

simveit avatar Feb 16 '25 19:02 simveit