simveit comments

Results: 35 comments by simveit

Benchmark for reasoning models

@zhaochenyang20 I think we don’t have reference results for LIMO. This is because they used this dataset for training, not evaluation. But maybe someone else did such a...

Benchmark for reasoning models

@zhaochenyang20 this PR adjusts the script to include the new evaluation method suggested in the DeepSeek repo ``` For all our models, the maximum generation length is set to 32,768...

Benchmark for reasoning models

@zhaochenyang20 32.2% instead of 28.9% was for [AIME 2024](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024). The 28.9% is from the DeepSeek-R1 repo for the Qwen 1.5B distill. I will evaluate on LIMO later today. This will take more...

Benchmark for reasoning models

Hi @zhaochenyang20, today I ran the benchmark on the LIMO dataset, this time with 8 tries for each question. The accuracy was marginally higher than with one try (see updated README for...

Benchmark for reasoning models

@zhaochenyang20 I have now integrated improved parsing and a benchmark for AIME 2025. I think this is close to merge.

Benchmark for reasoning models

I don't understand. I used the [router in a one-node setting](https://docs.sglang.ai/router/router.html#co-launch-router-and-runtimes). ``` python3 -m sglang_router.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 30000 --dp-size 4 ``` Do you mean [this way](https://docs.sglang.ai/router/router.html#launch-runtimes-and-router-separately) of launching runtime and...

Variance measure for reasoning benchmark
