
Extensive benchmarking of reasoning models including variance

Open simveit opened this issue 10 months ago • 1 comments

In their R1 repo the DeepSeek people recommend estimating pass@1 by asking the same question multiple times. We implemented that in our reasoning benchmark. In addition to the averaged accuracy, we also report the average standard error as a measure of uncertainty. Ideally this would include some plots.
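Roughly, the estimation looks like the sketch below (a minimal illustration only; `ask_model` and `is_correct` are hypothetical stand-ins for the benchmark's model call and answer check, not the actual implementation):

```python
# Sketch: estimate pass@1 by asking each question num_tries times and
# averaging per-question accuracy. ask_model and is_correct are
# hypothetical placeholders for the real benchmark code.
def estimate_pass_at_1(questions, num_tries, ask_model, is_correct):
    per_question = []
    for q in questions:
        hits = sum(is_correct(q, ask_model(q)) for _ in range(num_tries))
        per_question.append(hits / num_tries)  # per-question pass@1 estimate
    return sum(per_question) / len(per_question)  # averaged accuracy over questions
```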

Now we want to run experiments on how the results change under

  • repeated runs with the same hyperparameters
  • an increased number of trials

I think AIME 2024 is well suited to this experiment, because LIMO is quite large and it would take a long time to run experiments on it with a large number of trials.

Please see the recently merged branch that adds measurement of uncertainty in reasoning model answers for more details and a detailed explanation of the metrics.

Feel free to reach out to me if you have further questions.

simveit avatar Feb 20 '25 07:02 simveit

Great! @tanzelin430 please take a look at this.

zhaochenyang20 avatar Feb 20 '25 19:02 zhaochenyang20

Hi, I'm not sure this is the right place to ask. I'm curious about the reasoning behind calculating the mean standard error. For measuring model consistency/uncertainty, the mean standard deviation might be more appropriate as it directly represents variability without being influenced by sample size. For estimating the uncertainty in the accuracy measurement, the standard binomial standard error (sqrt(p*(1-p)/N)) would be more conventional. Thanks.

pmarinroig avatar Mar 08 '25 19:03 pmarinroig

> Hi, I'm not sure this is the right place to ask. I'm curious about the reasoning behind calculating the mean standard error. For measuring model consistency/uncertainty, the mean standard deviation might be more appropriate as it directly represents variability without being influenced by sample size. For estimating the uncertainty in the accuracy measurement, the standard binomial standard error (sqrt(p*(1-p)/N)) would be more conventional. Thanks.

For each question we have num_tries answers. This can be viewed as a binomial experiment with num_tries trials. We then use the formula you gave above to estimate the per-question standard error, and take the mean of that over all questions to get an estimate of how far we deviate on average per question.

This is of course a little bit hand-wavy, but the experiments carried out by Zelin Tan show that it is sufficient to get an upper bound for the deviation we can expect. We will soon integrate these results into the benchmark.
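As a sketch of what I mean (assuming a hypothetical `results` structure, a list of per-question lists of 0/1 correctness flags; this is not the benchmark's actual code):

```python
import math

# Each question's num_tries answers are treated as a binomial experiment.
# The per-question standard error is sqrt(p * (1 - p) / num_tries), and the
# reported uncertainty is the mean of those per-question values.
def mean_standard_error(results, num_tries):
    per_question_se = []
    for answers in results:           # answers: list of 0/1 flags for one question
        p = sum(answers) / num_tries  # per-question accuracy estimate
        per_question_se.append(math.sqrt(p * (1 - p) / num_tries))
    return sum(per_question_se) / len(per_question_se)
```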

simveit avatar Mar 08 '25 21:03 simveit