sglang Variance measure for reasoning benchmark

Motivation

This PR introduces a reasoning benchmark.

We estimate

$PASS@1 = \frac{1}{N_{question}}\sum_{i=1}^{N_{question}}\frac{1}{N_{tries}}\sum_{j=1}^{N_{tries}}correct_{i,j}$, where $correct_{i,j}$ is 1 if question $i$ is answered correctly in try $j$ and 0 otherwise.
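As a minimal sketch of the PASS@1 estimate above (not the code from this PR), assume the results are stored as a nested list `correct`, where `correct[i][j]` is 1 or 0. The function name `pass_at_1` and the sample data are hypothetical and only serve as illustration.

```python
def pass_at_1(correct):
    """Average per-question accuracy over all questions.

    correct[i][j] is 1 if question i was answered correctly in try j, else 0.
    (Hypothetical data layout, assumed for this sketch.)
    """
    # Inner mean: fraction of correct tries for each question.
    per_question = [sum(tries) / len(tries) for tries in correct]
    # Outer mean: average of the per-question accuracies.
    return sum(per_question) / len(per_question)


# Made-up example: 3 questions, 4 tries each.
correct = [
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(pass_at_1(correct))
```

Because every question has the same number of tries here, this equals the overall fraction of correct answers.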

We want to benchmark not only the accuracy but also the variance of the answers. For this we use the metric:

$\frac{1}{N_{question}}\sum_{i=1}^{N_{question}}SE_i$, where $SE_i=\frac{1}{\sqrt{N_{tries}}}\sigma_i$ is the standard error of question $i$ and $\sigma_i$ is the standard deviation of $correct_{i,j}$ over the tries for that question. This average should reflect how much we deviate on average from the reported accuracy.
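The mean standard error can be sketched as follows. This is an illustration under stated assumptions, not the implementation from this PR: the function name `mean_standard_error` and the nested-list layout are hypothetical, and $\sigma_i$ is taken to be the population standard deviation (`statistics.pstdev`). The PR's code may instead use the sample standard deviation, which gives slightly larger values.

```python
import math
import statistics


def mean_standard_error(correct):
    """Average of the per-question standard errors SE_i = sigma_i / sqrt(N_tries).

    correct[i][j] is 1 if question i was answered correctly in try j, else 0.
    (Hypothetical data layout, assumed for this sketch.)
    """
    standard_errors = []
    for tries in correct:
        # Assumption: population standard deviation (divides by N, not N - 1).
        sigma = statistics.pstdev(tries)
        standard_errors.append(sigma / math.sqrt(len(tries)))
    return sum(standard_errors) / len(standard_errors)


# Made-up example: 3 questions, 4 tries each.
correct = [
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(mean_standard_error(correct))
```

Questions that are always answered correctly or always answered incorrectly contribute a standard error of 0, so the metric is driven by the questions on which the model is inconsistent.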

Next steps:

Use the provided code to benchmark the standard error on AIME 2024. For instructions on how to run the benchmark on AIME, please see the provided README. Run it multiple times to see how accurate the results are.
Report the results from the first step in a plot and include this plot in the README.

Feb 18 '25 19:02 simveit

@zhaochenyang20 maybe someone can take over from here. The only thing that remains to be done is to run the benchmark multiple times.

Feb 18 '25 19:02 simveit

@simveit should this be an issue or a PR? I can encourage others to take it on.

Feb 19 '25 00:02 zhaochenyang20

Now that you say it, maybe it's cleaner to make this a PR and let me write a separate issue for the benchmarking. This code is working and complete. What do you think?

Feb 19 '25 10:02 simveit

@simveit could you send me the issue link and tell others how to do the variance measurements, starting from how to run the code 😂

I found someone interested in this. Also, should we merge this PR now?

Feb 19 '25 18:02 zhaochenyang20

Yes, we can merge this PR. I will write the issue later.

Feb 19 '25 18:02 simveit

@simveit I told yineng to merge it. Thanks! @zhyncs

Feb 19 '25 18:02 zhaochenyang20