Variance measure for reasoning benchmark
Motivation
In this PR we introduce a reasoning benchmark.
We estimate
$PASS@1 = \frac{1}{N_{question}}\sum_{i=1}^{N_{question}}\frac{1}{N_{tries}}\sum_{j=1}^{N_{tries}}correct_{i,j}$ where $correct_{i,j}$ is 1 if question $i$ is answered correctly in try $j$, and 0 otherwise.
In this PR we want to benchmark not only the accuracy but also the variance of the answers. For this we use the metric:
$\frac{1}{N_{question}}\sum_{i=1}^{N_{question}}SE_i$ where $SE_i=\frac{1}{\sqrt{N_{tries}}}\sigma_i$, i.e. the standard error of question $i$. This means $\frac{1}{N_{question}}\sum_{i=1}^{N_{question}}SE_i$ reflects how much, on average, we deviate from the reported accuracy.
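The two metrics above can be sketched as follows. This is a minimal illustration, not the benchmark code from this PR; the function name and the toy correctness matrix are made up for the example, and we assume results come as a 0/1 matrix of shape (questions, tries).

```python
import numpy as np

def pass_at_1_and_mean_se(correct):
    """Compute pass@1 and the mean per-question standard error.

    correct: array-like of shape (n_questions, n_tries),
             entry (i, j) is 1 if question i was answered
             correctly in try j, else 0.
    """
    correct = np.asarray(correct, dtype=float)
    n_questions, n_tries = correct.shape
    # pass@1: average accuracy per question, then average over questions
    pass_at_1 = correct.mean(axis=1).mean()
    # SE_i = sigma_i / sqrt(N_tries), using the population std per question
    se = correct.std(axis=1, ddof=0) / np.sqrt(n_tries)
    return pass_at_1, se.mean()

# toy example: 3 questions, 4 tries each (made-up data)
correct = [[1, 1, 0, 1],
           [0, 0, 0, 0],
           [1, 1, 1, 1]]
acc, mean_se = pass_at_1_and_mean_se(correct)
```

Questions answered identically on every try (rows 2 and 3) contribute zero standard error, so the mean SE isolates how much run-to-run variation there is around the reported accuracy.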
Next steps:
- Use the provided code to benchmark the standard error on AIME 2024. For instructions on how to run the benchmark on AIME, please see the provided README. Run it multiple times to see how accurate the results are.
- Report the results from the first step in a plot and include this plot in the README.
@zhaochenyang20 maybe someone can take on from here. The only thing that remains to be done is to run the benchmark multiple times.
@simveit should this be an issue or a PR? I can advocate for others to take it on.
Now that you say it, maybe it's cleaner to make this a PR and let me write a separate issue for the benchmarking. This code is working and complete. What do you think?
@simveit could you send me the issue link and tell others how to do the variance measurements, starting from how to run the code 😂
I found someone interested in this. Also, should we merge this PR now?
Yes, we can merge this PR. I will write the issue later.
@simveit I told yineng to merge it. Thanks! @zhyncs