Benchmark: Improved stats, now also printing stats for each individual test

Open Michal-Mikolas opened this issue 7 months ago • 5 comments

To provide more detailed benchmark information and make benchmark debugging easier, I've added a feature that prints more detailed stats after the benchmark finishes (or when running benchmark.py --stats).

The change is backward compatible with existing results, so there is no need to re-run old benchmarks to see the new details.

What it looks like

[image] ... [image]

That's it. This change will give us a better understanding of a model's strengths and weaknesses across different languages, and it provides information for investigating why one user got a different score than another user for the same model.

Apr 11 '25 13:04 Michal-Mikolas

All committers have signed the CLA.

Apr 11 '25 13:04 CLAassistant

Really useful change! In my humble opinion, it would be awesome if it could also show stats on a per-language basis. In my experience, the programming ability can vary wildly from language to language. If I am looking for a strong model for coding in Python, then looking at the Go or Java tests is pretty useless.

Apr 11 '25 14:04 Mushoz

Really useful change! In my humble opinion, it would be awesome if it could also show stats on a per-language basis. In my experience, the programming ability can vary wildly from language to language. If I am looking for a strong model for coding in Python, then looking at the Go or Java tests is pretty useless.

I didn't test it, but there is this parameter in the benchmark source code:

        "--stats-languages",
        help="Only include stats for specific languages (comma separated)",
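
As a rough illustration of how a comma-separated option like this could be consumed, here is a minimal sketch. It assumes plain `argparse`, which may not be what benchmark.py actually uses, and the helper names `parse_languages`, `build_parser`, and `filter_tests` are hypothetical. It also assumes test names start with the language directory, as in `cpp/...`.

```python
import argparse


def parse_languages(value):
    """Split a comma-separated string into a list of lowercase language names."""
    return [part.strip().lower() for part in value.split(",") if part.strip()]


def build_parser():
    # Hypothetical parser; only the flag name and help text come from the thread.
    parser = argparse.ArgumentParser(description="Stats filter sketch")
    parser.add_argument(
        "--stats-languages",
        type=parse_languages,
        default=None,
        help="Only include stats for specific languages (comma separated)",
    )
    return parser


def filter_tests(test_names, languages):
    """Keep only tests whose leading path component is a selected language."""
    if not languages:
        # No filter given: keep every test.
        return list(test_names)
    return [name for name in test_names if name.split("/", 1)[0] in languages]
```

With this sketch, passing `--stats-languages "python, go"` would keep only tests under `python/` and `go/`.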

Apr 11 '25 14:04 Michal-Mikolas

@Michal-Mikolas I think the ask, which would be neat to see, is a per-language breakdown at the end of that list of individual tests, perhaps like this:

---- breakdown ----  pass/fail  timeouts  syn_err  user_asks  malformed  exhausted  error  lazy  ind_err
cpp/...              6/14       0         0        21         2          0          3      0     0
go/...               3/20       0         0        15         3          0          2      0     0
...
rust/...             18/3       0         0        12         5          0          2      0     0

Maybe with pass % too?
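
A minimal sketch of how such a breakdown with a pass percentage could be computed. It assumes each result is a `(test_name, passed)` pair and that the language is the first path component of the test name. The function names and the input shape are hypothetical and not taken from benchmark.py.

```python
from collections import defaultdict


def breakdown_by_language(results):
    """Aggregate (test_name, passed) pairs into per-language pass/fail counts."""
    totals = defaultdict(lambda: {"pass": 0, "fail": 0})
    for test_name, passed in results:
        # Assumption: names look like "cpp/some-test", so the prefix is the language.
        language = test_name.split("/", 1)[0]
        totals[language]["pass" if passed else "fail"] += 1
    return dict(totals)


def format_breakdown(totals):
    """Render one row per language with pass/fail counts and a pass percentage."""
    lines = ["{:<10} {:>9} {:>7}".format("language", "pass/fail", "pass %")]
    for language in sorted(totals):
        passed = totals[language]["pass"]
        failed = totals[language]["fail"]
        total = passed + failed
        pct = 100.0 * passed / total if total else 0.0
        label = language + "/..."
        ratio = "{}/{}".format(passed, failed)
        lines.append("{:<10} {:>9} {:>6.1f}%".format(label, ratio, pct))
    return "\n".join(lines)
```

The remaining columns (timeouts, syn_err, and so on) could be summed per language in the same way.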

Apr 11 '25 22:04 ziemkowski

@ziemkowski @Mushoz Ok, done.

[image]

Apr 18 '25 22:04 Michal-Mikolas

Any feedback, @paul-gauthier?

May 14 '25 08:05 Michal-Mikolas
