Benchmark: Improved stats, now also printing stats for each individual test

Open Michal-Mikolas opened this issue 7 months ago • 5 comments

To provide more detailed information about a benchmark run and to make benchmark debugging easier, I've added a feature that prints more detailed stats after the benchmark finishes (or when running benchmark.py --stats).

The change is backward compatible with existing results, so there is no need to re-run old benchmarks to see the new details.

What it looks like

image ... image

That's it. This change will give us a better understanding of a model's strengths and weaknesses across different languages, and it will help when investigating why one user got a different score than another user for the same model.

Michal-Mikolas avatar Apr 11 '25 13:04 Michal-Mikolas

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Apr 11 '25 13:04 CLAassistant

Really useful change! In my humble opinion, it would be awesome if it could also show stats on a per-language basis. In my experience, the programming ability can vary wildly from language to language. If I am looking for a strong model for coding in Python, then looking at the Go or Java tests is pretty useless.

Mushoz avatar Apr 11 '25 14:04 Mushoz

> Really useful change! In my humble opinion, it would be awesome if it could also show stats on a per-language basis. In my experience, the programming ability can vary wildly from language to language. If I am looking for a strong model for coding in Python, then looking at the Go or Java tests is pretty useless.

I didn't test it, but there is this parameter in the benchmark source code:

        "--stats-languages",
        help="Only include stats for specific languages (comma separated)",
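
For illustration only, here is a minimal sketch of how a comma-separated language filter of this kind could behave. It is not taken from benchmark.py: the helper names `parse_languages` and `filter_by_language` and the record fields (`testcase`, `language`, `passed`) are hypothetical, and the real result format may differ.

```python
def parse_languages(value):
    """Split a comma-separated option value into a set of lowercase names."""
    if not value:
        return None  # no filter requested, keep everything
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def filter_by_language(results, languages):
    """Keep only the records whose 'language' field is in the requested set."""
    if languages is None:
        return list(results)
    return [r for r in results if r["language"].lower() in languages]


# Hypothetical per-test records (assumed shape, not the benchmark's own format).
results = [
    {"testcase": "two-fer", "language": "python", "passed": True},
    {"testcase": "bob", "language": "go", "passed": False},
    {"testcase": "leap", "language": "rust", "passed": True},
]

selected = filter_by_language(results, parse_languages("python, rust"))
print([r["testcase"] for r in selected])  # prints ['two-fer', 'leap']
```

Parsing into a set up front keeps the per-record check cheap and makes the filter tolerant of stray spaces and mixed case in the option value.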

Michal-Mikolas avatar Apr 11 '25 14:04 Michal-Mikolas

@Michal-Mikolas I think the ask, which would be neat to see, is a per-language breakdown at the end of that list of individual tests, perhaps like this:

---- breakdown ---- pass/fail timeouts syn_err user_asks malformed exhausted error lazy ind_err
     cpp/...        6/14      0        0       21        2         0         3     0    0
     go/...         3/20      0        0       15        3         0         2     0    0
     ...
     rust/...       18/3      0        0       12        5         0         2     0    0

Maybe with pass % too?
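
As a rough sketch of what such a breakdown involves, the per-test records could be grouped by language and a pass percentage derived from the counts. The record shape (`language`, `passed`) and the function names below are assumptions made for illustration, not the benchmark's actual code, and only the pass/fail columns are shown.

```python
from collections import defaultdict


def breakdown_by_language(results):
    """Aggregate pass/fail counts per language from per-test records."""
    counts = defaultdict(lambda: {"passed": 0, "failed": 0})
    for r in results:
        key = "passed" if r["passed"] else "failed"
        counts[r["language"]][key] += 1
    return dict(counts)


def format_breakdown(counts):
    """Render one row per language with pass/fail counts and pass %."""
    lines = [f"{'language':<12} {'pass/fail':>9} {'pass %':>7}"]
    for language in sorted(counts):
        passed = counts[language]["passed"]
        failed = counts[language]["failed"]
        total = passed + failed
        pct = 100.0 * passed / total if total else 0.0
        label = language + "/..."
        ratio = f"{passed}/{failed}"
        lines.append(f"{label:<12} {ratio:>9} {pct:>6.1f}%")
    return "\n".join(lines)


# Hypothetical records (assumed shape); real results carry more fields.
results = [
    {"language": "python", "passed": True},
    {"language": "python", "passed": True},
    {"language": "python", "passed": False},
    {"language": "go", "passed": False},
]

counts = breakdown_by_language(results)
print(format_breakdown(counts))
```

The other columns in the mock-up above (timeouts, syn_err, and so on) could be summed per language in the same loop if the records expose them.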

ziemkowski avatar Apr 11 '25 22:04 ziemkowski

@ziemkowski @Mushoz Ok, done.

image

Michal-Mikolas avatar Apr 18 '25 22:04 Michal-Mikolas

Any feedback, @paul-gauthier?

Michal-Mikolas avatar May 14 '25 08:05 Michal-Mikolas