aider
Benchmark: Improved stats, now also printing stats for each individual test
To give more detailed information about a benchmark run and to make benchmark debugging easier, I've added a feature that prints more detailed stats after the benchmark finishes (or when running benchmark.py --stats).
The change is backward compatible with existing results, so there is no need to re-run old benchmarks to see the new details.
What it looks like
...
That's it. This change should give us a better understanding of a model's strengths and weaknesses across different languages, and help when investigating why one user got a different score than another user for the same model.
Really useful change! In my humble opinion, it would be awesome if it could also show stats on a per-language basis. In my experience, the programming ability can vary wildly from language to language. If I am looking for a strong model for coding in Python, then looking at the Go or Java tests is pretty useless.
I didn't test it, but there is this parameter in the benchmark source code:
"--stats-languages",
help="Only include stats for specific languages (comma separated)",
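For illustration only, here is a minimal sketch of how a comma-separated language value could be applied to a list of test names. It is not taken from benchmark.py: the helper names `parse_languages` and `filter_tests` are hypothetical, and it assumes each test name starts with its language as the first path component (as in `cpp/...`).

```python
def parse_languages(value):
    """Split a comma-separated option value into a set of lowercase language names."""
    if not value:
        return set()
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def filter_tests(test_names, languages):
    """Keep only tests whose first path component is one of the selected languages.

    Assumption: test names look like "<language>/<exercise>".
    """
    if not languages:
        return list(test_names)  # no filter given: keep everything
    return [
        name for name in test_names
        if name.split("/", 1)[0].lower() in languages
    ]


if __name__ == "__main__":
    # Made-up test names, purely for demonstration.
    names = ["python/two-fer", "go/two-fer", "rust/two-fer"]
    print(filter_tests(names, parse_languages("python, Go")))
```

Whether the real option behaves this way would need to be checked against the benchmark source.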
@Michal-Mikolas I think the ask, which would be neat to see, is a per-language breakdown at the end of that list of individual tests, perhaps like this:
---- breakdown ----  pass/fail  timeouts  syn_err  user_asks  malformed  exhausted  error  lazy  ind_err
cpp/...                   6/14         0        0         21          2          0      3     0        0
go/...                    3/20         0        0         15          3          0      2     0        0
...
rust/...                  18/3         0        0         12          5          0      2     0        0
Maybe with pass % too?
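A rough sketch of how such a per-language summary with a pass percentage could be computed. This is an assumption-laden illustration, not the code in this PR: `language_breakdown` and `format_breakdown` are hypothetical names, the input is assumed to be (test name, passed) pairs, and only the pass/fail column is covered.

```python
from collections import defaultdict


def language_breakdown(results):
    """Aggregate (test_name, passed) pairs into {language: (passed, failed, pass_pct)}.

    Assumption: the language is the first path component of the test name.
    """
    counts = defaultdict(lambda: [0, 0])  # language -> [passed, failed]
    for name, passed in results:
        language = name.split("/", 1)[0]
        counts[language][0 if passed else 1] += 1

    breakdown = {}
    for language in sorted(counts):
        ok, bad = counts[language]
        # Every language here has at least one test, so the total is never zero.
        breakdown[language] = (ok, bad, 100.0 * ok / (ok + bad))
    return breakdown


def format_breakdown(breakdown):
    """Render the breakdown as aligned plain-text rows."""
    lines = [f"{'language':<12}{'pass/fail':>10}{'pass %':>8}"]
    for language, (ok, bad, pct) in breakdown.items():
        label = language + "/..."
        ratio = f"{ok}/{bad}"
        lines.append(f"{label:<12}{ratio:>10}{pct:>8.1f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up results, purely for demonstration.
    sample = [("cpp/a", True), ("cpp/b", False), ("go/c", True)]
    print(format_breakdown(language_breakdown(sample)))
```

The other columns (timeouts, syn_err, and so on) could be summed per language in the same loop.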
@ziemkowski @Mushoz Ok, done.
Any feedback, @paul-gauthier?