Feature: print benchmark stats broken down by language
When running the Aider benchmark, it is sometimes useful to analyze a model's performance by programming language. For example, a user may want to pick a model that does particularly well in Go, even if its overall benchmark score is lower.
I added some self-contained code to benchmark.py so that running benchmark.py --stats together with --verbose prints the benchmark stats broken down by language at the bottom of the report. Without --verbose, the behavior is unchanged.
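To give a rough idea of the approach, here is a minimal sketch of how such a per-language aggregation could be written. It is not the exact code added in this PR; the `.aider.results.json` filename and the language-in-the-path assumption reflect my understanding of the benchmark's directory layout.

```python
# Minimal sketch only, not the code added in this PR: one way to aggregate
# per-test results by language. It assumes each completed test case leaves a
# .aider.results.json file and that the language name appears somewhere in the
# test case's directory path -- both assumptions about the benchmark layout.
import json
from collections import defaultdict
from pathlib import Path

LANGUAGES = ["python", "go", "rust", "cpp", "javascript", "java"]

def stats_by_language(dirname: Path) -> dict:
    """Sum the numeric fields of every results file, grouped by language."""
    stats = defaultdict(lambda: defaultdict(float))
    for results_file in dirname.glob("**/.aider.results.json"):
        # Infer the language from the path components of the test case.
        lang = next((p for p in results_file.parts if p in LANGUAGES), "unknown")
        results = json.loads(results_file.read_text())
        stats[lang]["completed_tests"] += 1
        for key, val in results.items():
            if isinstance(val, (int, float)) and not isinstance(val, bool):
                stats[lang][key] += val
    return stats
```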
Here is an example:
./benchmark/benchmark.py --stats --verbose reports_from_benchmarks/gpt-oss-20b/medium/whole/2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium/
──────────────────────────────────────────── reports_from_benchmarks/gpt-oss-20b/medium/whole/2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium ─────────────────────────────────────────────
- dirname: 2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium
test_cases: 225
model: openai/openai/gpt-oss-20b
edit_format: whole
commit_hash: 32faf82-dirty
reasoning_effort: medium
pass_rate_1: 9.8
pass_rate_2: 36.0
pass_num_1: 22
pass_num_2: 81
percent_cases_well_formed: 100.0
error_outputs: 27
num_malformed_responses: 0
num_with_malformed_responses: 0
user_asks: 154
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 2162608
completion_tokens: 1224921
test_timeouts: 4
total_tests: 225
command: aider --model openai/openai/gpt-oss-20b
date: 2025-09-12
versions: 0.86.2.dev
seconds_per_case: 801.2
total_cost: 0.0000
costs: $0.0000/test-case, $0.00 total, $0.00 projected
======== Stats by language ========
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
| | python | go | rust | cpp | javascript | java |
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
| completed_tests | 34 | 39 | 30 | 26 | 49 | 47 |
| duration | 24,957.62 | 21,706.71 | 17,028.67 | 51,506.41 | 29,789.68 | 35,275.56 |
| avg_duration_per_test | 734.05 | 556.58 | 567.62 | 1,981.02 | 607.95 | 750.54 |
| cost | - | - | - | - | - | - |
| pass_rate_0 | 5.88 | 5.13 | 6.67 | 7.69 | 4.08 | 4.26 |
| pass_rate_1 | 35.29 | 30.77 | 40.00 | 46.15 | 24.49 | 25.53 |
| pass_num_0 | 2 | 2 | 2 | 2 | 2 | 2 |
| pass_num_1 | 12 | 12 | 12 | 12 | 12 | 12 |
| error_outputs | 7 | 2 | 3 | - | 14 | 1 |
| user_asks | 1 | 1 | - | 139 | - | 13 |
| test_timeouts | - | - | 1 | - | 2 | 1 |
| exhausted_context_windows | - | - | - | - | - | - |
| num_malformed_responses | - | - | - | - | - | - |
| num_with_malformed_responses | - | - | - | - | - | - |
| syntax_errors | - | - | - | - | - | - |
| indentation_errors | - | - | - | - | - | - |
| lazy_comments | - | - | - | - | - | - |
| prompt_tokens | 204,931 | 159,565 | 127,949 | 1,078,034 | 247,566 | 344,563 |
| completion_tokens | 138,725 | 159,982 | 128,591 | 379,616 | 185,134 | 232,873 |
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
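For completeness, here is a hedged sketch of how the table above could be rendered from the aggregated totals. `stats_by_language` is the hypothetical helper from the earlier sketch, and `-` is printed for zero or missing cells, mirroring the output shown; the column widths are illustrative.

```python
def print_language_table(stats: dict) -> None:
    """Print the per-language totals as a fixed-width table, one column per language."""
    langs = list(stats)
    rows = sorted({key for per_lang in stats.values() for key in per_lang})
    widths = [28] + [10] * len(langs)
    divider = "| " + " | ".join("-" * w for w in widths) + " |"

    def fmt(cells):
        # Pad every cell to its column width so the pipes line up.
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"

    print("======== Stats by language ========")
    print(divider)
    print(fmt([""] + langs))
    print(divider)
    for row in rows:
        cells = [row]
        for lang in langs:
            val = stats[lang].get(row, 0)
            # Use "-" for zero/missing values, thousands separators otherwise.
            cells.append(f"{val:,.2f}".rstrip("0").rstrip(".") if val else "-")
        print(fmt(cells))
    print(divider)
```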