
Feature: print benchmark stats broken down by language

Open · itsmeknt opened this issue 2 months ago · 1 comment

When running the aider benchmark, it is sometimes useful to analyze model performance by programming language. Some users may want to choose a model that does better specifically in Go, even if its overall benchmark score is lower.

I added some self-contained code to benchmark.py so that when you call benchmark.py --stats together with --verbose, it prints the benchmark stats broken down by language at the bottom of the report. Without --verbose, the behavior is unchanged.

Here is an example:

./benchmark/benchmark.py --stats --verbose reports_from_benchmarks/gpt-oss-20b/medium/whole/2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium/

──────────────────────────────────────────── reports_from_benchmarks/gpt-oss-20b/medium/whole/2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium ────────────────────────────────────────────
- dirname: 2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium
  test_cases: 225
  model: openai/openai/gpt-oss-20b
  edit_format: whole
  commit_hash: 32faf82-dirty
  reasoning_effort: medium
  pass_rate_1: 9.8
  pass_rate_2: 36.0
  pass_num_1: 22
  pass_num_2: 81
  percent_cases_well_formed: 100.0
  error_outputs: 27
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 154
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2162608
  completion_tokens: 1224921
  test_timeouts: 4
  total_tests: 225
  command: aider --model openai/openai/gpt-oss-20b
  date: 2025-09-12
  versions: 0.86.2.dev
  seconds_per_case: 801.2
  total_cost: 0.0000

costs: $0.0000/test-case, $0.00 total, $0.00 projected

======== Stats by language ========

| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
|                              |   python  |     go    |    rust   |    cpp    | javascript |    java   |
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
| completed_tests              |        34 |        39 |        30 |        26 |         49 |        47 |
| duration                     | 24,957.62 | 21,706.71 | 17,028.67 | 51,506.41 |  29,789.68 | 35,275.56 |
| avg_duration_per_test        |    734.05 |    556.58 |    567.62 |  1,981.02 |     607.95 |    750.54 |
| cost                         |         - |         - |         - |         - |          - |         - |
| pass_rate_0                  |      5.88 |      5.13 |      6.67 |      7.69 |       4.08 |      4.26 |
| pass_rate_1                  |     35.29 |     30.77 |     40.00 |     46.15 |      24.49 |     25.53 |
| pass_num_0                   |         2 |         2 |         2 |         2 |          2 |         2 |
| pass_num_1                   |        12 |        12 |        12 |        12 |         12 |        12 |
| error_outputs                |         7 |         2 |         3 |         - |         14 |         1 |
| user_asks                    |         1 |         1 |         - |       139 |          - |        13 |
| test_timeouts                |         - |         - |         1 |         - |          2 |         1 |
| exhausted_context_windows    |         - |         - |         - |         - |          - |         - |
| num_malformed_responses      |         - |         - |         - |         - |          - |         - |
| num_with_malformed_responses |         - |         - |         - |         - |          - |         - |
| syntax_errors                |         - |         - |         - |         - |          - |         - |
| indentation_errors           |         - |         - |         - |         - |          - |         - |
| lazy_comments                |         - |         - |         - |         - |          - |         - |
| prompt_tokens                |   204,931 |   159,565 |   127,949 | 1,078,034 |    247,566 |   344,563 |
| completion_tokens            |   138,725 |   159,982 |   128,591 |   379,616 |    185,134 |   232,873 |
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
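For readers curious how such a breakdown can be produced, here is a minimal, self-contained sketch of the per-language aggregation. The field names (`language`, `duration`, `passed`) are illustrative assumptions for this sketch, not aider's actual result schema, and the function name `stats_by_language` is hypothetical.

```python
from collections import defaultdict


def stats_by_language(results):
    """Aggregate per-test benchmark results into per-language stats.

    `results` is a list of dicts with (assumed) keys:
      - "language": the exercise's programming language
      - "duration": seconds spent on the test case
      - "passed":   whether the final attempt passed
    """
    stats = defaultdict(lambda: {"completed_tests": 0, "duration": 0.0, "passed": 0})
    for res in results:
        s = stats[res["language"]]
        s["completed_tests"] += 1
        s["duration"] += res["duration"]
        s["passed"] += 1 if res["passed"] else 0

    # Derive averages and pass rates once all cases are tallied.
    for s in stats.values():
        s["avg_duration_per_test"] = s["duration"] / s["completed_tests"]
        s["pass_rate"] = 100.0 * s["passed"] / s["completed_tests"]
    return dict(stats)
```

Printing the resulting dict as a table (one column per language, one row per stat) then gives output like the report above.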

itsmeknt — Sep 18, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

:white_check_mark: cryptekbits
:white_check_mark: itsmeknt
:x: dwash96

CLAassistant — Sep 18, 2025