aider
Added GLM-4.6 polyglot benchmark
I just ran this. I believe I hadn't enabled reasoning, but I saw some *Thinking* in the responses.
This is for reference in case anyone comes here from Google: /benchmarks/2025-10-15-17-41-04--glm-4.6
- dirname: 2025-10-15-17-41-04--glm-4.6
test_cases: 225
model: openrouter/z-ai/glm-4.6
edit_format: diff
commit_hash: 11516d6-dirty
pass_rate_1: 11.6
pass_rate_2: 36.4
pass_num_1: 26
pass_num_2: 82
percent_cases_well_formed: 93.8
error_outputs: 26
num_malformed_responses: 17
num_with_malformed_responses: 14
user_asks: 88
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 2493948
completion_tokens: 347138
test_timeouts: 5
total_tests: 225
command: aider --model openrouter/z-ai/glm-4.6
date: 2025-10-15
versions: 0.86.2.dev
seconds_per_case: 49.8
total_cost: 1.8545
costs: $0.0082/test-case, $1.85 total, $1.85 projected
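For anyone skimming these numbers: pass_rate_2 is just pass_num_2 / test_cases, i.e. 82 / 225 ≈ 36.4%, and the per-case cost is total_cost / test_cases, i.e. $1.8545 / 225 ≈ $0.0082.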
36% at pass_rate_2? That's worse than Qwen3 32B.
I was surprised as well, but that's what I got.
Something seems wrong. Have you set "enable_thinking": True?
No, I haven't. That's the benchmark for the non-thinking variant.
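For anyone who wants to try the thinking variant: a minimal sketch of a .aider.model.settings.yml entry that passes the flag through the request body (untested; whether OpenRouter forwards enable_thinking to the underlying provider is an assumption on my part):

- name: openrouter/z-ai/glm-4.6
  extra_params:
    extra_body:
      enable_thinking: true  # assumption: the routed provider honors this flag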
In my experience, glm-4.6 + deepseek 3.2 exp (as the weak model & editor model) works surprisingly well. Sometimes GLM itself makes mistakes when editing files.
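For reference, that combination can be set up directly on the command line; a sketch (the exact OpenRouter slug for DeepSeek 3.2 exp is an assumption):

aider --model openrouter/z-ai/glm-4.6 \
  --editor-model openrouter/deepseek/deepseek-v3.2-exp \
  --weak-model openrouter/deepseek/deepseek-v3.2-exp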
Benchmarks on OpenRouter must be taken with a grain of salt because we don't always know the provider or the model quantization behind each request; it could be a mix of several. That may explain the surprisingly low score.
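One way to reduce that variance is to pin the provider and quantization via OpenRouter's provider routing, passed through aider's model settings; a sketch (field names follow OpenRouter's provider preferences, and the provider slug is a placeholder):

- name: openrouter/z-ai/glm-4.6
  extra_params:
    extra_body:
      provider:
        order: ["some-provider"]   # placeholder provider slug
        allow_fallbacks: false     # fail rather than route elsewhere
        quantizations: ["fp8"]     # accept only this quantization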
These are my results:
- dirname: 2025-10-29-08-39-00--glm-4.6
test_cases: 225
model: openrouter/z-ai/glm-4.6
edit_format: diff
commit_hash: 11516d6
pass_rate_1: 16.0
pass_rate_2: 44.4
pass_num_1: 36
pass_num_2: 100
percent_cases_well_formed: 93.3
error_outputs: 22
num_malformed_responses: 16
num_with_malformed_responses: 15
user_asks: 33
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 2262299
completion_tokens: 366044
test_timeouts: 9
total_tests: 225
command: aider --model openrouter/z-ai/glm-4.6
date: 2025-10-29
versions: 0.86.2.dev
seconds_per_case: 35.3
total_cost: 1.5455
costs: $0.0069/test-case, $1.55 total, $1.55 projected
pass_rate_2 is 44.4, which is better than the 36.4 you got, but still worse than I would expect.
I'm trying with the openrouter glm-4.6:exacto endpoint now to see if it makes a difference.
- dirname: 2025-10-29-08-39-00--glm-4.6-exacto
test_cases: 225
model: openrouter/z-ai/glm-4.6:exacto
edit_format: diff
commit_hash: 11516d6
pass_rate_1: 13.8
pass_rate_2: 47.6
pass_num_1: 31
pass_num_2: 107
percent_cases_well_formed: 91.6
error_outputs: 23
num_malformed_responses: 19
num_with_malformed_responses: 19
user_asks: 93
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 2618559
completion_tokens: 397826
test_timeouts: 9
total_tests: 225
command: aider --model openrouter/z-ai/glm-4.6:exacto
date: 2025-10-29
versions: 0.86.2.dev
seconds_per_case: 43.8
total_cost: 1.7436
costs: $0.0077/test-case, $1.74 total, $1.74 projected
Slightly better, but still below expectations. I might need to adjust reasoning effort or other parameters; I'm not sure which defaults are used.
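For what it's worth, reasoning effort can be set straight from the command line; a sketch (whether OpenRouter forwards it for this model is a separate question):

aider --model openrouter/z-ai/glm-4.6:exacto --reasoning-effort high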
Tried with reasoning_effort: high and got slightly worse results:
- dirname: 2025-10-29-08-39-00--glm-4.6-exacto-reasoning-high
test_cases: 225
model: openrouter/z-ai/glm-4.6:exacto
edit_format: diff
commit_hash: 11516d6
reasoning_effort: high
pass_rate_1: 12.0
pass_rate_2: 41.3
pass_num_1: 27
pass_num_2: 93
percent_cases_well_formed: 90.7
error_outputs: 27
num_malformed_responses: 23
num_with_malformed_responses: 21
user_asks: 85
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 3086467
completion_tokens: 412182
test_timeouts: 7
total_tests: 225
command: aider --model openrouter/z-ai/glm-4.6:exacto
date: 2025-10-29
versions: 0.86.2.dev
seconds_per_case: 44.0
total_cost: 1.9559
costs: $0.0087/test-case, $1.96 total, $1.96 projected
I'm not sure, though, whether the parameter is actually forwarded and respected. One indicator that it might be: prompt_tokens and cost are a bit higher, although seconds_per_case increased only slightly.
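One way to check would be to run a single case with --verbose and inspect the request parameters aider logs; a sketch (I'm assuming the verbose output shows the forwarded params):

aider --model openrouter/z-ai/glm-4.6:exacto --reasoning-effort high --verbose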
Time to cancel the glm-4.6 subscription.
Now let's see the Kimi K2 Thinking benchmarks.