aider
Added GLM-4.6 polyglot benchmark
I just ran this. I believe I hadn't enabled reasoning, but I saw some *Thinking* in the responses.
This is for reference in case anyone comes here from Google: /benchmarks/2025-10-15-17-41-04--glm-4.6
- dirname: 2025-10-15-17-41-04--glm-4.6
test_cases: 225
model: openrouter/z-ai/glm-4.6
edit_format: diff
commit_hash: 11516d6-dirty
pass_rate_1: 11.6
pass_rate_2: 36.4
pass_num_1: 26
pass_num_2: 82
percent_cases_well_formed: 93.8
error_outputs: 26
num_malformed_responses: 17
num_with_malformed_responses: 14
user_asks: 88
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 2493948
completion_tokens: 347138
test_timeouts: 5
total_tests: 225
command: aider --model openrouter/z-ai/glm-4.6
date: 2025-10-15
versions: 0.86.2.dev
seconds_per_case: 49.8
total_cost: 1.8545
costs: $0.0082/test-case, $1.85 total, $1.85 projected
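For anyone skimming these numbers: pass_rate_2 is just pass_num_2 / test_cases, i.e. 82 / 225 ≈ 36.4%, and the per-case cost is total_cost / test_cases, i.e. $1.8545 / 225 ≈ $0.0082.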
36% at pass_rate_2? That's worse than Qwen3 32B.
I was surprised as well, but that's what I got.
Something seems wrong. Have you set "enable_thinking": True?
No, I haven't. That's the benchmark for the non-thinking variant.
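For anyone who wants to try the thinking variant: a minimal sketch of a .aider.model.settings.yml entry that passes the flag through the request body (untested; whether OpenRouter forwards enable_thinking to the underlying provider is an assumption on my part):

- name: openrouter/z-ai/glm-4.6
  extra_params:
    extra_body:
      enable_thinking: true  # assumption: the routed provider honors this flag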
In my experience, glm-4.6 + deepseek 3.2 exp (as the weak model & editor model) works surprisingly well. Sometimes GLM itself makes mistakes when editing files.
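For reference, that combination can be set up directly on the command line; a sketch (the exact OpenRouter slug for DeepSeek 3.2 exp is an assumption):

aider --model openrouter/z-ai/glm-4.6 \
  --editor-model openrouter/deepseek/deepseek-v3.2-exp \
  --weak-model openrouter/deepseek/deepseek-v3.2-exp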
Benchmarks on OpenRouter must be taken with a grain of salt because we don't always know the provider or the model quantization behind each request; it could be a mix of several. That may explain the surprisingly low score.
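One way to reduce that variance is to pin the provider and quantization via OpenRouter's provider routing, passed through aider's model settings; a sketch (field names follow OpenRouter's provider preferences, and the provider slug is a placeholder):

- name: openrouter/z-ai/glm-4.6
  extra_params:
    extra_body:
      provider:
        order: ["some-provider"]   # placeholder provider slug
        allow_fallbacks: false     # fail rather than route elsewhere
        quantizations: ["fp8"]     # accept only this quantization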
These are my results:
- dirname: 2025-10-29-08-39-00--glm-4.6
test_cases: 225
model: openrouter/z-ai/glm-4.6
edit_format: diff
commit_hash: 11516d6
pass_rate_1: 16.0
pass_rate_2: 44.4
pass_num_1: 36
pass_num_2: 100
percent_cases_well_formed: 93.3
error_outputs: 22
num_malformed_responses: 16
num_with_malformed_responses: 15
user_asks: 33
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 2262299
completion_tokens: 366044
test_timeouts: 9
total_tests: 225
command: aider --model openrouter/z-ai/glm-4.6
date: 2025-10-29
versions: 0.86.2.dev
seconds_per_case: 35.3
total_cost: 1.5455
costs: $0.0069/test-case, $1.55 total, $1.55 projected
pass_rate_2 is 44.4, which is better than the 36.4 you got, but still worse than I would expect.
I'm trying with the openrouter glm-4.6:exacto endpoint now to see if it makes a difference.
- dirname: 2025-10-29-08-39-00--glm-4.6-exacto
test_cases: 225
model: openrouter/z-ai/glm-4.6:exacto
edit_format: diff
commit_hash: 11516d6
pass_rate_1: 13.8
pass_rate_2: 47.6
pass_num_1: 31
pass_num_2: 107
percent_cases_well_formed: 91.6
error_outputs: 23
num_malformed_responses: 19
num_with_malformed_responses: 19
user_asks: 93
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 2618559
completion_tokens: 397826
test_timeouts: 9
total_tests: 225
command: aider --model openrouter/z-ai/glm-4.6:exacto
date: 2025-10-29
versions: 0.86.2.dev
seconds_per_case: 43.8
total_cost: 1.7436
costs: $0.0077/test-case, $1.74 total, $1.74 projected
Slightly better, but still below expectations. I might need to adjust reasoning effort or other parameters; I'm not sure which defaults are used.
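For what it's worth, reasoning effort can be set straight from the command line; a sketch (whether OpenRouter forwards it for this model is a separate question):

aider --model openrouter/z-ai/glm-4.6:exacto --reasoning-effort high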
Tried with reasoning_effort: high and got slightly worse results:
- dirname: 2025-10-29-08-39-00--glm-4.6-exacto-reasoning-high
test_cases: 225
model: openrouter/z-ai/glm-4.6:exacto
edit_format: diff
commit_hash: 11516d6
reasoning_effort: high
pass_rate_1: 12.0
pass_rate_2: 41.3
pass_num_1: 27
pass_num_2: 93
percent_cases_well_formed: 90.7
error_outputs: 27
num_malformed_responses: 23
num_with_malformed_responses: 21
user_asks: 85
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 3086467
completion_tokens: 412182
test_timeouts: 7
total_tests: 225
command: aider --model openrouter/z-ai/glm-4.6:exacto
date: 2025-10-29
versions: 0.86.2.dev
seconds_per_case: 44.0
total_cost: 1.9559
costs: $0.0087/test-case, $1.96 total, $1.96 projected
I'm not sure, though, whether the parameter is actually forwarded and respected. One indicator that it might be: prompt_tokens and cost are a bit higher, although seconds_per_case increased only slightly.
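One way to check would be to run a single case with --verbose and inspect the request parameters aider logs; a sketch (I'm assuming the verbose output shows the forwarded params):

aider --model openrouter/z-ai/glm-4.6:exacto --reasoning-effort high --verbose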
Time to cancel the glm-4.6 subscription.
Now let's see the Kimi K2 Thinking benchmarks.