Add benchmark for Gemini 2.5 Pro Preview as architect model and Claude 3.7 Sonnet (32K thinking tokens) as editor model
I've been using Gemini 2.5 Pro Preview as my Architect (Plan) model paired with Claude 3.7 Sonnet (with a 32K thinking-token budget) as my Editor (Act) model for a while now, and this combination has been working exceptionally well for me. I was curious how this setup would compare to other configurations on the Aider leaderboard, so I ran the benchmarks locally.
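For anyone wanting to reproduce the setup, here's roughly how the pairing can be launched. Treat this as a minimal sketch: the model IDs are illustrative (run `aider --list-models gemini` or `aider --list-models claude` to confirm the exact names for your keys), and I believe the `--thinking-tokens` flag targets the main model, so the editor's thinking budget has to be set elsewhere (see the settings sketch further down).

```bash
# Sketch of the architect/editor pairing; model IDs are illustrative.
# aider's --thinking-tokens flag applies to the main (architect) model,
# so the editor's 32K thinking budget is configured per-model instead.
aider --architect \
  --model gemini/gemini-2.5-pro-preview-03-25 \
  --editor-model anthropic/claude-3-7-sonnet-20250219
```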
Result: 75.1% pass rate in round 2 with 100% well-formed responses. This validates my personal experience with this combination.
Burned a lot of tokens and cash too (iterating until I figured out the right config for thinking_tokens..). That's precisely why I'm so grateful for this leaderboard; I consult it regularly. It saves the community from having to individually experiment with expensive configurations.
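For anyone hitting the same wall: one way to pin the thinking budget on the editor model is a per-model entry in `.aider.model.settings.yml`. This is a sketch, not canonical aider settings; it assumes the keys under `extra_params` are forwarded through litellm to Anthropic's extended-thinking API (which is how aider passes provider-specific options along).

```yaml
# Sketch of a .aider.model.settings.yml entry. The `thinking` keys follow
# Anthropic's extended-thinking API shape and pass through via litellm.
- name: anthropic/claude-3-7-sonnet-20250219
  edit_format: diff
  use_repo_map: true
  extra_params:
    max_tokens: 64000        # Anthropic requires this to exceed the thinking budget
    thinking:
      type: enabled
      budget_tokens: 32000   # the 32K thinking-token budget used in this run
```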
Appreciate y'all. Keep shipping.
Hi @paul-gauthier 👋, wondering if you've had a chance to look at this one. I know you've been busy with the providers shipping release after release..
Appreciate you.
Hey @tuannh99, thanks for opening this PR. Quick question: have you also tried the reverse, with Sonnet 3.7 (thinking) as the main (architect) model and Gemini as the code editor? The thinking model's reasoning capabilities seem like a great fit for the architect role.
Also, your combo would land in 2nd place at 75% (cost: $30). My question is whether users would pay $30 for a 3-percentage-point improvement in pass rate over Gemini 2.5 Pro alone (72% at $6) in diff-fenced mode. I'm also a big fan of the aider LLM leaderboard and love exploring different combos.
This article shows o1 as the architect and DeepSeek as the code editor reaching an 85% pass rate! I wonder what the pass rate would be for a combo like o3 (high) as the architect and Gemini 2.5 Pro as the code editing model. Would be interesting to see the results here.