auto-code-rover
Possible underestimated pass@3 results
I evaluated your predictions using my Docker-based SWE-bench evaluator and got 26% on pass@3, compared to the 22% you reported. It might be worthwhile to review the logs for the failed benchmarks to see whether your agent actually achieves even better results :D
You can find the logs and report here.
And here's the sheet I use to compare the results.