
Possible underestimated pass@3 results

Open aorwall opened this issue 9 months ago • 3 comments

I have evaluated your predictions using my Docker-based SWE-bench evaluator. I get 26% on pass@3, compared to the 22% you reported. It might be worthwhile to review the logs for the failed benchmarks to see if your agent can actually achieve even better results :D

You can find the logs and report here

And here's a sheet I use to compare the results.
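
For anyone comparing the two numbers, here is a minimal sketch of how a pass@3 figure could be computed. It assumes pass@3 means a task counts as resolved if at least one of the three runs resolves it. The function name `pass_at_k` and the task IDs are hypothetical and are not taken from either evaluator.

```python
def pass_at_k(runs, all_tasks):
    """Fraction of tasks resolved by at least one of the given runs.

    runs: list of sets, each holding the task IDs resolved in one run.
    all_tasks: list of every task ID in the benchmark subset.
    """
    # Union the per-run results, then ignore IDs outside the benchmark.
    resolved_any = set().union(*runs) & set(all_tasks)
    return len(resolved_any) / len(all_tasks)


# Hypothetical data: four tasks, three runs.
all_tasks = ["task-a", "task-b", "task-c", "task-d"]
runs = [{"task-a"}, {"task-a", "task-b"}, set()]

rate = pass_at_k(runs, all_tasks)
print(f"{rate:.0%}")  # prints "50%"
```

Under this definition, a gap like 22% vs 26% would come from tasks that one evaluator marks as failed in all three runs while the other marks them as resolved in at least one.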

aorwall, May 12 '24 18:05