auto-code-rover Possible underestimated pass@3 results

Open aorwall opened this issue 9 months ago • 3 comments

I evaluated your predictions with my Docker-based SWE-bench evaluator and got 26% on pass@3, compared to the 22% you reported. It might be worth reviewing the logs for the failed benchmark instances to see whether your agent actually achieves even better results :D

You can find the logs and report here.

And here's a sheet I use to compare the results.
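
For anyone comparing the numbers, here is a minimal sketch of how a pass@3 figure like this can be tallied. It assumes pass@3 means an instance counts as resolved if at least one of the three runs resolved it; the function name and the instance IDs are hypothetical and are not taken from either evaluator.

```python
def pass_at_k(resolved_per_run, total_instances):
    """Fraction of instances resolved in at least one run.

    resolved_per_run: list of sets of instance IDs, one set per run.
    total_instances: number of instances in the benchmark split.
    Assumption: an instance passes if any run resolved it.
    """
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    # Union of the per-run sets = instances resolved by any run.
    resolved_any = set().union(*resolved_per_run)
    return len(resolved_any) / total_instances


# Hypothetical IDs, for illustration only.
runs = [
    {"example-1", "example-2"},
    {"example-2", "example-3"},
    {"example-1"},
]
print(f"pass@3: {pass_at_k(runs, total_instances=10):.0%}")
```

A gap like 22% vs. 26% would show up here as instances that one evaluator places in a run's resolved set and the other does not, so diffing the per-run sets is a quick way to find which logs to check.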

May 12 '24 18:05 aorwall
