SWE-bench
Dockerization of run_evaluation.py
Describe the feature
I've been working on building Docker images for all testbeds used in SWE-bench. This works quite well, although I still have 18 failing benchmark instances when I verify against the gold patches in SWE-bench Lite. It could be interesting to collaborate on this, as it might be a more stable and performant solution than using only conda environments.
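For anyone curious what this looks like in practice, the idea is roughly the following. This is a minimal sketch, not the actual build script: the `dockerfiles/` layout and the image naming scheme are my assumptions.

```python
import subprocess
from pathlib import Path

# Assumed layout: one Dockerfile per repo/version testbed, e.g.
# dockerfiles/pydata__xarray/0.12/Dockerfile, each baking in the
# repo checkout and its pinned dependencies.
DOCKERFILES_DIR = Path("dockerfiles")

def build_testbed_images() -> None:
    """Build one image per testbed so evaluation runs never rebuild envs."""
    for dockerfile in sorted(DOCKERFILES_DIR.glob("*/*/Dockerfile")):
        repo, version = dockerfile.parent.parts[-2:]
        tag = f"swe-bench-{repo.lower()}:{version}"  # assumed naming scheme
        subprocess.run(
            ["docker", "build", "-t", tag,
             "-f", str(dockerfile), str(dockerfile.parent)],
            check=True,
        )

if __name__ == "__main__":
    build_testbed_images()
```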
Potential Solutions
Check out this repo, where I pushed all the Dockerfiles, a simplified version of the TaskEnvContextManager I use inside the Docker container, and some scripts to run it all: https://github.com/aorwall/SWE-bench-docker
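For reference, the shape of such a context manager is roughly the following. This is a hedged sketch rather than the code from that repo: the image tag, the `/testbed` checkout path, and the test command are all placeholders.

```python
import subprocess
import tempfile
from pathlib import Path

class DockerTaskEnv:
    """Minimal stand-in for TaskEnvContextManager: apply a patch and run
    tests inside a prebuilt testbed image instead of a conda env."""

    def __init__(self, image: str, test_cmd: str, log_file: Path):
        self.image = image        # e.g. "swe-bench-pydata__xarray:0.12" (assumed tag)
        self.test_cmd = test_cmd  # e.g. "pytest xarray/tests/test_dataset.py"
        self.log_file = log_file

    def __enter__(self):
        self.workdir = tempfile.mkdtemp()
        return self

    def run(self, patch: str) -> bool:
        patch_file = Path(self.workdir) / "model.patch"
        patch_file.write_text(patch)
        # Apply the patch and run the tests in one container invocation;
        # /testbed is an assumed checkout location inside the image.
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{self.workdir}:/patches",
            self.image,
            "bash", "-c",
            f"cd /testbed && git apply /patches/model.patch && {self.test_cmd}",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        self.log_file.write_text(result.stdout + result.stderr)
        return result.returncode == 0

    def __exit__(self, exc_type, exc, tb):
        return False  # containers are run with --rm; nothing to tear down
```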
I'm down to 2 failing tests now in pydata/xarray 0.12. I probably need to compare to logs from a successful run to fix those effectively.
I'm also testing testbeds for the regular dataset now, using the gold patches as predictions to check the harness.
I'll chime in that @aorwall's docker images and run_evaluation.py script have worked very well for me. I was able to run ~all of the "lite" tests without problems, whereas with the original conda testbeds, most of the gold-patch tests failed to build or pass.
Also, the docker testbeds launch and execute very quickly compared to re-building the conda testbeds.
"~all the lite" meaning not quite all? I've been struggling to get much to run.
I got all except for pydata__xarray-4094 and pydata__xarray-4493 to run.
@PandelisZ sorry, I should have been clearer. I got 298 out of 300 test cases to work out of the box with @aorwall's SWE-bench-docker tooling. The 2 that fail are known not to work, so that was expected.
I only got a few of the test cases to work with the original/official conda testbeds, even after half a day of trying.
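To tally a run like that, something along these lines works. This is a sketch under assumptions: one log file per instance named `<instance_id>.log`, and a pass-marker string in the log; neither the filename scheme nor the marker is taken from the actual tooling.

```python
from pathlib import Path

# Instances identified above as known-broken, i.e. expected failures.
KNOWN_BROKEN = {"pydata__xarray-4094", "pydata__xarray-4493"}
PASS_MARKER = ">>>>> All Tests Passed"  # assumed marker string

def tally(log_dir: Path) -> None:
    """Count resolved vs. failed instances from per-instance log files."""
    resolved, failed = [], []
    for log in sorted(log_dir.glob("*.log")):
        (resolved if PASS_MARKER in log.read_text() else failed).append(log.stem)
    unexpected = [i for i in failed if i not in KNOWN_BROKEN]
    print(f"resolved {len(resolved)}, failed {len(failed)} "
          f"({len(unexpected)} unexpected)")

if __name__ == "__main__":
    tally(Path("logs"))
```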