SWE-bench
Dockerization of run_evaluation.py
Describe the feature
I've been working on building Docker images for all testbeds used in SWE-bench. This works quite well, although I still have 18 failing benchmark instances when I verify against the gold patches in SWE-bench Lite. It could be interesting to collaborate on this, as it might be a more stable and performant solution than using only conda environments.
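For anyone curious what this looks like in practice, the idea is roughly the following. This is a minimal sketch, not the actual build script: the `dockerfiles/` layout and the image naming scheme are my assumptions.

```python
import subprocess
from pathlib import Path

# Assumed layout: one Dockerfile per repo/version testbed, e.g.
# dockerfiles/pydata__xarray/0.12/Dockerfile, each baking in the
# repo checkout and its pinned dependencies.
DOCKERFILES_DIR = Path("dockerfiles")

def build_testbed_images() -> None:
    """Build one image per testbed so evaluation runs never rebuild envs."""
    for dockerfile in sorted(DOCKERFILES_DIR.glob("*/*/Dockerfile")):
        repo, version = dockerfile.parent.parts[-2:]
        tag = f"swe-bench-{repo.lower()}:{version}"  # assumed naming scheme
        subprocess.run(
            ["docker", "build", "-t", tag,
             "-f", str(dockerfile), str(dockerfile.parent)],
            check=True,
        )

if __name__ == "__main__":
    build_testbed_images()
```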
Potential Solutions
Check out this repo, where I pushed all the Dockerfiles, a simplified version of the TaskEnvContextManager I use inside the Docker container, and some scripts to run it all: https://github.com/aorwall/SWE-bench-docker
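For reference, the shape of such a context manager is roughly the following. This is a hedged sketch rather than the code from that repo: the image tag, the `/testbed` checkout path, and the test command are all placeholders.

```python
import subprocess
import tempfile
from pathlib import Path

class DockerTaskEnv:
    """Minimal stand-in for TaskEnvContextManager: apply a patch and run
    tests inside a prebuilt testbed image instead of a conda env."""

    def __init__(self, image: str, test_cmd: str, log_file: Path):
        self.image = image        # e.g. "swe-bench-pydata__xarray:0.12" (assumed tag)
        self.test_cmd = test_cmd  # e.g. "pytest xarray/tests/test_dataset.py"
        self.log_file = log_file

    def __enter__(self):
        self.workdir = tempfile.mkdtemp()
        return self

    def run(self, patch: str) -> bool:
        patch_file = Path(self.workdir) / "model.patch"
        patch_file.write_text(patch)
        # Apply the patch and run the tests in one container invocation;
        # /testbed is an assumed checkout location inside the image.
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{self.workdir}:/patches",
            self.image,
            "bash", "-c",
            f"cd /testbed && git apply /patches/model.patch && {self.test_cmd}",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        self.log_file.write_text(result.stdout + result.stderr)
        return result.returncode == 0

    def __exit__(self, exc_type, exc, tb):
        return False  # containers are run with --rm; nothing to tear down
```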
I'm down to 2 failing tests now in pydata/xarray 0.12. I probably need to compare to logs from a successful run to fix those effectively.
I'm also testing testbeds for the regular dataset now, using the gold patches as predictions to check the harness.
I'll chime in that @aorwall's docker images and run_evaluation.py script have worked very well for me. I was able to run ~all of the "lite" tests without problems, whereas with the original conda testbeds, most of the gold-patch tests failed to build or pass.
Also, the docker testbeds launch and execute very quickly compared to re-building the conda testbeds.
"~all the lite" meaning not quite all? I've been struggling to get much to run.
I got all except for pydata__xarray-4094 and pydata__xarray-4493 to run.
@PandelisZ sorry, I should have been clearer. I got 298 out of 300 test cases to work out of the box with @aorwall's SWE-bench-docker tooling. The 2 that fail are known not to work, so that was expected.
I only got a few of the test cases to work with the original/official conda testbeds, even after half a day of trying.
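To tally a run like that, something along these lines works. This is a sketch under assumptions: one log file per instance named `<instance_id>.log`, and a pass-marker string in the log; neither the filename scheme nor the marker is taken from the actual tooling.

```python
from pathlib import Path

# Instances identified above as known-broken, i.e. expected failures.
KNOWN_BROKEN = {"pydata__xarray-4094", "pydata__xarray-4493"}
PASS_MARKER = ">>>>> All Tests Passed"  # assumed marker string

def tally(log_dir: Path) -> None:
    """Count resolved vs. failed instances from per-instance log files."""
    resolved, failed = [], []
    for log in sorted(log_dir.glob("*.log")):
        (resolved if PASS_MARKER in log.read_text() else failed).append(log.stem)
    unexpected = [i for i in failed if i not in KNOWN_BROKEN]
    print(f"resolved {len(resolved)}, failed {len(failed)} "
          f"({len(unexpected)} unexpected)")

if __name__ == "__main__":
    tally(Path("logs"))
```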