SWE-agent Speed up evaluation by caching task environments as docker images

What does this implement?

This PR introduces the ability to cache the environment created for each SWE-bench task as a Docker image. It saves the filesystem and the environment variables (written to a file) with docker commit, which produces a new Docker image with a tag unique to the given task. The tag contains the dataset name, split, and task number. The feature can be enabled with the --cache_task_images flag.
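
A minimal sketch of the tagging and commit step described above, not the code from this PR. The repository name task-env-cache, both function names, and the slug format are hypothetical; the only real interface used is the Docker CLI form docker commit CONTAINER IMAGE. The sketch only builds the command, it does not run Docker.

```python
import re
import shlex


def cached_image_name(dataset: str, split: str, task_index: int) -> str:
    """Build an image reference that is unique per task.

    Hypothetical naming scheme: the tag combines dataset name, split and
    task number, lowercased and reduced to characters valid in a Docker tag.
    """
    raw = f"{dataset}-{split}-{task_index}"
    # Replace every run of characters outside [a-z0-9] with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    return f"task-env-cache:{slug}"


def docker_commit_command(container: str, image: str) -> list[str]:
    """Return the argv for `docker commit CONTAINER IMAGE` (not executed here)."""
    return ["docker", "commit", container, image]


if __name__ == "__main__":
    image = cached_image_name("princeton-nlp/SWE-bench_Lite", "dev", 7)
    # The container name below is a placeholder for the prepared task container.
    print(shlex.join(docker_commit_command("prepared-task-container", image)))
```

On a later run, an evaluation could check whether an image with this name already exists and start the container from it instead of rebuilding the environment.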

This change addresses the large share of evaluation time that is spent setting up task environments. Timing results on the dev split of princeton-nlp/SWE-bench_Lite (23 tasks), on a 2-core VM:

Avg. time to prepare 1 task environment: 78.3 sec
Avg. time to load cached environment from the image: 10.1 sec

The repo states an average task run time of about 1.5 minutes, so this PR can speed up consecutive evaluations by up to 40% (depending on the hardware setup).
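
One way to arrive at the 40% figure, as a rough check rather than a measurement. It assumes the 1.5-minute run time excludes environment setup, which is not stated in the repo; the function name is made up for this example.

```python
def wall_time_reduction(setup_uncached: float, setup_cached: float, run: float) -> float:
    """Fraction of per-task wall time saved by loading a cached environment."""
    before = setup_uncached + run  # prepare the environment from scratch, then run
    after = setup_cached + run     # load the cached image, then run
    return (before - after) / before


# 78.3 s and 10.1 s are the timings measured above; 90 s is the assumed
# agent run time per task (1.5 minutes, setup not included).
reduction = wall_time_reduction(78.3, 10.1, 90.0)
print(f"{reduction:.1%}")  # prints 40.5%
```

If the 1.5 minutes already includes setup, the relative saving would be larger, so the result is sensitive to this assumption.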

Any other comments?

I expected the change to use a small amount of disk space, since all task environments share the same base image and Docker uses OverlayFS to avoid storing duplicate image parts. However, each image ends up using ~1.5 GB of disk space per task. The dev split of SWE-bench_Lite requires ~40 GB of disk space, while the test split would consume ~500 GB. Although this issue should be addressed later, it could still be a reasonable trade-off when running a few consecutive evaluations to test some changes.

May 06 '24 20:05 ollmer

Very cool stuff! I'll take a closer look at that on Friday!

May 08 '24 01:05 klieret

Codecov Report

Attention: Patch coverage is 37.50000% with 35 lines in your changes are missing coverage. Please review.

No coverage uploaded for pull request base (main@088aabd). Report is 1 commits behind head on main.

Current head fef3d32 differs from pull request most recent head 3d28971. Consider uploading reports for the commit 3d28971 to get more accurate results.

Files	Patch %	Lines
sweagent/environment/swe_env.py	29.54%	31 Missing :warning:
sweagent/environment/utils.py	66.66%	4 Missing :warning:

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #317   +/-   ##
=======================================
  Coverage        ?   75.72%           
=======================================
  Files           ?       18           
  Lines           ?     2892           
  Branches        ?        0           
=======================================
  Hits            ?     2190           
  Misses          ?      702           
  Partials        ?        0

May 08 '24 01:05 codecov[bot]

I think this looks great! The only thing we'd have to fix is the naming issue depending on the nature of data_path/rejecting the flag if data_path is something unsuitable

May 13 '24 20:05 klieret

> I think this looks great! The only thing we'd have to fix is the naming issue depending on the nature of data_path/rejecting the flag if data_path is something unsuitable

Let me push this on top of your branch :)

May 13 '24 20:05 klieret

> I think this looks great! The only thing we'd have to fix is the naming issue depending on the nature of data_path/rejecting the flag if data_path is something unsuitable
>
> Let me push this on top of your branch :)

Sure. How can I do that?

May 13 '24 21:05 ollmer

> Sure. How can I do that?

Probably I already can :) (else it should be this setting).

Realistically, I'll probably only get to it this Wednesday though, so no reason to wait for me until then haha

May 13 '24 21:05 klieret

> Probably I already can :) (else it should be this setting).

Aha, I see. I've enabled that option, thanks.

May 13 '24 21:05 ollmer

Hmm, somehow pushing on this PR doesn't work, not sure why. Let me merge your PR and then apply my changes on top :)

May 27 '24 21:05 klieret

Thanks again for the very nice addition! ❤️

May 27 '24 21:05 klieret

I've highlighted your contribution in our changelog :)

May 28 '24 18:05 klieret
