
feat(cache): separate python installation from base image by adding pre-built remote cache

Open Kaiyang-Chen opened this issue 2 years ago • 5 comments

Description

In the current LLB compilation (shown in the figure below), we pull the base (custom) image in the first layer. This means that if the user changes the base image (a different CUDA version, a different OS, etc.), every cache from previous builds will miss. Under my network conditions, user group creation and Python installation with conda took around 1 minute. I think this step can be sped up by using a pre-built remote cache for each Python version, built from a fixed image. As demonstrated in the figure below, whenever the user changes the base image, we can simply pull the llb.Diff(fixStage, pythonStage) cache and perform llb.Merge() between it and the base image.
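To make the idea concrete, here is a minimal sketch of the diff-then-merge flow. It is a toy model over in-memory maps, not the BuildKit API: the `Layer` type and the `Diff` and `Merge` functions below are hypothetical stand-ins that only mimic the intended semantics of llb.Diff() and llb.Merge(), and the file paths are made up for illustration.

```go
package main

import "fmt"

// Layer is a toy stand-in for a filesystem snapshot: path -> file content.
// It is NOT a BuildKit type; it only illustrates the semantics.
type Layer map[string]string

// Diff returns the files that are new or changed in upper relative to lower.
// Deletions are ignored in this simplified model.
func Diff(lower, upper Layer) Layer {
	out := Layer{}
	for path, content := range upper {
		if old, ok := lower[path]; !ok || old != content {
			out[path] = content
		}
	}
	return out
}

// Merge overlays the layers in order; later layers win on conflicting paths.
func Merge(layers ...Layer) Layer {
	out := Layer{}
	for _, l := range layers {
		for path, content := range l {
			out[path] = content
		}
	}
	return out
}

func main() {
	// fixStage: the fixed image the Python cache is built from.
	fixStage := Layer{"/etc/os-release": "fixed"}
	// pythonStage: fixStage plus the Python installation.
	pythonStage := Layer{
		"/etc/os-release":       "fixed",
		"/opt/conda/bin/python": "py3.9",
	}
	// Only the Python files survive the diff, so it can be cached once.
	pythonDiff := Diff(fixStage, pythonStage)

	// A new base image chosen by the user does not invalidate pythonDiff.
	newBase := Layer{
		"/etc/os-release":         "cuda-11.8",
		"/usr/local/cuda/version": "11.8",
	}
	final := Merge(newBase, pythonDiff)
	fmt.Println(len(pythonDiff), len(final))
}
```

The point of the model is that `pythonDiff` depends only on `fixStage` and `pythonStage`, so swapping `newBase` leaves it reusable.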

Potential problems

  • I am not sure whether buildkit supports outputting the llb.Diff() layer. If it does not, we can work around this by caching pythonStage and performing the llb.Diff() manually.
  • The method above modifies etc/passwd & etc/usergroup when creating the user group. When merging with the base image, conflicts in these files between different OSes might cause problems.
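The second concern could be checked before merging. The sketch below assumes both sides are available as plain path-to-content maps, which is a simplification and not how BuildKit represents layers; `Conflicts` is a hypothetical helper, and the file contents are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Conflicts returns the sorted paths that exist in both the base image and
// the overlay (e.g. the cached Python diff) with different content.
// Hypothetical helper: layers are modeled as path -> content maps.
func Conflicts(base, overlay map[string]string) []string {
	var paths []string
	for p, c := range overlay {
		if b, ok := base[p]; ok && b != c {
			paths = append(paths, p)
		}
	}
	sort.Strings(paths)
	return paths
}

func main() {
	base := map[string]string{
		"etc/passwd": "root:x:0:0\nnvidia:x:999:999",
	}
	pythonDiff := map[string]string{
		"etc/passwd":           "root:x:0:0\nenvd:x:1000:1000",
		"opt/conda/bin/python": "py3.9",
	}
	// Any path listed here would be silently overwritten by a naive merge.
	fmt.Println(Conflicts(base, pythonDiff))
}
```

A non-empty result would signal that files such as etc/passwd need a real merge strategy (for example, re-running the user creation step on top of the new base) rather than a plain overlay.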

Other thoughts

If outputting the llb.Diff() layer is possible, we might be able to pre-build caches for large packages such as pytorch and cuda-related components, and use them as plug-ins for the base image. Since package downloading takes significant time when building a docker environment, this should speed up the build process considerably.


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

Kaiyang-Chen avatar Dec 30 '22 23:12 Kaiyang-Chen

Thanks for your contribution! I think the core problem here is on the buildkit side: how we can inspect the llb.Diff node, and whether it's possible to export it separately. Can you raise the question in the buildkit repo and link it here as well? Thanks!

VoVAllen avatar Jan 03 '23 02:01 VoVAllen

  • LLB Merge could be problematic when the inputs have overlapping directories.
  • Maintaining a remote cache for different Python versions also means keeping up with security updates.

You need to check the v1 graph. It should support Python with or without Conda/Mamba.

kemingy avatar Jan 03 '23 03:01 kemingy

Thanks for the proposal!

We can optimize the workflow further. For example, we can investigate whether we could merge the pytorch/tensorflow packages into the environment image directly, instead of downloading and installing them from pypi.

The tf/torch packages are very large, so it may be faster to keep a remote cache for them.

gaocegege avatar Jan 03 '23 08:01 gaocegege

The same goes for the starship package: it is hosted on a GitHub domain, which makes it hard to install without a cache when there are network issues.

VoVAllen avatar Jan 03 '23 09:01 VoVAllen

Yep. starship. It is hard to install here in CN.

gaocegege avatar Jan 03 '23 09:01 gaocegege