@conda_base to recycle identical conda envs

Open crypdick opened this issue 2 years ago • 6 comments

Current behavior:

@conda_base generates a separate conda env directory for each (flow, dependencies) combo.

Requested behavior:

Metaflow should reuse an existing conda env when a flow's dependencies are identical to those of an env that has already been built.

Background:

We are transitioning our mono-repo from pip-tools to Metaflow's @conda_base. We wrote an environment.yml parser such that we can decorate all our flows with @conda_base(parse_env()) and reuse the same set of dependencies across all flows.

This works for single flow runs. However, our pipeline CI tests are broken because each flow generates a separate (identical) 7GB conda env, quickly filling up drives.
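For context, the parsing step described above might look roughly like the sketch below. It is not our actual parser: the name parse_conda_specs is hypothetical, it assumes the dependencies list has already been loaded from environment.yml (for example with a YAML library), and it drops the build string on the assumption that only the version is passed on to @conda_base.

```python
def parse_conda_specs(dependencies):
    """Split `conda env export` style specs into a python version and a libraries dict.

    `dependencies` is the already-loaded `dependencies:` list from environment.yml,
    with entries such as "numpy=1.23.2=py39hba7629e_0".
    """
    python_version = None
    libraries = {}
    for spec in dependencies:
        if not isinstance(spec, str):
            # Skip nested sections such as {"pip": [...]}.
            continue
        name, _, rest = spec.partition("=")
        version = rest.split("=")[0]  # keep the version, drop the build string
        if name == "python":
            python_version = version
        else:
            libraries[name] = version
    return {"python": python_version, "libraries": libraries}


# Hypothetical usage on a flow class (assumes Metaflow is installed):
#   @conda_base(**parse_conda_specs(deps))
#   class MyFlow(FlowSpec): ...
```

Because every flow receives the same dictionary, the resolved environments should be identical, which is why reusing them seems feasible.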

crypdick, Aug 23 '22 21:08

@crypdick conda uses hardlinks to save on disk space already. Are you seeing different behavior?

savingoyal, Aug 23 '22 22:08

@savingoyal I repeated the du commands from that link, and indeed the per-env loop (for d in envs/*; do du -sh $d; done) and the single invocation (du -sh envs/*) report different values, so it's not as bad as I thought.

[Screenshot: output of du -sh envs/* pkgs lib bin conda-meta share include etc]

However, each env is still 600 MB. If these disk usages are correct, running our pytest suite locally for 20 flows still requires an unreasonable amount of space, IMO.
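To make the difference between the two du invocations concrete, here is a rough sketch of the same accounting. The function name apparent_size is made up for illustration, and it sums file sizes rather than allocated blocks, so the numbers will not match du exactly. With dedupe=True a hardlinked file is counted once across all the directories given, like a single du call; measuring each env in its own call counts shared files again every time.

```python
import os


def apparent_size(roots, dedupe=True):
    """Sum file sizes under `roots`, optionally counting each inode only once."""
    seen = set()
    total = 0
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for fname in filenames:
                st = os.lstat(os.path.join(dirpath, fname))
                key = (st.st_dev, st.st_ino)  # identifies the file on disk
                if dedupe and key in seen:
                    continue  # hardlink to a file that was already counted
                seen.add(key)
                total += st.st_size
    return total
```

Comparing the sum of apparent_size([env]) over every env with apparent_size(all_envs) gives an estimate of how much space the hardlinks actually save.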

crypdick, Aug 24 '22 21:08

Do you have any other conda package cache besides pkgs? I don't see any reason why two different environments would have the same size (636M) and not rely on a cache.

savingoyal, Aug 24 '22 22:08

Not that I'm aware of, @savingoyal. I checked /opt/ for conda/mamba caches and didn't find anything there.

Additional info:

  • I'm using miniconda3
  • I start with an environment.base.yml
  • I compile this to an environment.yml using mamba solver (cmds simplified for brevity):
mamba env create -f src/environment.base.yml -n tmpenv python=3.9
mamba env export -n tmpenv >> src/environment.yml
mamba env remove --name tmpenv -y

The resulting environment.yml has pinned versions and builds, so I'm not surprised that each env has an identical size. [screenshot]

Update: I also compared the pkgs directories against the Metaflow envs using ls -lLi to see whether the files point to the same inodes, and they appear to be different files on disk. [two screenshots]
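The inode comparison can also be scripted. The following is a minimal sketch, with the hypothetical helper is_hardlinked, that checks whether two paths refer to the same file on disk. Like ls -L, os.stat follows symlinks.

```python
import os


def is_hardlinked(path_a, path_b):
    """Return True if both paths resolve to the same inode on the same device."""
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

If this returns False for a file in pkgs and its counterpart in an env, the env holds a separate copy rather than a hardlink.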

crypdick, Aug 25 '22 15:08

@crypdick you can invoke conda info and mamba info to list all your package caches.

savingoyal, Aug 30 '22 17:08

Here you go @savingoyal: https://gist.github.com/crypdick/106c876a8af1f0403c8dce50b545eaef

crypdick, Sep 14 '22 21:09