clearml-agent
torch version updated when cloning experiment
Hi,
I ran a small experiment from a conda environment with pytorch == 1.10.1 and then tried to run an exact clone of it from the web UI. When the clone runs, the pytorch version is bumped to 1.10.2, which causes conda to download that package. The problem is twofold: first, the environment differs from the expected one; second, the pytorch download takes time, which I would like to avoid when it is unnecessary. Since version 1.10.1 was already cached in the conda packages dir, using it would make the environment setup much faster.
I run clearml-agent on the same machine as the original experiment, using the same conda environment. It is configured to use the same cuda and cudnn versions as the original experiment, so I don't understand why it would attempt to download a different torch version than the original.
Digging deeper into the agent logs, I noticed that when creating an environment for the cloned experiment, the agent runs "conda update" pointing to a requirements file that is similar to the original experiment's, with one small difference: pytorch==1.10.1 is changed to pytorch~=1.10.1. To my understanding, this is why conda downloads a different pytorch version.
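To illustrate why the change from "==" to "~=" matters, here is a minimal sketch of the two operators following the PEP 440 "compatible release" rule for plain numeric versions. The helper names (parse, matches_exact, matches_compatible) are hypothetical and written only for this illustration; this is not the agent's or conda's actual resolver code, and it assumes conda treats "~=" the same way for simple versions like these.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Split a purely numeric dotted version such as '1.10.1' into integers."""
    return tuple(int(part) for part in version.split("."))


def matches_exact(candidate: str, pinned: str) -> bool:
    """'==X.Y.Z': only that exact release is acceptable."""
    return parse(candidate) == parse(pinned)


def matches_compatible(candidate: str, base: str) -> bool:
    """'~=X.Y.Z': at least X.Y.Z, and still inside the X.Y series."""
    cand, low = parse(candidate), parse(base)
    if len(low) < 2:
        raise ValueError("a compatible-release spec needs at least two components")
    prefix = low[:-1]  # (1, 10) for '1.10.1'
    return cand >= low and cand[:len(prefix)] == prefix


if __name__ == "__main__":
    # Print: version, accepted by ==1.10.1, accepted by ~=1.10.1
    for version in ("1.10.0", "1.10.1", "1.10.2", "1.11.0"):
        print(version,
              matches_exact(version, "1.10.1"),
              matches_compatible(version, "1.10.1"))
```

Under this rule 1.10.2 satisfies "~=1.10.1" but not "==1.10.1", so a solver that prefers the newest matching release is free to pick 1.10.2 even when 1.10.1 is already cached.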
I tried setting "conda_full_env_update: false" in the agent config, but it doesn't help. Can this issue be avoided?
For reference, the Installed Packages section of the original experiment in the web UI shows:
# Python 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
clearml == 1.1.6
hydra_core == 1.1.1
numpy == 1.21.2
omegaconf == 2.1.1
torch == 1.10.1
torchvision == 0.11.2
The temporary file created by the agent when running the cloned experiment contains:
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit~=11.3.1
  - numpy~=1.21.2
  - pytorch~=1.10.1
  - torchvision~=0.11.2
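As a quick way to spot which entries are no longer exact pins, the dependency lines above can be checked with a small script. This is only a sketch: the regular expression and the helper names (split_spec, loosened) are hypothetical, it handles only simple "name operator version" entries, and it treats anything other than "==" as a loosened pin.

```python
import re

# Dependency entries copied from the temporary file shown above.
DEPENDENCIES = [
    "cudatoolkit~=11.3.1",
    "numpy~=1.21.2",
    "pytorch~=1.10.1",
    "torchvision~=0.11.2",
]

# name, optional comparison operator, optional version
SPEC = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*(==|~=|>=|<=|!=|>|<|=)?\s*(\S+)?\s*$")


def split_spec(entry):
    """Return (name, operator, version); operator/version are None if absent."""
    match = SPEC.match(entry)
    if match is None:
        raise ValueError(f"unrecognised dependency entry: {entry!r}")
    return match.groups()


def loosened(entries):
    """Names of packages whose spec is anything other than an exact '==' pin."""
    return [name for name, op, _ in map(split_spec, entries) if op != "=="]


if __name__ == "__main__":
    for name in loosened(DEPENDENCIES):
        print(f"{name}: not pinned with '=='")
```

Run against the list above, every package is reported, whereas an entry written as in the Installed Packages section (for example "torch == 1.10.1") would not be.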
Best regards, and thank you for the great product.
I think there was a commit fixing this issue. Basically, conda found the incorrect torch on the "defaults" channel, and then the pip fallback kicked in (resulting in the download, which is, by the way, cached, so you are not actually downloading the same package twice).
See here for the correct order of conda channels. What do you have in your "clearml.conf"?
Also, can you verify with the latest RC, 1.2.0rc1?
Thank you for the kind words @ywyga 😍 this really does make a difference!
Hi,
Thank you for the quick response.
In clearml.conf I have conda_channels: ["pytorch", "conda-forge", "defaults", ], the same as in the reference link you posted.
I also checked RC 1.2.0rc as you suggested, and the issue is still there.
I realize that the download is a one-time occurrence. Still, for the sake of reproducibility, I think the agent should not change package versions without a good reason. Or perhaps this could be controlled by a config flag.
From the docs, my expectation was that "conda_full_env_update: false" would prevent any update of the environment packages, but it looks like even with this flag conda is allowed to upgrade the patch version of the packages because of the "~=" syntax.