clearml-agent
torch version inference logic broken when torchvision is specified
If I start an experiment with the following requirements defined in the UI:
torch==1.3.1
The installation works well. But if I use the following requirements:
torch==1.3.1
torchvision==0.2.1
Then it fails trying to install torch==0.2.1 after installing torch==1.3.1. Probably the parsing of the version of torchvision has an error?
Here is the full log of the error:
Requirement already up-to-date: pip==20.1 in /home/H4dr1en/.trains/venvs-builds/3.7/lib/python3.7/site-packages (20.1)
Collecting Cython
Using cached Cython-0.29.17-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.17
Collecting torch==1.3.1+cpu
File was already downloaded /home/H4dr1en/.trains/pip-download-cache/cu0/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl
Successfully downloaded torch
Collecting torch==0.2.1
ERROR: HTTP error 403 while getting http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
ERROR: Could not install requirement torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl because of error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
ERROR: Could not install requirement torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl because of HTTP error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl for URL http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
trains_agent: ERROR: Could not download wheel name of "http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl"
ERROR: Double requirement given: torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl (from -r /tmp/cached-reqsipcp8nfs.txt (line 2)) (already in torch==1.3.1+cpu from file:///home/H4dr1en/.trains/pip-download-cache/cu0/torch-1.3.1%2Bcpu-cp37-cp37m-linux_x86_64.whl (from -r /tmp/cached-reqsipcp8nfs.txt (line 1)), name='torch')
trains_agent: ERROR: Could not install task requirements!
Command '['/home/H4dr1en/.trains/venvs-builds/3.7/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsipcp8nfs.txt']' returned non-zero exit status 1.
DONE: Running task 'c63fc150ff5049c4939cd6a37f3d30a8', exit status 1
System: Linux Debian 9
CUDA: not installed (no GPU)
Hi @H4dr1en, Torch is a special case for trains-agent. Since the good people of PyTorch maintain separate packages for different CUDA versions, trains-agent automatically selects the correct package based on the installed CUDA version.
Specifically, it seems that you are running without a GPU, so the CUDA version is 0. It finds the correct package for torch==1.3.1 but fails on torchvision. The thing is, it tries to download "torch", not "torchvision"... Let me see if I can reproduce this behavior.
EDIT:
@H4dr1en, what trains-agent version are you using?
Which package manager is trains-agent using? (see example here)
What is the pip version limit configured in trains.conf? (see example here)
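To make the symptom in the log concrete, here is a rough Python sketch of how a URL shaped like the failing one could be assembled. This is not trains-agent's actual code: `build_wheel_url` and its parameters are hypothetical, the tag defaults are simply copied from the log, and the local `+cpu` version suffix is ignored. It only illustrates that reusing the name "torch" with the torchvision version reproduces the URL that returned the 403.

```python
def build_wheel_url(package, version, cuda_tag="cu0", py_tag="cp37",
                    platform="linux_x86_64",
                    base="http://download.pytorch.org/whl"):
    """Hypothetical helper: build a wheel URL shaped like the ones in the log."""
    filename = f"{package}-{version}-{py_tag}-{py_tag}m-{platform}.whl"
    return f"{base}/{cuda_tag}/{filename}"


requirements = [("torch", "1.3.1"), ("torchvision", "0.2.1")]

# Suspected faulty pattern (an assumption): the name stays "torch",
# only the version changes from one requirement to the next.
suspect = [build_wheel_url("torch", version) for _, version in requirements]

# Expected pattern: every requirement keeps its own package name.
expected = [build_wheel_url(name, version) for name, version in requirements]

print(suspect[1])
print(expected[1])
```

The second entry of `suspect` is the `torch-0.2.1` URL from the log, while the second entry of `expected` points at a `torchvision-0.2.1` wheel instead.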
Hi @H4dr1en, could you test with trains-agent 0.14.2rc2?
pip install trains-agent==0.14.2rc2
I think the problem is that there is no package for torchvision==0.2.0
You can see in the full list here: https://download.pytorch.org/whl/cpu/torch_stable.html
Notice that you can just reset the experiment and edit the requirements to the correct torchvision version :)
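Checking that list by hand can be tedious, so here is a small Python sketch that extracts the versions offered for a package from a list of wheel filenames. It relies only on the standard wheel naming scheme (`name-version-python-abi-platform.whl`). The function name `available_versions` is hypothetical, and the `sample` list is made up for illustration; it is not the real contents of the index.

```python
from urllib.parse import unquote


def available_versions(wheel_names, package):
    """Return the set of versions listed for `package` (local tags like +cpu dropped)."""
    versions = set()
    for name in wheel_names:
        # Keep only the filename and decode escapes such as %2B -> +
        filename = unquote(name).rsplit("/", 1)[-1]
        if not filename.endswith(".whl"):
            continue
        parts = filename[:-len(".whl")].split("-")
        if len(parts) < 5:
            continue  # not a well-formed wheel filename
        dist, version = parts[0], parts[1]
        if dist.lower() == package.lower():
            versions.add(version.split("+", 1)[0])
    return versions


# Illustrative sample only, NOT the actual index contents.
sample = [
    "cpu/torch-1.3.1%2Bcpu-cp37-cp37m-linux_x86_64.whl",
    "cpu/torchvision-0.4.2%2Bcpu-cp37-cp37m-linux_x86_64.whl",
]

print(available_versions(sample, "torch"))
print("0.2.1" in available_versions(sample, "torchvision"))
```

If the requested version is not in the returned set, the requirement needs to be edited to one that is listed.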
With trains-agent==0.14.2rc2 it also fails:
Collecting Cython
Using cached Cython-0.29.17-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.17
Collecting torch==1.3.1+cpu
File was already downloaded /home/H4dr1en/.trains/pip-download-cache/cu0/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl
Successfully downloaded torch
Collecting torch==0.2.1
ERROR: HTTP error 403 while getting http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
ERROR: Could not install requirement torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl because of error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
ERROR: Could not install requirement torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl because of HTTP error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl for URL http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
trains_agent: ERROR: Could not download wheel name of "http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl"
ERROR: Double requirement given: torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl (from -r /tmp/cached-reqsx0eu_ber.txt (line 2)) (already in torch==1.5.0+cpu from file:///home/H4dr1en/.trains/pip-download-cache/cu0/torch-1.5.0%2Bcpu-cp37-cp37m-linux_x86_64.whl (from -r /tmp/cached-reqsx0eu_ber.txt (line 1)), name='torch')
trains_agent: ERROR: Could not install task requirements!
Command '['/home/H4dr1en/.trains/venvs-builds/3.7/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsx0eu_ber.txt']' returned non-zero exit status 1.
DONE: Running task '63d740ab6fbd4178ad55243df1c4cf07', exit status 1
I think the problem is that there is no package for torchvision==0.2.0
Would it be reasonable to install torchvision (and torch) from the PyPI repo as a fallback when trains-agent cannot infer the package from the CUDA version and the torch/torchvision version?
In any case, the error should be more meaningful. It is currently misleading, since it tries to install torch, not torchvision, with the version provided for torchvision.
Yes, you are correct. I'll make sure the error message is corrected in the next RC.
Regarding using PyPI with torch, the problem is that this is unstable. For example, there is no way of knowing whether the torchvision on PyPI is the CPU or the GPU version... Also, for the GPU version, the CUDA version changes from one torch version to another, so you end up with a driver mismatch for no good reason.
With all that said, if you know the correct version for your setup, you can simply replace torchvision==0.2.1 with a direct https link to the wheel:
https://files.pythonhosted.org/packages/ca/0d/f00b2885711e08bd71242ebe7b96561e6f6d01fdb4b9dcf4d37e2e13c5e1/torchvision-0.2.1-py2.py3-none-any.whl
This would work as long as it matches the CPU/CUDA version you are running.
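For anyone who wants to script this workaround instead of editing the requirements by hand, here is a minimal Python sketch. The helper `replace_with_wheel_url` is hypothetical and not part of trains-agent. It assumes one requirement per line and relies on pip accepting a direct wheel URL as a line in a requirements file.

```python
import re


def replace_with_wheel_url(requirements_text, package, wheel_url):
    """Swap the line that pins `package` for a direct wheel URL."""
    # Match the exact package name followed by a version operator or end of line,
    # so that "torch" does not accidentally match "torchvision".
    pattern = re.compile(
        rf"^{re.escape(package)}\s*(==|>=|<=|~=|!=|<|>|$)", re.IGNORECASE
    )
    lines = []
    for line in requirements_text.splitlines():
        lines.append(wheel_url if pattern.match(line.strip()) else line)
    return "\n".join(lines)


original = "torch==1.3.1\ntorchvision==0.2.1"
wheel = (
    "https://files.pythonhosted.org/packages/ca/0d/"
    "f00b2885711e08bd71242ebe7b96561e6f6d01fdb4b9dcf4d37e2e13c5e1/"
    "torchvision-0.2.1-py2.py3-none-any.whl"
)

patched = replace_with_wheel_url(original, "torchvision", wheel)
print(patched)
```

The torch line is left untouched, so trains-agent can still pick the CUDA-specific torch wheel while torchvision comes from the explicit link.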
Regarding using PyPI with torch, the problem is that this is unstable. For example, there is no way of knowing whether the torchvision on PyPI is the CPU or the GPU version... Also, for the GPU version, the CUDA version changes from one torch version to another, so you end up with a driver mismatch for no good reason.
Thank you for pointing that out, this definitely makes sense!
With all that said, if you know the correct version for your setup, you can simply replace torchvision==0.2.1 with a direct https link to the wheel:
Thanks for the workaround! I'll close as soon as the error is more explicit 👍
EDIT: @H4dr1en, what trains-agent version are you using?
Which package manager is trains-agent using? (see example here)
What is the pip version limit configured in trains.conf? (see example here)
trains-agent==0.14.2rc2
package manager = pip
pip version = 0.21