
torch version inference logic broken when torchvision is specified

Open H4dr1en opened this issue 4 years ago • 5 comments

If I start an experiment with the following requirements defined in the UI:

torch==1.3.1

The installation works well. But if I use the following requirements:

torch==1.3.1
torchvision==0.2.1

Then it fails: after installing torch==1.3.1, it tries to install torch==0.2.1. Perhaps the torchvision version is being parsed incorrectly?

Here is the full log of the error:

Requirement already up-to-date: pip==20.1 in /home/H4dr1en/.trains/venvs-builds/3.7/lib/python3.7/site-packages (20.1)
Collecting Cython
  Using cached Cython-0.29.17-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.17
Collecting torch==1.3.1+cpu
  File was already downloaded /home/H4dr1en/.trains/pip-download-cache/cu0/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl
Successfully downloaded torch
Collecting torch==0.2.1
  ERROR: HTTP error 403 while getting http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
  ERROR: Could not install requirement torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl because of error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
ERROR: Could not install requirement torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl because of HTTP error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl for URL http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
trains_agent: ERROR: Could not download wheel name of "http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl"
ERROR: Double requirement given: torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl (from -r /tmp/cached-reqsipcp8nfs.txt (line 2)) (already in torch==1.3.1+cpu from file:///home/H4dr1en/.trains/pip-download-cache/cu0/torch-1.3.1%2Bcpu-cp37-cp37m-linux_x86_64.whl (from -r /tmp/cached-reqsipcp8nfs.txt (line 1)), name='torch')
trains_agent: ERROR: Could not install task requirements!
Command '['/home/H4dr1en/.trains/venvs-builds/3.7/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsipcp8nfs.txt']' returned non-zero exit status 1.
DONE: Running task 'c63fc150ff5049c4939cd6a37f3d30a8', exit status 1

System: Linux Debian 9
CUDA: not installed (no GPU)

H4dr1en avatar May 05 '20 15:05 H4dr1en

Hi @H4dr1en, torch is a special case for trains-agent. Since the good people of PyTorch maintain separate packages for different CUDA versions, trains-agent automatically selects the correct package based on the installed CUDA version.

Specifically, it seems you are running without a GPU, so the CUDA version is 0. It finds the correct package for torch==1.3.1 but fails on torchvision. The thing is, it tries to download "torch", not "torchvision"... Let me see if I can reproduce this behavior.
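To make the symptom concrete, here is a minimal sketch of how a per-CUDA wheel URL could be derived from a pinned requirement. This is not the trains-agent implementation: `Requirement`, `parse_requirement`, and `wheel_url` are hypothetical names, the URL shape is copied from the failing lines in the log above, and the `+cpu` local-version suffix seen on the cached torch wheel is ignored for simplicity. The point is only that the package name in the URL should come from the same requirement line as the version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    # Hypothetical container for one pinned requirement line.
    name: str
    version: str


def parse_requirement(line: str) -> Requirement:
    """Split a pinned 'name==version' line into its two parts."""
    name, sep, version = line.strip().partition("==")
    if not sep or not name or not version:
        raise ValueError(f"expected 'name==version', got {line!r}")
    return Requirement(name=name, version=version)


def wheel_url(req: Requirement, cuda_tag: str = "cu0", py_tag: str = "cp37") -> str:
    """Build a URL shaped like the ones in the log (assumed pattern, Linux x86_64 only)."""
    filename = f"{req.name}-{req.version}-{py_tag}-{py_tag}m-linux_x86_64.whl"
    return f"http://download.pytorch.org/whl/{cuda_tag}/{filename}"


for line in ["torch==1.3.1", "torchvision==0.2.1"]:
    # Name and version are taken from the same line, so they cannot be mixed up.
    print(wheel_url(parse_requirement(line)))
```

With name and version kept together like this, the second requirement would produce a `torchvision-0.2.1-...` URL rather than the `torch-0.2.1-...` URL that appears in the log.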

EDIT: @H4dr1en,
What trains-agent version are you using?
Which package manager is trains-agent using? (see example here)
What is the pip version limit configured in trains.conf? (see example here)

bmartinn avatar May 05 '20 19:05 bmartinn

Hi @H4dr1en, could you test with trains-agent 0.14.2rc2?

pip install trains-agent==0.14.2rc2

I think the problem is that there is no package for torchvision==0.2.0. You can see the full list here: https://download.pytorch.org/whl/cpu/torch_stable.html

Notice that you can just reset the experiment and edit the requirements to the correct torchvision version :)

bmartinn avatar May 05 '20 20:05 bmartinn

With trains-agent==0.14.2rc2 it also fails:

Collecting Cython
  Using cached Cython-0.29.17-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.17
Collecting torch==1.3.1+cpu
  File was already downloaded /home/H4dr1en/.trains/pip-download-cache/cu0/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl
Successfully downloaded torch
Collecting torch==0.2.1
  ERROR: HTTP error 403 while getting http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
  ERROR: Could not install requirement torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl because of error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
ERROR: Could not install requirement torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl because of HTTP error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl for URL http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
trains_agent: ERROR: Could not download wheel name of "http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl"
ERROR: Double requirement given: torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl (from -r /tmp/cached-reqsx0eu_ber.txt (line 2)) (already in torch==1.5.0+cpu from file:///home/H4dr1en/.trains/pip-download-cache/cu0/torch-1.5.0%2Bcpu-cp37-cp37m-linux_x86_64.whl (from -r /tmp/cached-reqsx0eu_ber.txt (line 1)), name='torch')
trains_agent: ERROR: Could not install task requirements!
Command '['/home/H4dr1en/.trains/venvs-builds/3.7/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsx0eu_ber.txt']' returned non-zero exit status 1.
DONE: Running task '63d740ab6fbd4178ad55243df1c4cf07', exit status 1

> I think the problem is that there is no package for torchvision==0.2.0

Would it be reasonable to install torchvision (and torch) using pypi repo as a fallback when trains-agent cannot infer the package based on the version of CUDA and torch/torchvision?

In any case, the error should be more meaningful. It is currently misleading, since it tries to install torch rather than torchvision, using the version specified for torchvision.
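As an illustration of what a more explicit error could look like, here is a minimal sketch that checks a requested pin against a table of known versions before any download is attempted. The function `check_available` and the `available` table are hypothetical, and the versions in the table are placeholders for illustration, not the actual contents of the PyTorch wheel index.

```python
def check_available(name, version, available):
    """Raise an error that names the requested package if no wheel is listed for it."""
    versions = available.get(name, set())
    if version not in versions:
        known = ", ".join(sorted(versions)) or "none"
        raise LookupError(
            f"No wheel found for {name}=={version} (listed versions: {known})"
        )


# Placeholder table, for illustration only.
available = {"torch": {"1.3.1"}, "torchvision": {"0.4.2"}}

check_available("torch", "1.3.1", available)  # passes silently
try:
    check_available("torchvision", "0.2.1", available)
except LookupError as err:
    # The message names torchvision, the package that was actually requested.
    print(err)
```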

H4dr1en avatar May 06 '20 06:05 H4dr1en

Yes, you are correct. I'll make sure the error message is corrected in the next RC.

Regarding using PyPI for torch: the problem is that this is unstable. For example, there is no way of knowing whether the torchvision on PyPI is the CPU or the GPU version. Also, for the GPU version, the CUDA version changes from one torch version to another, so you can end up with a driver mismatch for no good reason.

With all that said, if you know the correct version for your setup, you can simply replace torchvision==0.2.1 with a direct https link to the wheel:
https://files.pythonhosted.org/packages/ca/0d/f00b2885711e08bd71242ebe7b96561e6f6d01fdb4b9dcf4d37e2e13c5e1/torchvision-0.2.1-py2.py3-none-any.whl
This will work as long as it matches the CPU/CUDA version you are running.
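The workaround amounts to swapping one line of the requirements for a wheel URL. A minimal sketch of that substitution follows; `pin_to_wheel` is a hypothetical helper, not part of trains-agent, and in practice the same edit can simply be made by hand in the UI.

```python
def pin_to_wheel(requirements, package, wheel_url):
    """Return a copy of requirements with the pinned line for package replaced by wheel_url."""
    result = []
    for line in requirements:
        # Compare the exact package name so that "torch" never matches "torchvision".
        name = line.split("==", 1)[0].strip()
        result.append(wheel_url if name == package else line)
    return result


TORCHVISION_WHEEL = (
    "https://files.pythonhosted.org/packages/ca/0d/"
    "f00b2885711e08bd71242ebe7b96561e6f6d01fdb4b9dcf4d37e2e13c5e1/"
    "torchvision-0.2.1-py2.py3-none-any.whl"
)

requirements = ["torch==1.3.1", "torchvision==0.2.1"]
for line in pin_to_wheel(requirements, "torchvision", TORCHVISION_WHEEL):
    print(line)
```

The torch line is left untouched, so the CUDA-based selection described earlier would still apply to it; only torchvision is installed from the explicit link.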

bmartinn avatar May 06 '20 07:05 bmartinn

> Regarding using PyPI for torch: the problem is that this is unstable. For example, there is no way of knowing whether the torchvision on PyPI is the CPU or the GPU version. Also, for the GPU version, the CUDA version changes from one torch version to another, so you can end up with a driver mismatch for no good reason.

Thank you for pointing that out, this definitely makes sense!

> With all that said, if you know the correct version for your setup, you can simply replace torchvision==0.2.1 with a direct https link to the wheel:

Thanks for the workaround! I'll close as soon as the error is more explicit 👍

> EDIT: @H4dr1en, What trains-agent version are you using? Which package manager is trains-agent using? What is the pip version limit configured in trains.conf?

trains-agent==0.14.2rc2
package manager = pip
pip version = 0.21

H4dr1en avatar May 06 '20 16:05 H4dr1en