PyTorch and Cuda

Open zimb3l opened this issue 1 year ago • 5 comments

Hey!

I'm trying to install the software on our RL cluster and got as far as this line:

conda install pytorch==1.11.0 pytorch-cuda=11.7 -c pytorch -c nvidia

I'm guessing this needs to be adjusted to the CUDA versions installed, which would be 12.0 and 11.2 in my case. After that comes:

pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==2.0.4 -f https://data.pyg.org/whl/torch-1.11.0+cu117.html

Now my problem is this: the CUDA versions I have installed are nowhere to be seen at https://data.pyg.org/whl/, and neither is the specified version torch-1.11.0+cu117. Do I need to install a different CUDA version, or will it run with another one? Do I have to find one for PyTorch 1.11.0, or are other versions also sufficient? If not, then according to this site I'd need to install either CUDA 10.2, 11.3 or 11.5.
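To make the naming of those wheel index pages concrete, here is a minimal sketch that builds the URL from a PyTorch version and a CUDA version. The pattern is copied from the pip command above; the helper name `pyg_wheel_index` is hypothetical, and whether a given page actually exists still has to be checked on https://data.pyg.org/whl/.

```python
def pyg_wheel_index(torch_version: str, cuda_version: str) -> str:
    """Build a data.pyg.org wheel index URL for a torch/CUDA pair.

    torch_version: e.g. "1.11.0" or "1.11.0+cu113" (a local suffix is dropped)
    cuda_version:  e.g. "11.3", the CUDA version PyTorch was built with
    """
    base = torch_version.split("+")[0]           # "1.11.0+cu113" -> "1.11.0"
    major, minor = cuda_version.split(".")[:2]   # "11.3.1" -> "11", "3"
    return f"https://data.pyg.org/whl/torch-{base}+cu{major}{minor}.html"


if __name__ == "__main__":
    # In a real environment the two arguments would typically come from
    # torch.__version__ and torch.version.cuda; they are hard-coded here.
    print(pyg_wheel_index("1.11.0", "11.3"))
```

The URL passed to `pip install -f` should then correspond to the PyTorch build that is actually installed in the environment, not to the CUDA version of the system.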

How can I proceed with the installation?

zimb3l avatar Aug 11 '23 13:08 zimb3l

The CUDA driver is backward compatible: code built against an older CUDA toolkit runs on a newer driver. I don't know what an "RL" cluster is, but most likely the NVIDIA display drivers are managed by your central cluster team.

The NVIDIA Display Driver includes a CUDA user-mode driver, which you as an end-user probably cannot change. If you run nvidia-smi on the compute node, you'll be able to see the CUDA version listed in the upper right hand corner.
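If you would rather read that value from a script than from the screen, a rough sketch could look like the following. It assumes the nvidia-smi banner contains a field of the form "CUDA Version: X.Y"; the function names are hypothetical.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_driver_cuda(smi_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field, if present."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_cuda_version() -> Optional[Tuple[int, int]]:
    """Run nvidia-smi and parse its banner; None if it is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_driver_cuda(result.stdout)


if __name__ == "__main__":
    print(driver_cuda_version())
```

This has to be run on the compute node itself, since a login node may have no GPU or a different driver.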

The "CUDA Toolkit" runtime and libraries are what you can put in the conda environment. For best results, this needs to be the same version as, or older than, the user-mode CUDA driver.

RJ3 avatar Aug 14 '23 16:08 RJ3

Hey RJ3!

Sorry, with RL I meant Rocky Linux (8), my bad.

If I'm reading your reply correctly, it should work with any installed CUDA version >= 11.7? Then it should work with the currently installed CUDA 12.2, right? I am not an end user and have the permissions to install that kind of software if needed; I am just not sure which software is required to install DiffDock.

When running the command as is

pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==2.0.4 -f https://data.pyg.org/whl/torch-1.11.0+cu117.html

I get hit with an error

The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.1). Please make sure to use the same CUDA versions.

zimb3l avatar Aug 16 '23 13:08 zimb3l

When you run nvidia-smi on the compute host, what is the CUDA version listed in the top right corner? 12.2? A lot of libraries are not compiled for CUDA 12.2 yet, but that's ok because everything is backwards compatible.

The CUDA Toolkit version in the conda env needs to match the version PyTorch was compiled with. Whatever version that is, it must be equal to or lower than the CUDA version of the NVIDIA display driver's user-mode driver (shown by nvidia-smi).
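The two conditions in that rule can be written down as a small check. This is only a sketch of the rule as stated in this reply, comparing major.minor versions; the function names are hypothetical and the version strings would have to be collected by hand (nvidia-smi, conda list, torch.version.cuda).

```python
from typing import List, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Turn '11.3' or '11.3.1' into (11, 3)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def check_cuda_stack(driver_cuda: str, toolkit_cuda: str, torch_cuda: str) -> List[str]:
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    # Condition 1: the toolkit matches the CUDA version PyTorch was compiled with.
    if major_minor(toolkit_cuda) != major_minor(torch_cuda):
        problems.append(
            f"toolkit {toolkit_cuda} does not match PyTorch's CUDA {torch_cuda}"
        )
    # Condition 2: the toolkit is not newer than what the driver supports.
    if major_minor(toolkit_cuda) > major_minor(driver_cuda):
        problems.append(
            f"toolkit {toolkit_cuda} is newer than driver CUDA {driver_cuda}"
        )
    return problems


if __name__ == "__main__":
    # Driver from nvidia-smi, toolkit from the conda env, CUDA from PyTorch.
    print(check_cuda_stack("12.2", "11.3.1", "11.3"))
```

For example, a 12.2 driver with an 11.3 toolkit and a PyTorch build for CUDA 11.3 passes both conditions, while a 12.1 toolkit with the same PyTorch build fails the first one.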

RJ3 avatar Aug 16 '23 13:08 RJ3

When running nvidia-smi, the top right shows version 12.2. When I execute the installation commands as given, the process fails with the following output during the installation of torch-spline-conv:

RuntimeError:
      The detected CUDA version (12.1) mismatches the version that was used to compile
      PyTorch (11.3). Please make sure to use the same CUDA versions.
      
      [end of output]

The CUDA "user mode" versions installed on the system are 11.2, 12.1 & 12.2

I tried installing cudatoolkit version 11.3 in the environment, and that was successful, but the build still uses version 12.1, and I am still stuck with the error mentioned above.

I haven't found out why exactly conda uses CUDA version 12.1 by default or how to change that. Listing the cudatoolkit versions gives this:

(diffdock) conda list cudatoolkit
# packages in environment at /software/anaconda/envs/diffdock:
#
# Name                    Version                   Build  Channel
cudatoolkit               11.3.1               ha36c431_9    nvidia
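One possible explanation, offered as an assumption rather than something confirmed in this thread: when pip has to compile torch-spline-conv from source, the "detected CUDA version" likely comes from the nvcc compiler that the build locates on the system, not from the conda cudatoolkit package, which ships runtime libraries only. The sketch below assumes the lookup order CUDA_HOME, then CUDA_PATH, then nvcc on PATH, then /usr/local/cuda; the function names are hypothetical, and the "release X.Y" pattern is the one printed by `nvcc --version`.

```python
import os
import re
import shutil
from typing import Callable, Mapping, Optional, Tuple


def guess_cuda_home(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Guess which CUDA install a source build would use (assumed order)."""
    home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if home:
        return home
    nvcc = which("nvcc")
    if nvcc:
        # <cuda_home>/bin/nvcc -> <cuda_home>
        return os.path.dirname(os.path.dirname(nvcc))
    return "/usr/local/cuda"


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(guess_cuda_home())
```

If this prints a 12.1 install, then either pointing CUDA_HOME at a toolkit that matches the PyTorch build, or using prebuilt wheels that match the installed PyTorch so that nothing is compiled, would be the things to try.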

Could you tell me how I need to change the given commands to get this installation process to work? Alternatively, as a last resort, I could also install another CUDA version on the system, but I'd need to be sure whether I should use 11.3 or 11.7 for that.

zimb3l avatar Aug 24 '23 11:08 zimb3l

You can check my solution here!

asarigun avatar Jan 12 '24 10:01 asarigun