DiffDock
PyTorch and Cuda
Hey!
I'm trying to install the software on our RL cluster and got as far as this line:
conda install pytorch==1.11.0 pytorch-cuda=11.7 -c pytorch -c nvidia
I'm guessing this needs to be adjusted to the CUDA versions installed, which would be 12.0 and 11.2 in my case. After that comes:
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==2.0.4 -f https://data.pyg.org/whl/torch-1.11.0+cu117.html
Now my problem is this: the CUDA versions I have installed are nowhere to be seen at https://data.pyg.org/whl/, and neither is the specified version torch-1.11.0+cu117. Do I need to install a different CUDA version, or will it run with another one? Do I have to find a build for PyTorch 1.11.0, or are other versions also sufficient? If not, then according to this site I'd need to install either CUDA 10.2, 11.3 or 11.5.
How can I proceed with the installation?
CUDA libraries are backward compatible: a newer driver can run software built against an older CUDA Toolkit. I don't know what an "RL" cluster is, but most likely the NVIDIA display drivers are managed by your central cluster team.
The NVIDIA Display Driver includes a CUDA user-mode driver, which you as an end-user probably cannot change. If you run nvidia-smi on the compute node, you'll be able to see the CUDA version listed in the upper right hand corner.
The "CUDA Toolkit" runtime and libraries are what you can put in the conda environment. For the best results, the toolkit needs to be the same version as, or older than, the CUDA version of the user-mode driver.
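The rule above can be checked mechanically. What follows is only a minimal sketch and not part of DiffDock: it assumes the nvidia-smi banner contains a field of the form "CUDA Version: X.Y", the banner string is an illustrative sample, and the helper names parse_driver_cuda, parse_version and toolkit_is_supported are made up for this example.

```python
import re


def parse_driver_cuda(smi_text):
    """Return the (major, minor) CUDA version found in nvidia-smi banner text, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_version(text):
    """Turn '11.3' or '11.3.1' into (11, 3); only major.minor matter here."""
    parts = text.split(".")
    return int(parts[0]), int(parts[1])


def toolkit_is_supported(toolkit, driver):
    """True if the toolkit is the same version as, or older than, the driver's CUDA."""
    return parse_version(toolkit) <= driver


# Illustrative sample line; real input would come from running nvidia-smi on the compute node.
banner = "| NVIDIA-SMI 535.54.03    Driver Version: 535.54.03    CUDA Version: 12.2     |"
driver = parse_driver_cuda(banner)
print(driver, toolkit_is_supported("11.3.1", driver))
```

Under these assumptions, a conda cudatoolkit of 11.3.1 passes the check against a driver reporting 12.2, while a toolkit newer than 12.2 would not.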
Hey RJ3!
Sorry, by RL I meant Rocky Linux (8), my bad.
If I'm reading your reply correctly, it should work with any installed CUDA version >= 11.7, so it should also work with the currently installed CUDA 12.2, right? I am not an end user and have the permissions to install that kind of software if needed. I am just not sure which software is required to install DiffDock.
When running the command as is:
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==2.0.4 -f https://data.pyg.org/whl/torch-1.11.0+cu117.html
I get hit with an error:
The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.1). Please make sure to use the same CUDA versions.
When you run nvidia-smi on the compute host, what is the CUDA version listed in the top right corner? 12.2? A lot of libraries are not compiled for CUDA 12.2 yet, but that's ok because everything is backwards compatible.
The CUDA Toolkit version in the conda env needs to match the version PyTorch was compiled with. Whatever version that is, it must be equal to or lower than the CUDA version of the NVIDIA display driver's user-mode driver (shown by nvidia-smi).
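To make the matching concrete: the wheel index URLs quoted in this thread follow the pattern torch-<version>+cu<digits>.html, so the page to pass to pip -f can be derived from the PyTorch build that is actually installed. This is a minimal sketch, assuming that pattern holds for the pair in question. Not every combination is published, so the resulting page should still be checked at https://data.pyg.org/whl/. The helper name pyg_wheel_index is hypothetical.

```python
def pyg_wheel_index(torch_version, cuda_version):
    """Build the wheel index URL for a PyTorch version and its compile-time CUDA version."""
    base_version = torch_version.split("+")[0]  # drop a local suffix such as "+cu113"
    major, minor = cuda_version.split(".")[:2]
    return f"https://data.pyg.org/whl/torch-{base_version}+cu{major}{minor}.html"


# In a real environment the two inputs would be torch.__version__ and torch.version.cuda.
print(pyg_wheel_index("1.11.0", "11.3"))
```

Passing the values reported by the installed PyTorch, rather than the ones hard-coded in the install instructions, keeps the wheel page in step with the CUDA version PyTorch was compiled with.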
Running nvidia-smi shows version 12.2 in the top right. When I execute the installation commands as given, the installation of torch-spline-conv fails with the following output:
RuntimeError:
The detected CUDA version (12.1) mismatches the version that was used to compile
PyTorch (11.3). Please make sure to use the same CUDA versions.
[end of output]
The CUDA "user mode" versions installed on the system are 11.2, 12.1 and 12.2.
I tried installing cudatoolkit version 11.3 in the environment, and that was successful, but the build still uses version 12.1 and I am still stuck with the error mentioned above.
I haven't found out why exactly conda uses CUDA version 12.1 by default or how to change that. Listing the cudatoolkit versions gives this:
(diffdock) conda list cudatoolkit
# packages in environment at /software/anaconda/envs/diffdock:
#
# Name                    Version                   Build  Channel
cudatoolkit               11.3.1               ha36c431_9    nvidia
Could you tell me how I need to change the given commands to get this installation process to work? Alternatively, as a last resort, I could install another CUDA version on the system, but I'd need to know whether to use 11.3 or 11.7 for that.
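One possible explanation, offered as an assumption and not a confirmed diagnosis: when pip finds no matching prebuilt wheel, it compiles these packages from source, and the build then uses whichever nvcc compiler it locates on the system, not the runtime libraries from the conda cudatoolkit package. The sketch below only reports what such a build would be likely to see. The helper names parse_nvcc_release and describe_compiler are made up for illustration.

```python
import os
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract 'X.Y' from the text printed by nvcc --version, or None if absent."""
    match = re.search(r"release (\d+\.\d+)", version_text)
    return match.group(1) if match else None


def describe_compiler():
    """Report CUDA_HOME, the nvcc found on PATH, and its release (None where unavailable)."""
    nvcc = shutil.which("nvcc")
    release = None
    if nvcc is not None:
        result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        release = parse_nvcc_release(result.stdout)
    return {"CUDA_HOME": os.environ.get("CUDA_HOME"), "nvcc": nvcc, "release": release}


print(describe_compiler())
```

If the reported release differs from the CUDA version PyTorch was compiled with (torch.version.cuda), a source build would be expected to hit the mismatch error quoted above. In that case, installing from a wheel page that matches the installed PyTorch, or pointing CUDA_HOME at a toolkit of the matching version, are the options to try.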
You can check my solution here!