[Docs]: Documenting CUDA Toolkit installation on Linux

Open Arcitec opened this issue 4 months ago • 0 comments

I'm not sure whether this information belongs on the wiki, in the README, or somewhere else.

It can be pretty difficult to get the required CUDA Toolkit version on Linux, since it's common that only the latest version (12 at the moment) is shipped with the OS.

Users will see errors such as this (included so that people can find this thread via search):

Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory

Or this (usually when running natively via venv):

venv/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:84.)

PyTorch tries to work around these issues by using its own CUDA Toolkit libraries, which requirements.txt installs as Python packages from PyPI, but that's a bit suboptimal since some files are still missing, as seen in the examples above. (Edit: I have now tested the newest PyTorch for CUDA 12.4, and it no longer has that issue.)
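To check whether the library named in those messages is present anywhere, a small search helper can be sketched as follows. The function name `find_nvrtc` and the example paths are hypothetical, chosen only for illustration; this is a rough diagnostic, not part of OneTrainer.

```shell
# Hypothetical helper: list any libnvrtc shared objects under a directory.
# Prints nothing if the directory is missing or holds no match.
find_nvrtc() {
    find "$1" -name 'libnvrtc.so*' 2>/dev/null
}

# Example usage (paths are assumptions; adjust to your setup):
#   find_nvrtc venv
#   find_nvrtc conda_env
#   ldconfig -p | grep libnvrtc   # check the system-wide linker cache
```

An empty result from all three locations suggests the missing-library messages above are expected on that machine.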


Manually installing old CUDA Toolkits side-by-side in the OS itself is possible via NVIDIA's special installers, but can lead to issues due to older libraries overwriting some newer ones. So I don't recommend installing old CUDA Toolkits in the host OS.

The easiest solution is for users to have Conda on their system, run ./install.sh so that the OneTrainer Conda environment is created, and then execute this command inside the OneTrainer directory to install CUDA Toolkit 11.8 directly into the conda_env directory:

conda install -y --prefix "conda_env" --channel "nvidia/label/cuda-11.8.0" cuda-toolkit
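After the install finishes, it may be worth confirming that the toolkit really landed inside conda_env. The sketch below assumes the cuda-toolkit package places `nvcc` under `<prefix>/bin`; the helper name `toolkit_status` is hypothetical.

```shell
# Hypothetical helper: report whether a Conda prefix contains an nvcc binary.
# Assumes the cuda-toolkit package places nvcc under <prefix>/bin.
toolkit_status() {
    if [ -x "$1/bin/nvcc" ]; then
        echo "found"
    else
        echo "missing"
    fi
}

# Example usage from the OneTrainer directory:
#   toolkit_status conda_env
#   conda list --prefix conda_env cuda-toolkit   # show the installed package version
#   conda_env/bin/nvcc --version                 # print the compiler release
```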

Conda makes it easy to provide the necessary CUDA Toolkit versions without needing to have anything on the host OS.

11.8 is currently the newest 11.x version. The best way to see which other toolkit versions are available is to look at this list:

https://anaconda.org/nvidia/cuda/labels
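If a different version from that list is needed, the install command can be parameterized. This is a minimal sketch that assumes the other labels follow the same `cuda-X.Y.Z` pattern as `cuda-11.8.0`; check the list before relying on it. Both function names are hypothetical.

```shell
# Hypothetical helper: build the channel label for a given toolkit version.
# Assumes labels follow the "cuda-X.Y.Z" pattern used by cuda-11.8.0.
cuda_channel() {
    printf 'nvidia/label/cuda-%s\n' "$1"
}

# Hypothetical wrapper around the same install command, parameterized by version.
# Run it from inside the OneTrainer directory so "conda_env" resolves correctly.
install_cuda_toolkit() {
    conda install -y --prefix "conda_env" --channel "$(cuda_channel "$1")" cuda-toolkit
}

# Example usage:
#   install_cuda_toolkit 11.8.0
```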

Arcitec, Oct 01 '24 21:10