open-gpu-kernel-modules

CUDA device cannot be loaded from pytorch

Open • Rivers47 opened this issue 7 months ago • 1 comment

NVIDIA Open GPU Kernel Modules Version

575.51.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [x] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Fedora 41

Kernel Release

6.14.6

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 2070

Describe the bug

PyTorch cannot find the GPU in a container environment, even though nvidia-smi works and correctly shows the card.

To Reproduce

  1. Install the NVIDIA Container Toolkit. From the NVIDIA CUDA repo (https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/) I installed the following packages (not sure if all of them are necessary for containers):

kmod-nvidia-latest-dkms.x86_64    3:570.148.08-1.fc41    cuda-fedora41-x86_64
libnvidia-cfg.x86_64              3:570.148.08-1.fc41    cuda-fedora41-x86_64
libnvidia-gpucomp.x86_64          3:575.51.03-1.fc41     cuda-fedora41-x86_64
libnvidia-ml.x86_64               3:570.148.08-1.fc41    cuda-fedora41-x86_64
nvidia-driver-cuda.x86_64         3:570.148.08-1.fc41    cuda-fedora41-x86_64
nvidia-driver-cuda-libs.x86_64    3:570.148.08-1.fc41    cuda-fedora41-x86_64
nvidia-kmod-common.noarch         3:570.148.08-1.fc41    cuda-fedora41-x86_64
nvidia-modprobe.x86_64            3:575.51.03-1.fc41     cuda-fedora41-x86_64
nvidia-persistenced.x86_64        3:570.148.08-1.fc41    cuda-fedora41-x86_64

  2. Create a container (podman run --replace -it --device nvidia.com/gpu=all nvidia/cuda:12.9.0-cudnn-runtime-ubuntu24.04 /bin/bash)
  3. Install PyTorch with pip inside the official nvidia/cuda container (pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu128). Then, in the Python shell:
  4. import torch
  5. torch.cuda.is_available()
  6. The error log shows that CUDA initialization fails and no device is found.
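The check in the last steps can be gathered into one script that is easier to paste into a report. This is only a minimal sketch, not something taken from the original report: the function name cuda_report and the dictionary keys are made up for illustration, and it assumes that nvidia-smi, if present, is on PATH inside the container. It records what nvidia-smi lists, what PyTorch reports, and, when CUDA is unavailable, the exception raised by forcing initialization.

```python
import shutil
import subprocess


def cuda_report():
    """Collect what nvidia-smi and PyTorch each report about the GPU.

    The function name and the dictionary keys are hypothetical,
    chosen only for this illustration.
    """
    report = {"torch_installed": False, "nvidia_smi": None}

    # `nvidia-smi -L` lists the GPUs the driver exposes.
    # The entry stays None when the tool is not on PATH.
    smi = shutil.which("nvidia-smi")
    if smi is not None:
        try:
            proc = subprocess.run(
                [smi, "-L"], capture_output=True, text=True, timeout=30
            )
            report["nvidia_smi"] = proc.stdout.strip() or proc.stderr.strip()
        except (OSError, subprocess.SubprocessError) as exc:
            report["nvidia_smi"] = f"failed: {exc}"

    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing more to check.
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["built_for_cuda"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()

    if not report["cuda_available"]:
        # is_available() only returns False; forcing initialization
        # usually surfaces the underlying error message.
        try:
            torch.cuda.init()
        except Exception as exc:
            report["init_error"] = f"{type(exc).__name__}: {exc}"

    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Running this inside the container and again on the host may help show whether the failure is specific to the container setup or to the driver itself.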

Bug Incidence

Always

nvidia-bug-report.log.gz

After I switched to the POE driver, the problem disappeared, so I didn't have the chance to generate the log.

More Info

No response

Rivers47, May 30 '25 03:05

Related to #797

Diatrus, Jun 09 '25 00:06