
CUDA_ERROR_NO_DEVICE "no CUDA-capable device is detected"

Open EricLBuehler opened this issue 1 year ago • 10 comments

Hello all,

Thanks for your great work here! When I run using cudarc, I get the error:

called `Result::unwrap()` on an `Err` value: Cuda(Cuda(DriverError(CUDA_ERROR_NO_DEVICE, "no CUDA-capable device is detected")))
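
For reference, the error surfaces as soon as the first device handle is created. A minimal sketch along these lines reproduces it (assuming cudarc 0.11's CudaDevice::new API):

use cudarc::driver::CudaDevice;

fn main() {
    // CUDA_ERROR_NO_DEVICE surfaces here, before any kernel work:
    let _dev = CudaDevice::new(0).unwrap();
    println!("CUDA device 0 initialized");
}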

Here is my system information:

$ nvidia-smi
Tue Jun 11 23:53:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.72                 Driver Version: 536.45       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro M2000M                  On  | 00000000:01:00.0 Off |                  N/A |
| N/A    0C    P8              N/A / 200W |      0MiB /  4096MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        33      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
5.0

$ echo $CUDA_VISIBLE_DEVICES
0

I would appreciate any help!

EricLBuehler avatar Jun 12 '24 03:06 EricLBuehler

Is PyTorch able to see the GPU? Also, what CUDA toolkit version is being targeted by cudarc (if using cuda-version-from-build-system, is it being compiled on this machine?)
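
A quick way to check the first part:

$ python -c "import torch; print(torch.cuda.is_available())"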

chelsea0x3b avatar Jun 12 '24 20:06 chelsea0x3b

@EricLBuehler any more information on this issue? Will close in a week if not

chelsea0x3b avatar Jul 16 '24 19:07 chelsea0x3b

@coreylowman sorry for not getting back! I am running this on my GPU and PyTorch can see it (torch.cuda.is_available() == True).

EricLBuehler avatar Jul 16 '24 19:07 EricLBuehler

@EricLBuehler are there any differences with dynamic loading vs dynamic linking features for cudarc? Also curious about what toolkit version you are targeting in cudarc features

chelsea0x3b avatar Jul 16 '24 19:07 chelsea0x3b

I am using cuda-version-from-build-system and dynamic-linking. How should I try dynamic loading?

EricLBuehler avatar Jul 16 '24 19:07 EricLBuehler

If you don't enable the dynamic-linking feature it will use dynamic loading.

🤔 Could you try targeting 12.2 (cuda-12020) instead of version from build system? Just curious if that would change anything.
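
i.e. something along these lines in Cargo.toml (a sketch; feature names as of cudarc 0.11 — leaving dynamic-linking out of the list means libcuda gets loaded at runtime with dlopen instead of linked at build time):

cudarc = { version = "0.11", default-features = false, features = ["std", "driver", "cuda-12020"] }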

chelsea0x3b avatar Jul 16 '24 19:07 chelsea0x3b

Hmm yeah, same error. Current:

cudarc = { version = "0.11.5", features = ["std", "cublas", "cublaslt", "curand", "driver", "nvrtc", "f16", "cuda-12020"], default-features=false }

EricLBuehler avatar Jul 16 '24 19:07 EricLBuehler

I got nothing off the top of my head. Do you get this error if you git clone cudarc and try to run the unit tests?

cargo test --tests --no-default-features -F std,cuda-12050,driver

Is this running inside a docker container?

If that doesn't work I'd probably try to drop to the C++ level and verify that a simple example there that links to CUDA can find the GPU. If that doesn't work, then that at least tells us that PyTorch is doing something special that we need to copy.
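
The Rust-side equivalent of that check is a raw FFI probe that bypasses cudarc entirely (a sketch; it assumes libcuda is on the linker search path, e.g. via -L /usr/local/cuda/lib64/stubs):

#[link(name = "cuda")]
extern "C" {
    fn cuInit(flags: u32) -> i32;
    fn cuDeviceGetCount(count: *mut i32) -> i32;
}

fn main() {
    unsafe {
        // 0 == CUDA_SUCCESS, 100 == CUDA_ERROR_NO_DEVICE
        println!("cuInit -> {}", cuInit(0));
        let mut n = 0i32;
        let rc = cuDeviceGetCount(&mut n);
        println!("cuDeviceGetCount -> {rc} (devices = {n})");
    }
}

If cuInit already returns 100 here, the problem is below cudarc (driver or container setup) rather than in this crate.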

chelsea0x3b avatar Jul 16 '24 20:07 chelsea0x3b

Hi both, I also have a similar error:

DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted

[jzhao399@atl1-1-02-018-25-0 release]$ which nvidia-smi
/usr/bin/nvidia-smi

[jzhao399@atl1-1-02-018-25-0 release]$ nvidia-smi
Wed Jul 17 11:25:54 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:C1:00.0 Off |                    0 |
| N/A   34C    P0             43W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Via PyTorch this can be solved, but I am not sure how to solve it here.

Thanks,

Jianshu

jianshu93 avatar Jul 17 '24 15:07 jianshu93

FWIW, PyTorch bundles the CUDA runtime into the fat binaries they compile, AFAIK? So that'd be more of a static build, vs cudarc here relying on dynamic linking?

AFAIK (and I don't know much on the topic), with docker containers your project needs:

  • Container: CUDA runtime libs (where cudarc and PyTorch package would differ)
  • Host: The supporting driver which nvidia-smi interacts with.

The container is then run with some extra config to add GPU support, which mounts some extra libs/devices (and is what makes nvidia-smi work within the container, IIRC).
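
For example (using NVIDIA's stock CUDA runtime image; --gpus all is the flag that wires the host driver libs into the container):

$ docker run --rm --gpus all nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi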

polarathene avatar Mar 19 '25 21:03 polarathene

NVIDIA-SMI 535.72   Driver Version: 536.45   CUDA Version: 12.2
Quadro M2000M

Cuda compilation tools, release 12.5, V12.5.40

Are you still able to reproduce this issue?

  • It looks like you had the kernel and CUDA runtime driver correctly aligned on the system (nvidia-smi output), but were building with CUDA 12.5 (nvcc --version)?
  • The Quadro M2000M GPU is a Maxwell GM107 model, limited to CC 5.0 / sm_50. There shouldn't be any CUDA compat issues there with the different CUDA 12.x versions? 🤔
  • Your follow-up comment confirmed the same failure when building cudarc against the CUDA 12.2 target, though that build used dynamic loading rather than switching to dynamic linking.

The reproduction conditions weren't entirely clear. Potentially it was due to compiling with NVCC from CUDA 12.5 and a lack of the compat package on the runtime host (CUDA 12.2)?


This explanation notes that you should be fine when nvcc builds for an older version of the CUDA runtime than your system is running. But when it's the other way around, you can run into problems and need to install the compat packages instead.
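
The same skew would explain the CUDA_ERROR_INVALID_PTX report above: PTX emitted by a newer toolkit gets rejected by an older driver's JIT compiler. A quick way to see which CUDA version the driver's JIT actually supports (raw FFI sketch, same linking caveats as the probe earlier in the thread):

#[link(name = "cuda")]
extern "C" {
    fn cuDriverGetVersion(version: *mut i32) -> i32;
}

fn main() {
    let mut v = 0i32;
    unsafe { cuDriverGetVersion(&mut v) };
    // e.g. 12040 => CUDA 12.4; PTX from a newer toolchain will fail
    // to JIT unless a compat package bumps this.
    println!("driver supports CUDA {}.{}", v / 1000, (v % 1000) / 10);
}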

For additional clarity:

$ docker run --rm -it --gpus all fedora:41

# Just like when running on my container host (WSL2):
$ nvidia-smi --version
NVIDIA-SMI version  : 550.54.14
NVML version        : 550.54
DRIVER version      : 551.78
CUDA Version        : 12.4

$ nvcc --version
bash: nvcc: command not found

# Install nvidia's CUDA repo for Fedora 41:
$ dnf config-manager addrepo --from-repofile https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/cuda-fedora41.repo

# Install NVCC with CUDA 12.9:
$ dnf install -yq cuda-nvcc-12-9

$ /usr/local/cuda-12.9/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0

Now watch what happens for nvidia-smi output when I use the compat package for CUDA 12.9:

$ dnf install -yq cuda-compat-12-9

# Use the compat libs instead (now the CUDA runtime version is bumped):
$ LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat nvidia-smi --version
NVIDIA-SMI version  : 550.54.14
NVML version        : 550.54
DRIVER version      : 551.78
CUDA Version        : 12.9

# For reference here are the files the compat package is providing:
$ ls /usr/local/cuda-12.9/compat
libcuda.so    libcuda.so.575.57.08  libcudadebugger.so.575.57.08  libnvidia-nvvm.so.575.57.08  libnvidia-pkcs11-openssl3.so.575.57.08  libnvidia-ptxjitcompiler.so.575.57.08
libcuda.so.1  libcudadebugger.so.1  libnvidia-nvvm.so.4           libnvidia-nvvm70.so.4        libnvidia-ptxjitcompiler.so.1

Additional references:

  • https://en.wikipedia.org/wiki/CUDA#GPUs_supported
  • https://docs.nvidia.com/deploy/cuda-compatibility/#use-the-right-cuda-forward-compatibility-package

polarathene avatar Jun 18 '25 05:06 polarathene

Closing since this is stale and likely a toolkit installation issue. Please reopen with additional details.

chelsea0x3b avatar Nov 01 '25 22:11 chelsea0x3b