
NeMo RL: CUDA 13.0 fails to initialize on RTX 6000 Blackwell


Describe the bug

The NeMo RL container reports that CUDA failed to initialize, and training cannot run with CUDA 13.0 on RTX 6000 Blackwell GPUs.

Steps/Code to reproduce bug

I tried running NeMo RL with the following setup:

  • GPUs: 8x RTX 6000 Blackwell
  • CUDA: 13.0

When the NeMo RL container loads up, it outputs:

CUDA Version 12.9.0.043

And then:

ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available. [[ Initialization error (error 3) ]]

I tried running training anyway, and got this error:

RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

Expected behavior

Training should run with CUDA initialized successfully, instead of failing with the runtime error listed above.

Additional context

I can successfully run nvidia-smi both on the host machine and from within the NeMo RL container; both report CUDA 13.0.
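
For reference, a quick way to compare the two versions in play, assuming torch is importable in the container's default Python environment: nvidia-smi reports the CUDA version supported by the driver, while torch.version.cuda reports the CUDA version the installed torch wheel was built against.

nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"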

MattFeinberg · Nov 13 '25 19:11

@MattFeinberg The current NeMo RL container, if built from docker/Dockerfile, uses a CUDA 12.9 image as its base, so the container does not include CUDA 13. To use CUDA 13 you need to change the base container (indicated here) to a CUDA 13 image. You may also need to bump the torch version to 2.9 in pyproject.toml and specify --index-url=https://download.pytorch.org/whl/cu130 for torch, so that the installed torch wheel matches the CUDA version.
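
A rough sketch of what that could look like, assuming docker/Dockerfile selects its base image via a BASE_IMAGE build argument and that a CUDA 13.0 tag exists for your base image of choice; check the Dockerfile for the actual argument name and default tag, as both are placeholders here:

# Rebuild against a CUDA 13 base image (BASE_IMAGE and the tag below are placeholders)
docker build -f docker/Dockerfile \
  --build-arg BASE_IMAGE=nvcr.io/nvidia/cuda:13.0.0-devel-ubuntu22.04 \
  -t nemo-rl:cu130 .

# After bumping torch to 2.9 in pyproject.toml, install it from the cu130 wheel index
# so the wheel's CUDA version matches the new base image
pip install "torch==2.9.*" --index-url=https://download.pytorch.org/whl/cu130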

guyueh1 · Nov 24 '25 17:11