NeMo RL container fails to initialize CUDA 13.0 on RTX 6000 Blackwell
Describe the bug
The NeMo RL container reports that CUDA failed to initialize, and training cannot run with CUDA 13.0 on RTX 6000 Blackwell GPUs.
Steps/Code to reproduce bug
I tried running NeMo RL with the following setup:
- GPUs: 8xRTX 6000 Blackwell
- CUDA: 13.0
When the NeMo RL container loads up, it outputs:
CUDA Version 12.9.0.043
And then:
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available. [[ Initialization error (error 3) ]]
I tried running training anyway, and got this error:
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
Expected behavior
CUDA initializes inside the container and training runs with CUDA 13.0, instead of failing with the runtime error listed above.
Additional context
I can successfully run nvidia-smi both on the host machine and from within the NeMo RL container; both report CUDA 13.0.
@MattFeinberg If you built the current NeMo RL container from docker/Dockerfile, it uses a CUDA 12.9 image as its base, so the container itself does not ship CUDA 13. To use CUDA 13 you need to change the base container (indicated here) to a CUDA 13 image. You may also need to bump the torch version to 2.9 in pyproject.toml and specify --index-url=https://download.pytorch.org/whl/cu130 for torch, so that the installed torch build matches the container's CUDA version. A rough sketch of both changes is below.
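As a minimal sketch only (this is not the actual NeMo RL docker/Dockerfile, and the base-image tag and torch pin are assumptions to verify against what is currently published), the change amounts to swapping the base image to a CUDA 13 variant and installing torch from the cu130 wheel index:

```dockerfile
# Hypothetical minimal sketch, not the real docker/Dockerfile:
# the base-image tag and torch version are assumptions -- adjust them to the
# CUDA 13 image tag and torch release you actually target.
FROM nvidia/cuda:13.0.0-devel-ubuntu24.04

# Python plus venv tooling (Ubuntu 24.04's system Python is externally managed).
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-venv && \
    rm -rf /var/lib/apt/lists/*

# Install a torch build compiled against CUDA 13.0 from the cu130 wheel index,
# mirroring the torch>=2.9 / --index-url change suggested for pyproject.toml.
RUN python3 -m venv /opt/venv && \
    /opt/venv/bin/pip install \
        --index-url=https://download.pytorch.org/whl/cu130 "torch>=2.9"
ENV PATH=/opt/venv/bin:$PATH
```

The key point is that the CUDA toolkit baked into the image, not the driver, determines which torch wheel works: nvidia-smi reports the driver's maximum supported CUDA version, which is why it can show 13.0 on the host and in the container even while the container's own CUDA runtime is 12.9.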