
GPU numbers cannot be specified for multiple GPUs

Open · Baojch opened this issue 1 month ago · 1 comment

For single-GPU training, I can export CUDA_VISIBLE_DEVICES and train headless on a specific device. But for multi-GPU training, if I don't export CUDA_VISIBLE_DEVICES and instead pass --devices (or set it in train.py), it still uses GPUs 0 and 1. And if I do use CUDA_VISIBLE_DEVICES, no GPUs are active. The launch command:

```bash
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/rsl_rl/train.py \
    --task=xxx --headless --distributed
```

And with CUDA_VISIBLE_DEVICES=0,5, only cuda:0 shows as active.
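
(Editor's note: CUDA_VISIBLE_DEVICES renumbers the devices PyTorch sees, so with CUDA_VISIBLE_DEVICES=0,5 physical GPU 5 appears as cuda:1, and each rank should pick its device by LOCAL_RANK rather than by physical index. A minimal sketch of the usual torch.distributed.run convention; this is illustrative, not the actual train.py code from the thread:)

```python
import os
import torch

# torch.distributed.run sets LOCAL_RANK for each process it spawns.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# CUDA_VISIBLE_DEVICES renumbers devices: with CUDA_VISIBLE_DEVICES=0,5,
# PyTorch enumerates exactly two devices, cuda:0 (physical GPU 0) and
# cuda:1 (physical GPU 5). Indexing by LOCAL_RANK therefore lands each
# rank on the right visible device without hard-coding physical IDs.
torch.cuda.set_device(local_rank)
print(f"rank {local_rank} -> cuda:{local_rank} "
      f"({torch.cuda.get_device_name(local_rank)})")
```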

Any solutions?

```
2025-11-13 10:17:55 [4,939ms] [Error] [gpu.foundation.plugin] No device could be created. Some known system issues:

  • The driver is not installed properly and requires a clean re-install.
  • Your GPUs do not support RayTracing: DXR or Vulkan ray_tracing, or hardware is excluded due to performance.
  • The driver cannot enumerate any GPU: driver, display, TCC mode or a docker issue. For Vulkan, test it with Vulkaninfo tool from Vulkan SDK, instead of nvidia-smi.
  • For Ubuntu, it requires server-xorg-core 1.20.7+ and a display to work without --no-window.
  • For Linux dockers, the setup is not complete. Install the latest driver, xServer and NVIDIA container runtime.
```

[screenshot attached]

Baojch · Nov 13 '25

It seems there's an issue with your driver. Your screenshot shows at the top that CUDA is in a bad state. Try reinstalling your driver, or updating to a newer version, to see if that helps.
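
(Editor's note: a quick way to rule out the CUDA side before reinstalling. The error above comes from the rendering (Vulkan/RTX) device enumeration, so a healthy result here does not rule out a Vulkan or display problem. A minimal check, not specific to Isaac Lab:)

```python
import torch

# If the driver is healthy, CUDA is available and the driver can
# enumerate the GPUs; False here points at the driver/runtime rather
# than at the Isaac Lab launch configuration.
print("CUDA available:", torch.cuda.is_available())
print("Device count: ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))
```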

PeterL-NV · Nov 13 '25