Specific GPU devices cannot be selected for multi-GPU training
For single-GPU training, I can export CUDA_VISIBLE_DEVICES and train headless on the specified device.
But for multi-GPU training, if I don't export CUDA_VISIBLE_DEVICES and instead pass --devices (or set it in train.py), it still uses GPUs 0 and 1.
Conversely, if I do set CUDA_VISIBLE_DEVICES, no GPUs become active.
`python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 scripts/rsl_rl/train.py \
    --task=xxx --headless --distributed`
And with CUDA_VISIBLE_DEVICES=0,5, only cuda:0 shows as active.
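For what it's worth, one thing to check: CUDA_VISIBLE_DEVICES renumbers the visible GPUs inside the process, so with CUDA_VISIBLE_DEVICES=0,5 the two physical GPUs appear as cuda:0 and cuda:1, and each worker should pick its device from the LOCAL_RANK environment variable that torch.distributed.run exports, not from the physical index. A minimal sketch (pick_device is a hypothetical helper, not part of the Isaac Lab API):

```python
import os

def pick_device(env=os.environ):
    """Select this worker's CUDA device string from LOCAL_RANK.

    torch.distributed.run exports LOCAL_RANK for each spawned worker.
    With CUDA_VISIBLE_DEVICES=0,5 the two physical GPUs are renumbered
    inside the process as cuda:0 and cuda:1, so each rank should use
    cuda:<LOCAL_RANK> rather than the physical GPU index (e.g. 5).
    """
    return f"cuda:{int(env.get('LOCAL_RANK', '0'))}"
```

Under these assumptions, rank 1 gets "cuda:1", which maps to physical GPU 5; if the training script instead hard-codes physical indices, the second device will never be used.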
Any solutions?
`2025-11-13 10:17:55 [4,939ms] [Error] [gpu.foundation.plugin] No device could be created. Some known system issues:
- The driver is not installed properly and requires a clean re-install.
- Your GPUs do not support RayTracing: DXR or Vulkan ray_tracing, or hardware is excluded due to performance.
- The driver cannot enumerate any GPU: driver, display, TCC mode or a docker issue. For Vulkan, test it with Vulkaninfo tool from Vulkan SDK, instead of nvidia-smi.
- For Ubuntu, it requires server-xorg-core 1.20.7+ and a display to work without --no-window.
- For Linux dockers, the setup is not complete. Install the latest driver, xServer and NVIDIA container runtime.`
It seems there's an issue with your driver. In your screenshot, the top line says your CUDA is in a bad state. Try reinstalling your driver, or updating to a newer version, to see if that helps.