Distributed training fails due to GPU reassignment logic conflicting with torchrun
In run.py, PR #991 introduced GPU assignment logic that conflicts with torchrun's automatic GPU distribution.
I run the following command on a 2x RTX 4090 setup:
torchrun --nproc-per-node=2 run.py --data CountBenchQA --model Moondream2
and get the following error:
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 217 'peer access is not supported between these two devices'
[rank1]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank1]: Last error:
[rank1]: Cuda failure 217 'peer access is not supported between these two devices'
When running with torchrun --nproc-per-node=2, each process should get its own GPU (GPU 0 and GPU 1), but I believe the custom GPU assignment code overrides this and causes both processes to compete for the same GPU. torchrun correctly assigns CUDA_VISIBLE_DEVICES=0 to process 0 and CUDA_VISIBLE_DEVICES=1 to process 1. The code then reads this, sees only 1 GPU per process, does integer division (1 ÷ 2 = 0), and sets CUDA_VISIBLE_DEVICES="" for both processes. Both processes end up defaulting to GPU 0, which causes the NCCL peer-to-peer communication errors.
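For reference, here is a minimal sketch of the kind of splitting logic I believe is happening (the variable names and structure are my own illustration, not the actual code added in the PR):

import os

# Hypothetical reconstruction of the per-process GPU split, showing why the
# integer division collapses to zero GPUs when torchrun has already narrowed
# CUDA_VISIBLE_DEVICES to a single device per process.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "0")        # torchrun set this to "0" or "1"
gpu_ids = [g for g in visible.split(",") if g]               # one GPU visible per process
local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", "2"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

gpus_per_proc = len(gpu_ids) // local_world_size             # 1 // 2 == 0
mine = gpu_ids[local_rank * gpus_per_proc:(local_rank + 1) * gpus_per_proc]  # empty slice
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(mine)          # "" for both ranks
print(f"rank {local_rank}: gpus_per_proc={gpus_per_proc}, "
      f"CUDA_VISIBLE_DEVICES={os.environ['CUDA_VISIBLE_DEVICES']!r}")
# -> rank 0: gpus_per_proc=0, CUDA_VISIBLE_DEVICES=''  (both ranks then fall back to GPU 0)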
Multi-node scenarios would still pass: with something like --nproc-per-node=2 --nnodes=4 and 8 GPUs per node, each process sees the node's 8 GPUs and the integer division correctly hands out 4 GPUs per process (8 ÷ 2 = 4). But on a simple 2-GPU setup each process sees only 1 GPU, so 1 ÷ 2 = 0 GPUs get assigned.
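For contrast, this is roughly the per-rank device binding torchrun expects (only a sketch of the expected behaviour, not a patch for run.py; it assumes a torchrun launch and a recent PyTorch that accepts device_id in init_process_group, as the warning in the trace below suggests):

import os
import torch
import torch.distributed as dist

# Each rank simply uses the GPU it was handed instead of re-deriving a split.
local_rank = int(os.environ["LOCAL_RANK"])
# If CUDA_VISIBLE_DEVICES was narrowed to one GPU, that GPU is local index 0;
# if all GPUs are visible to every process, LOCAL_RANK picks the matching one.
device = torch.device("cuda", local_rank % torch.cuda.device_count())
torch.cuda.set_device(device)

dist.init_process_group(backend="nccl", device_id=device)  # pins NCCL to this device
dist.barrier()
dist.destroy_process_group()

Launched as torchrun --nproc-per-node=2 sketch.py, each rank stays on its own GPU and the barrier completes instead of both ranks landing on GPU 0.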
Full Trace:
W0530 19:03:45.828000 27808 torch/distributed/run.py:766]
W0530 19:03:45.828000 27808 torch/distributed/run.py:766] *****************************************
W0530 19:03:45.828000 27808 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0530 19:03:45.828000 27808 torch/distributed/run.py:766] *****************************************
RANK: 0, LOCAL_RANK: 0, WORLD_SIZE: 2,LOCAL_WORLD_SIZE: 2, CUDA_VISIBLE_DEVICES: 0
RANK: 1, LOCAL_RANK: 1, WORLD_SIZE: 2,LOCAL_WORLD_SIZE: 2, CUDA_VISIBLE_DEVICES: 1
[2025-05-30 19:04:07] ERROR - misc.py: load_env - 214: Did not detect the .env file at /workspace/VLMEvalKit/.env, failed to load.
[2025-05-30 19:04:07] ERROR - misc.py: load_env - 214: Did not detect the .env file at /workspace/VLMEvalKit/.env, failed to load.
[2025-05-30 19:04:07] WARNING - RUN - run.py: main - 210: --reuse is not set, will not reuse previous (before one day) temporary files
[rank0]:[W530 19:04:07.547649423 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
[2025-05-30 19:04:07] ERROR - misc.py: load_env - 214: Did not detect the .env file at /workspace/VLMEvalKit/.env, failed to load.
[2025-05-30 19:04:07] ERROR - misc.py: load_env - 214: Did not detect the .env file at /workspace/VLMEvalKit/.env, failed to load.
[rank1]:[W530 19:04:08.324392160 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 1] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/VLMEvalKit/run.py", line 504, in <module>
[rank1]: main()
[rank1]: File "/workspace/VLMEvalKit/run.py", line 264, in main
[rank1]: dist.barrier()
[rank1]: File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4622, in barrier
[rank1]: work = group.barrier(opts=opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
[rank1]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank1]: Last error:
[rank1]: Cuda failure 217 'peer access is not supported between these two devices'
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/VLMEvalKit/run.py", line 504, in <module>
[rank0]: main()
[rank0]: File "/workspace/VLMEvalKit/run.py", line 264, in main
[rank0]: dist.barrier()
[rank0]: File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4622, in barrier
[rank0]: work = group.barrier(opts=opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 217 'peer access is not supported between these two devices'
[rank1]:[W530 19:04:09.247269242 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W530 19:04:09.360202573 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0530 19:04:10.370000 27808 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 27980 closing signal SIGTERM
E0530 19:04:10.439000 27808 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 27981) of binary: /workspace/VLMEvalKit/.venv/bin/python
Traceback (most recent call last):
File "/workspace/VLMEvalKit/.venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, kwargs)
File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-05-30_19:04:10
host : c0d4a7c3d60b
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 27981)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
+1 @kennymckormick Please help