VLMEvalKit

Distributed training fails due to GPU reassignment logic conflicting with torchrun

Open · snowclipsed opened this issue 7 months ago • 2 comments

In run.py, PR #991 introduced GPU assignment logic that conflicts with torchrun's automatic GPU distribution.

I ran the following command on a 2x RTX 4090 setup:

torchrun --nproc-per-node=2 run.py --data CountBenchQA --model Moondream2

and got the following error:

[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 217 'peer access is not supported between these two devices'
[rank1]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank1]: Last error:
[rank1]: Cuda failure 217 'peer access is not supported between these two devices'

When running torchrun --nproc-per-node=2, each process should get its own GPU (GPU 0 and GPU 1): torchrun correctly assigns CUDA_VISIBLE_DEVICES=0 to process 0 and CUDA_VISIBLE_DEVICES=1 to process 1. However, I believe the custom GPU assignment code overrides this. It reads the variable, sees only 1 GPU per process, does integer division (1 ÷ 2 = 0), and sets CUDA_VISIBLE_DEVICES="" for both processes. Both processes then end up defaulting to GPU 0, causing the NCCL peer-to-peer communication errors (a sketch of this is below).
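
To make the failure mode concrete, here is a minimal sketch of what I believe the reassignment boils down to (the function name and structure are mine for illustration, not the exact code from PR #991):

import os

def reassign_gpus(local_rank: int, local_world_size: int) -> str:
    # torchrun has already restricted this process to a single GPU, e.g. "0" or "1"
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    devices = [d for d in visible.split(",") if d]

    # On a 2-GPU node each process sees exactly 1 device, so 1 // 2 == 0
    gpus_per_proc = len(devices) // local_world_size

    start = local_rank * gpus_per_proc
    assigned = devices[start:start + gpus_per_proc]  # empty slice when gpus_per_proc == 0

    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(assigned)  # becomes "" on both ranks
    return os.environ["CUDA_VISIBLE_DEVICES"]

With CUDA_VISIBLE_DEVICES=0 and a local world size of 2 this returns an empty string; both ranks then fall back to GPU 0, which is exactly when NCCL reports that peer access is not supported.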

For multi-node scenarios this logic should still work. Say we pass --nproc-per-node=2 --nnodes=4 with 8 GPUs visible per node: each process sees 8 GPUs, the integer division gives 8 ÷ 2 = 4, and each process is correctly assigned 4 GPUs. But on a simple 2-GPU setup, each process only sees 1 GPU, so 1 ÷ 2 = 0 GPUs get assigned (a possible guard is sketched below).
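
A possible guard (untested, just to illustrate the idea) would be to skip the reassignment whenever a process already sees fewer devices than there are local processes, leaving torchrun's per-process assignment untouched:

import os

def maybe_reassign_gpus(local_rank: int, local_world_size: int) -> None:
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    devices = [d for d in visible.split(",") if d]

    # Not enough visible devices to split further (e.g. torchrun already gave
    # this process its own GPU): keep the existing assignment as-is.
    if len(devices) < local_world_size:
        return

    gpus_per_proc = len(devices) // local_world_size
    start = local_rank * gpus_per_proc
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(devices[start:start + gpus_per_proc])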

Full Trace:

W0530 19:03:45.828000 27808 torch/distributed/run.py:766] 
W0530 19:03:45.828000 27808 torch/distributed/run.py:766] *****************************************
W0530 19:03:45.828000 27808 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0530 19:03:45.828000 27808 torch/distributed/run.py:766] *****************************************
RANK: 0, LOCAL_RANK: 0, WORLD_SIZE: 2,LOCAL_WORLD_SIZE: 2, CUDA_VISIBLE_DEVICES: 0
RANK: 1, LOCAL_RANK: 1, WORLD_SIZE: 2,LOCAL_WORLD_SIZE: 2, CUDA_VISIBLE_DEVICES: 1
[2025-05-30 19:04:07] ERROR - misc.py: load_env - 214: Did not detect the .env file at /workspace/VLMEvalKit/.env, failed to load. 
[2025-05-30 19:04:07] ERROR - misc.py: load_env - 214: Did not detect the .env file at /workspace/VLMEvalKit/.env, failed to load. 
[2025-05-30 19:04:07] WARNING - RUN - run.py: main - 210: --reuse is not set, will not reuse previous (before one day) temporary files
[rank0]:[W530 19:04:07.547649423 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
[2025-05-30 19:04:07] ERROR - misc.py: load_env - 214: Did not detect the .env file at /workspace/VLMEvalKit/.env, failed to load. 
[2025-05-30 19:04:07] ERROR - misc.py: load_env - 214: Did not detect the .env file at /workspace/VLMEvalKit/.env, failed to load. 
[rank1]:[W530 19:04:08.324392160 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 1]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/VLMEvalKit/run.py", line 504, in <module>
[rank1]:     main()
[rank1]:   File "/workspace/VLMEvalKit/run.py", line 264, in main
[rank1]:     dist.barrier()
[rank1]:   File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4622, in barrier
[rank1]:     work = group.barrier(opts=opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
[rank1]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank1]: Last error:
[rank1]: Cuda failure 217 'peer access is not supported between these two devices'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/VLMEvalKit/run.py", line 504, in <module>
[rank0]:     main()
[rank0]:   File "/workspace/VLMEvalKit/run.py", line 264, in main
[rank0]:     dist.barrier()
[rank0]:   File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4622, in barrier
[rank0]:     work = group.barrier(opts=opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 217 'peer access is not supported between these two devices'
[rank1]:[W530 19:04:09.247269242 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W530 19:04:09.360202573 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0530 19:04:10.370000 27808 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 27980 closing signal SIGTERM
E0530 19:04:10.439000 27808 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 27981) of binary: /workspace/VLMEvalKit/.venv/bin/python
Traceback (most recent call last):
  File "/workspace/VLMEvalKit/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in call
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/VLMEvalKit/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-30_19:04:10
  host      : c0d4a7c3d60b
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 27981)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

snowclipsed · May 31 '25 00:05

+1 @kennymckormick Please help

iamlockelightning · Jul 10 '25 09:07

+1 @kennymckormick Please help

countless123 · Nov 06 '25 10:11