torchrun --nproc_per_node=4 /data/zxb/DINO_FL/train/train.py
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank1]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank1]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank3]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank3]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 3] ProcessGroupNCCL preparing to dump debug info.
[rank3]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 3] [PG 0 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
W0725 08:32:26.101000 140686322804544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 651734 closing signal SIGTERM
/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
E0725 08:32:28.020000 140686322804544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 651732) of binary: /home/zhuxiaobo/anaconda3/envs/my_pytorch_env/bin/python
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in
main()
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/data/zxb/DINO_FL/train/train_stage.py FAILED
Failures:
[1]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 651733)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651733
[2]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 651735)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651735
Root Cause (first observed failure):
[0]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 651732)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651732
I encountered the same issue as you. May I ask if your problem has been resolved? I look forward to your reply.
ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives.
One of the processes got stuck while waiting on a collective operation (all_gather, etc.). It's most likely a problem with your setup rather than with the code in this repo, and it's hard to debug without seeing your code and Python environment.
Troubleshooting tips:
- Write a mini script that simply initializes distributed and runs a few collectives to check that your GPUs can talk to each other (see the sketch after this list)
- Make a list of all modifications made on top of the provided code, and check that all new collectives are executed in the same order on all ranks
- Ensure that the modified code runs on a single GPU
- Follow https://pytorch.org/docs/master/distributed.html#debugging-torch-distributed-applications
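For the first tip, a minimal sanity-check script could look like the sketch below (the file name nccl_sanity.py is just a placeholder; it assumes a single node launched with torchrun, which sets LOCAL_RANK, RANK, and WORLD_SIZE for each worker):

```python
# nccl_sanity.py -- minimal NCCL sanity check (hypothetical file name).
# Launch with: torchrun --nproc_per_node=4 nccl_sanity.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun exports LOCAL_RANK / RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT,
    # so the default env:// init works without extra arguments.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device(f"cuda:{local_rank}")

    # all_reduce: every rank contributes its rank id; after the reduction,
    # every rank should hold 0 + 1 + ... + (world_size - 1).
    x = torch.tensor([float(rank)], device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    expected = world_size * (world_size - 1) / 2
    print(f"rank {rank}: all_reduce -> {x.item()} (expected {expected})")

    # all_gather: each rank should end up with the list [0, 1, ..., world_size - 1].
    gathered = [torch.zeros(1, device=device) for _ in range(world_size)]
    dist.all_gather(gathered, torch.tensor([float(rank)], device=device))
    print(f"rank {rank}: all_gather -> {[t.item() for t in gathered]}")

    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If even this script hangs or hits the watchdog timeout, the problem is in the GPU/NCCL setup (driver, NCCL version, interconnect) rather than in the training code; rerunning it with NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL usually narrows things down. Raising TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC, as the error message suggests, only delays the abort and does not fix a genuine hang.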
Has this been resolved?