
torchrun --nproc_per_node=4 /data/zxb/DINO_FL/train/train.py


[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors. If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0). If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank1]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank1]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors. If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0). If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank3]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank3]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 3] ProcessGroupNCCL preparing to dump debug info.
[rank3]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 3] [PG 0 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors. If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0). If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
W0725 08:32:26.101000 140686322804544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 651734 closing signal SIGTERM
/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0725 08:32:28.020000 140686322804544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 651732) of binary: /home/zhuxiaobo/anaconda3/envs/my_pytorch_env/bin/python
Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/data/zxb/DINO_FL/train/train_stage.py FAILED

Failures:
[1]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 651733)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651733
[2]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 651735)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651735

Root Cause (first observed failure):
[0]:
  time      : 2024-07-25_08:32:26
  host      : user
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 651732)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 651732

Emibobo · Jul 25 '24 08:07

I encountered the same issue as you. May I ask if your problem has been resolved? I look forward to your reply.

SHAREN111 · Aug 05 '24 07:08

ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives.

One of the processes got stuck while waiting on a collective operation (all_gather, etc.). This is likely a problem on your side rather than in the code in this repo, and it's hard to debug without seeing your code and Python environment.
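To make the failure mode concrete, here is a minimal sketch (hypothetical code, not from this repo) of the kind of rank-dependent branch that produces exactly this watchdog abort: one rank issues a collective that the other ranks never reach, the NCCL operation can never complete, and after the timeout every rank is killed with SIGABRT, as in the log above.

# hang_example.py - illustrative only, this deliberately hangs
# launch with: torchrun --nproc_per_node=2 hang_example.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    x = torch.ones(1, device="cuda")

    # BUG: the all_reduce is only issued on rank 0, so the other rank never
    # joins the collective, the NCCL operation never completes, and the
    # watchdog eventually aborts all processes with SIGABRT.
    if rank == 0:
        dist.all_reduce(x)

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()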

Troubleshooting tips:

  • Write a mini script that simply initializes distributed and runs a few collectives, to check that your GPUs can talk to each other (see the sketch after this list)
  • Make a list of all modifications made on top of the provided code, check that all new collectives are executed in the same order on all ranks
  • Ensure that the modified code runs on a single GPU
  • Follow https://pytorch.org/docs/master/distributed.html#debugging-torch-distributed-applications
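For the first tip, a sanity-check script could look something like this (a minimal sketch assuming a single node with 4 GPUs; the filename is made up). If even this hangs, the problem is in the environment (NCCL, drivers, interconnect) rather than in the training code. Running it with NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL, as described in the debugging docs linked above, makes NCCL and torch.distributed log what each rank is doing.

# nccl_sanity_check.py - run with: torchrun --nproc_per_node=4 nccl_sanity_check.py
# optionally: NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun ...
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # all_reduce: every rank contributes its rank id, so afterwards every
    # rank should hold 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([float(rank)], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    expected = world_size * (world_size - 1) / 2
    print(f"[rank {rank}] all_reduce ok: got {t.item()}, expected {expected}")

    # all_gather: collect one tensor per rank to exercise a second collective.
    gathered = [torch.zeros(1, device="cuda") for _ in range(world_size)]
    dist.all_gather(gathered, t)
    print(f"[rank {rank}] all_gather ok: {[g.item() for g in gathered]}")

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this script completes on all ranks but the training run still hangs, the next suspect is a collective that is reached on some ranks but not others, i.e. the second tip.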

baldassarreFe · Aug 08 '24 22:08

Has this been solved?

DarkJokers · Sep 23 '24 07:09