
Watchdog caught collective operation timeout: WorkNCCL

Open · DavisMeee opened this issue on Jun 26, 2025 · 0 comments

[rank1]:[E626 06:24:44.903881913 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=808, OpType=ALLREDUCE, NumelIn=9801523, NumelOut=9801523, Timeout(ms)=600000) ran for 600020 milliseconds before timing out.
[rank1]:[E626 06:24:44.906457629 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 808, last enqueued NCCL work: 811, last completed NCCL work: 807.
[rank0]:[E626 06:24:44.159932949 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=808, OpType=ALLREDUCE, NumelIn=9801523, NumelOut=9801523, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
[rank0]:[E626 06:24:44.160366013 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 808, last enqueued NCCL work: 811, last completed NCCL work: 807.
[rank2]:[E626 06:24:44.173078629 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=808, OpType=ALLREDUCE, NumelIn=9801523, NumelOut=9801523, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
[rank2]:[E626 06:24:44.173689731 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 808, last enqueued NCCL work: 811, last completed NCCL work: 807.
[rank1]:[E626 06:24:46.196456191 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 808, last enqueued NCCL work: 811, last completed NCCL work: 807.
[rank1]:[E626 06:24:46.196487002 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E626 06:24:46.196494202 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E626 06:24:46.197938165 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=808, OpType=ALLREDUCE, NumelIn=9801523, NumelOut=9801523, Timeout(ms)=600000) ran for 600020 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1729647352509/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0b7c76b446 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f0b290267f2 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f0b2902dc33 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f0b2902f69d in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f0b823b15c0 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f0b8cc94ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7f0b8cd26a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank0]:[E626 06:24:46.205045570 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 808, last enqueued NCCL work: 811, last completed NCCL work: 807.
[rank0]:[E626 06:24:46.205102526 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E626 06:24:46.205111251 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E626 06:24:46.206491346 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=808, OpType=ALLREDUCE, NumelIn=9801523, NumelOut=9801523, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1729647352509/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff56db6b446 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ff51a4267f2 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff51a42dc33 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff51a42f69d in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7ff5737f55c0 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7ff57e094ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7ff57e126a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E626 06:24:46.322154692 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 808, last enqueued NCCL work: 811, last completed NCCL work: 807.
[rank2]:[E626 06:24:46.322184713 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E626 06:24:46.322192411 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E626 06:24:46.323698949 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=808, OpType=ALLREDUCE, NumelIn=9801523, NumelOut=9801523, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1729647352509/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f606cf6b446 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f60198267f2 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f601982dc33 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f601982f69d in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6072bf55c0 in /opt/conda/envs/env/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f607d494ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7f607d526a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0626 06:24:59.641000 96890 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 96929 closing signal SIGTERM
W0626 06:24:59.644000 96890 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 96930 closing signal SIGTERM
W0626 06:24:59.644000 96890 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 96932 closing signal SIGTERM
/opt/conda/envs/env/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 26 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/envs/env/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 26 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/envs/env/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 26 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0626 06:25:00.677000 96890 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 2 (pid: 96931) of binary: /opt/conda/envs/env/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/env/lib/python3.10/site-packages/torch/distributed/launch.py", line 208, in <module>
    main()
  File "/opt/conda/envs/env/lib/python3.10/site-packages/typing_extensions.py", line 2853, in wrapper
    return arg(*args, **kwargs)
  File "/opt/conda/envs/env/lib/python3.10/site-packages/torch/distributed/launch.py", line 204, in main
    launch(args)
  File "/opt/conda/envs/env/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in launch
    run(args)
  File "/opt/conda/envs/env/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/opt/conda/envs/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
DiffIR/train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-06-26_06:24:59
  host      : normandy-mlflow-debug-hpdras-027691-cd92s
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 96931)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 96931
=======================================================
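
For context on what the log shows: the allreduce with SeqNum=808 never completed within the default 10-minute NCCL timeout (Timeout(ms)=600000), the watchdog then aborted every rank (SIGABRT, exitcode -6), and the elastic launcher reported rank 2 as the first observed failure. This usually means one rank stalled before or inside that collective (e.g. a data-loading hang, rank-dependent validation/logging work, or an OOM on one GPU) while the others sat waiting in the allreduce.

As a minimal sketch of one common mitigation (not a confirmed fix for this issue), the collective timeout can be raised where the process group is initialized, and NCCL errors can be surfaced as Python exceptions instead of silent hangs. The snippet assumes you control the `init_process_group` call in the training entry point (in BasicSR/DiffIR that would be the distributed-init helper invoked by `DiffIR/train.py`) and that the launcher exports `LOCAL_RANK`, which `torchrun` and recent `torch.distributed.launch` both do:

```python
# Sketch only: raise the NCCL collective timeout and fail fast on hangs.
# Adapt to wherever the training script actually initializes distributed state.
import datetime
import os

import torch
import torch.distributed as dist

# Surface NCCL errors/timeouts as Python exceptions instead of hanging
# (older PyTorch releases use the NCCL_ASYNC_ERROR_HANDLING name instead).
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")


def init_nccl(timeout_minutes: int = 30) -> None:
    """Initialize the default process group with a longer collective timeout."""
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun / elastic launch
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        # Default is 10 minutes, which is what expired at SeqNum=808 above.
        timeout=datetime.timedelta(minutes=timeout_minutes),
    )
```

A longer timeout only hides the symptom if one rank is genuinely stuck, so it is probably worth checking what rank 2 was doing around iteration 808 (per-rank validation, dataloader workers, GPU memory) before relying on it. Separately, the traceback shows the run was started through `torch.distributed.launch`, which is deprecated; launching with `torchrun` gives the same elastic behavior with the current entry point.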
