
NCCL Timeout

aditya-sanas opened this issue 7 months ago

Describe the bug

I am getting an NCCL timeout while training the model. The run usually goes for about 40k epochs and then fails with the error below:

[rank2]:[E513 13:25:57.714781669 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank1]:[E513 13:25:57.714777142 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank0]:[E513 13:25:57.714786872 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532341, OpType=_ALLGATHER_BASE, NumelIn=513, NumelOut=2052, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
[rank2]:[E513 13:25:57.715039178 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank2]:[E513 13:25:57.715055330 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank2]:[E513 13:25:57.715061964 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E513 13:25:57.715068256 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E513 13:25:57.715087695 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 532341, last enqueued NCCL work: 532341, last completed NCCL work: 532340.
[rank1]:[E513 13:25:57.715092955 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank0]:[E513 13:25:57.715112458 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 532341, last enqueued NCCL work: 532341, last completed NCCL work: 532340.
[rank1]:[E513 13:25:57.715120352 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank1]:[E513 13:25:57.715130246 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E513 13:25:57.715136289 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E513 13:25:57.715137895 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E513 13:25:57.715143973 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E513 13:25:57.716347552 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6aa9ca0446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6aaafa5672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6aaafacab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6aaafae51d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6af7bd95c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6afacee609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6afaab9353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E513 13:25:57.716872543 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532341, OpType=_ALLGATHER_BASE, NumelIn=513, NumelOut=2052, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f75e5bd2446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f75e6ed7672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f75e6edeab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f75e6ee051d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f7633b0b5c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f7636c20609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f76369eb353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E513 13:25:57.716878046 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f35c3c446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f8f36f41672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8f36f48ab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8f36f4a51d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f8f83b755c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f8f86c8a609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f8f86a55353 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Steps/Code to reproduce bug

# NOTE: the imports and the hydra_runner decorator were not included in the original report;
# they follow NeMo's ASR training examples and may need adjusting for your NeMo version.
import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.asr.models import EncDecCTCModelBPE
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager
from nemo.utils.trainer_utils import resolve_trainer_cfg


@hydra_runner(config_path="conf", config_name="config")  # placeholder values; actual config not provided
def main(cfg):
    logging.info(f'Hydra config: {OmegaConf.to_yaml(cfg)}')

    trainer = pl.Trainer(**resolve_trainer_cfg(cfg.trainer))
    exp_manager(trainer, cfg.get("exp_manager", None))
    asr_model = EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)

    # Initialize the weights of the model from another model, if provided via config
    asr_model.maybe_init_from_pretrained_checkpoint(cfg)

    trainer.fit(asr_model)

    if hasattr(cfg.model, 'test_ds') and cfg.model.test_ds.manifest_filepath is not None:
        if asr_model.prepare_test(trainer):
            trainer.test(asr_model)


if __name__ == '__main__':
    main()  # cfg is injected by the hydra_runner decorator

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

Training should continue past this point and run to completion without NCCL collective timeouts.

Environment overview

  • Environment location: Bare-metal
  • Method of NeMo install: poetry install

Environment details


  • OS version: Ubuntu 20.04.6 LTS
  • PyTorch version: 2.5.1+cu118
  • Python version: 3.10


aditya-sanas · May 13 '25 16:05

@gautham-kollu: Can we please check this?

aditya-sanas · May 15 '25 09:05

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · Jun 20 '25 02:06

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Jun 27 '25 02:06

@aditya-sanas Have you solved this problem?

jimmysue · Sep 10 '25 08:09

AI generated, please verify.

To resolve NCCL timeout errors occurring after extended training runs, try these solutions:

  1. Tune NCCL's error-handling and network settings by exporting these environment variables before launching training. Note that these do not raise the 30-minute collective timeout itself (the Timeout(ms)=1800000 in the log is the process-group timeout; see item 2 for that), and recent PyTorch versions spell the first variable TORCH_NCCL_ASYNC_ERROR_HANDLING:

    export NCCL_ASYNC_ERROR_HANDLING=1
    export NCCL_IB_TIMEOUT=30
    export NCCL_SOCKET_NTHREADS=1
    export NCCL_NSOCKS_PERTHREAD=4
    
  2. If using the PyTorch Lightning trainer, increase the distributed process-group timeout. pl.Trainer has no distributed_timeout_minutes argument; the timeout is passed through the DDP strategy (a fuller sketch wired into the reproduction script above follows this list):

    from datetime import timedelta
    from pytorch_lightning.strategies import DDPStrategy

    trainer = pl.Trainer(
        strategy=DDPStrategy(timeout=timedelta(minutes=60)),  # default is 30 minutes
        ...
    )
    
  3. Add regular checkpointing to your training configuration to enable resuming from failures.

  4. Consider reducing batch size or implementing gradient accumulation to decrease memory pressure.

  5. For persistent issues, enable NCCL debugging with:

    export NCCL_DEBUG=INFO
    

These timeouts typically stem from network instability, system resource constraints, or memory pressure during long runs, but they can also mean the ranks have drifted out of sync, for example when one rank stalls (often in data loading or on an uneven last batch) while the others sit waiting in a collective. The mismatched sequence numbers in the log above (rank 0 waiting at SeqNum=532341, ranks 1 and 2 at 532339) point more toward rank desynchronization than a pure network outage.
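
If you want to find out which rank is falling behind before the 30-minute watchdog fires, one option (not from this thread, just a sketch) is to run a short monitored barrier over a side gloo process group at a known-safe point, e.g. once per validation epoch. The helper name check_ranks_in_sync and the 120-second timeout are made up for illustration; torch.distributed.monitored_barrier requires the gloo backend, which is why a separate group is created:

    from datetime import timedelta

    import torch.distributed as dist

    _gloo_group = None  # side process group; monitored_barrier is a gloo-only API


    def check_ranks_in_sync(timeout_s: float = 120.0) -> None:
        """Raise on rank 0, listing the ranks that never reached this point."""
        global _gloo_group
        if not (dist.is_available() and dist.is_initialized()):
            return
        if _gloo_group is None:
            # new_group must be called collectively by all ranks.
            _gloo_group = dist.new_group(backend="gloo")
        dist.monitored_barrier(
            group=_gloo_group,
            timeout=timedelta(seconds=timeout_s),
            wait_all_ranks=True,  # report every straggler, not just the first
        )

Calling this sparingly (it adds a synchronization point) helps narrow down whether the hang comes from one rank's data pipeline or from the network.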

zhenyih · Sep 10 '25 13:09