add tp example

Open inkcherry opened this issue 8 months ago • 3 comments

FYI, @hwchen2017

inkcherry avatar Apr 07 '25 08:04 inkcherry

I'm unable to get this to work.

First I run: bash run.sh zero2 (all of the options fail with the same error)

Time to load fused_adam op: 0.1688675880432129 seconds
[rank4]:[E417 20:43:19.177767953 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank4]:[E417 20:43:19.178407423 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 291, last completed NCCL work: -1.
[rank4]:[E417 20:43:19.178477173 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 4] Timeout at NCCL work: 1, last enqueued NCCL work: 291, last completed NCCL work: -1.
[rank4]:[E417 20:43:19.178491283 ProcessGroupNCCL.cpp:630] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E417 20:43:19.178503043 ProcessGroupNCCL.cpp:636] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E417 20:43:19.180988563 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1729647429097/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a237f9c8446 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7a232e5f14d2 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a232e5f8913 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a232e5fa37d in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7a2386b785c0 in /home/erikg/micromamba/envs/deepseed-examples/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9caa4 (0x7a238769caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7a2387729c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=131076096, NumelOut=131076096, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.

What am I doing wrong?
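
To rule out an environment problem rather than something specific to the example, the failing collective can be reproduced outside of DeepSpeed. Below is a minimal sketch of such a check; it assumes a single node launched with torchrun (e.g. torchrun --nproc_per_node=8 nccl_check.py, where the file name and GPU count are placeholders) and performs the same kind of NCCL broadcast (OpType=BROADCAST) that times out above.

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK for each process on the node
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Broadcast a tensor from rank 0, mirroring the collective that hangs above.
    # If this also stalls until the watchdog fires, the CUDA/NCCL setup is the
    # problem rather than the DeepSpeed example itself.
    t = torch.full((1024,), float(dist.get_rank()), device="cuda")
    dist.broadcast(t, src=0)
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: broadcast ok, first element = {t[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()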

ekg avatar Apr 17 '25 21:04 ekg

Hi ekg, if standard ZeRO-1/2 also fails to run properly, it may be due to an incorrect configuration of your CUDA and NCCL versions.
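
A quick way to see which CUDA and NCCL versions PyTorch itself was built against, and whether the GPUs are visible at all, is the short check below (a sketch, not part of the example; run it once inside the same environment that launches run.sh):

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA (PyTorch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
print("GPUs visible:", torch.cuda.device_count())

If the CUDA version reported here does not match the toolkit installed on the system, or NCCL fails to report a version, that mismatch would be the first thing to fix.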

inkcherry avatar Apr 18 '25 00:04 inkcherry

@hwchen2017 just a reminder in case you missed this~ thanks.

inkcherry avatar Apr 18 '25 00:04 inkcherry