
Multi GPU timeout on save checkpoint (WorkNCCL, Watchdog, timeout)

[Open] albertbn opened this issue 1 year ago · 28 comments

hey,

thanks for providing the torchtune framework,

I have an issue with a timeout when saving a checkpoint for Llama 3.1 70B LoRA on multiple GPUs,

I am tuning on an AWS EC2 instance with 8xV100 GPUs, each with 32GB of memory,

let me know if you need additional error trace or info,

thanks in advance

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
[rank4]:[E1129 11:56:57.204156533 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank5]:[E1129 11:56:57.205862232 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
[rank5]:[E1129 11:56:57.205930931 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank7]:[E1129 11:56:57.218043510 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
[rank7]:[E1129 11:56:57.218122951 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank2]:[E1129 11:56:57.234345411 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
[rank2]:[E1129 11:56:57.234404509 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank3]:[E1129 11:56:57.239953363 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
[rank3]:[E1129 11:56:57.240043594 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank6]:[E1129 11:56:57.241972126 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
[rank6]:[E1129 11:56:57.242056510 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank1]:[E1129 11:56:57.243483105 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600062 milliseconds before timing out.
[rank1]:[E1129 11:56:57.243538286 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank3]:[E1129 11:56:57.379910257 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank3]:[E1129 11:56:57.379954199 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1129 11:56:57.379960087 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[E1129 11:56:57.382595394 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7af8b44ee446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7af8697cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7af8697d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7af8697d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7af8b46555c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7af8b509ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7af8b5129c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7af8b44ee446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7af8697cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7af8697d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7af8697d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7af8b46555c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7af8b509ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7af8b5129c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7af8b44ee446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7af86944271b in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7af8b46555c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x9ca94 (0x7af8b509ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7af8b5129c3c in /lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[E1129 11:56:57.450086063 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 5] Timeout at NCCL work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank5]:[E1129 11:56:57.450113740 ProcessGroupNCCL.cpp:630] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E1129 11:56:57.450119892 ProcessGroupNCCL.cpp:636] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[E1129 11:56:57.451711521 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7478f478d446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7478a9bcc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7478a9bd3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7478a9bd561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7478f48f45c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7478f529ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7478f5329c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7478f478d446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7478a9bcc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7478a9bd3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7478a9bd561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7478f48f45c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7478f529ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7478f5329c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7478f478d446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7478a984271b in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7478f48f45c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x9ca94 (0x7478f529ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7478f5329c3c in /lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[E1129 11:56:57.506027320 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 6] Timeout at NCCL work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank6]:[E1129 11:56:57.506049967 ProcessGroupNCCL.cpp:630] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E1129 11:56:57.506056263 ProcessGroupNCCL.cpp:636] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[E1129 11:56:57.507603862 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e4f26b12446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7e4edbdcc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e4edbdd3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e4edbdd561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7e4f26c795c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7e4f2769ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7e4f27729c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e4f26b12446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7e4edbdcc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e4edbdd3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e4edbdd561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7e4f26c795c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7e4f2769ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7e4f27729c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e4f26b12446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7e4edba4271b in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7e4f26c795c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x9ca94 (0x7e4f2769ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7e4f27729c3c in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E1129 11:56:57.531282495 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank2]:[E1129 11:56:57.531302843 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1129 11:56:57.531308394 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[E1129 11:56:57.532790938 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ef41631a446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ef3cb5cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ef3cb5d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ef3cb5d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7ef4164815c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7ef416e9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7ef416f29c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ef41631a446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ef3cb5cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ef3cb5d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ef3cb5d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7ef4164815c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7ef416e9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7ef416f29c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ef41631a446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7ef3cb24271b in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7ef4164815c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x9ca94 (0x7ef416e9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7ef416f29c3c in /lib/x86_64-linux-gnu/libc.so.6)

[rank7]:[E1129 11:56:57.642513817 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 7] Timeout at NCCL work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank7]:[E1129 11:56:57.642537731 ProcessGroupNCCL.cpp:630] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E1129 11:56:57.642560031 ProcessGroupNCCL.cpp:636] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[E1129 11:56:57.644117982 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x744df9fca446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x744daf3cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x744daf3d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x744daf3d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x744dfa1315c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x744dfaa9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x744dfab29c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x744df9fca446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x744daf3cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x744daf3d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x744daf3d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x744dfa1315c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x744dfaa9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x744dfab29c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x744df9fca446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x744daf04271b in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x744dfa1315c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x9ca94 (0x744dfaa9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x744dfab29c3c in /lib/x86_64-linux-gnu/libc.so.6)

W1129 11:56:59.150000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29574 closing signal SIGTERM
W1129 11:56:59.151000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29575 closing signal SIGTERM
W1129 11:56:59.152000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29576 closing signal SIGTERM
W1129 11:56:59.153000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29578 closing signal SIGTERM
W1129 11:56:59.154000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29579 closing signal SIGTERM
W1129 11:56:59.155000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29580 closing signal SIGTERM
W1129 11:56:59.156000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29581 closing signal SIGTERM
E1129 11:57:07.249000 29471 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 29577) of binary: /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/bin/tune", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torchtune/_cli/run.py", line 206, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)
  File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torchtune/_cli/run.py", line 95, in _run_distributed
    run(args)
  File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/recipes/lora_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-29_11:56:59
  host      : ip-172-31-12-154.us-west-2.compute.internal
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 29577)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 29577

albertbn avatar Nov 29 '24 20:11 albertbn

@albertbn, sorry that you hit this issue. We are working on implementing distributed async checkpointing, which should avoid problems like this in the future. Meanwhile, one thing you can do is modify the recipe so it doesn't save the recipe state.

The recipe state is about 2x larger than the model, so it can take quite a while to save.

TLDR: set this to False: https://github.com/pytorch/torchtune/blob/32e265d5749fd592711a03247486eafa6c898d94/recipes/full_finetune_distributed.py#L702
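
A rough sketch of the kind of one-line recipe edit being suggested; the exact variable name and surrounding code may differ in your torchtune version, so treat this as illustrative only:

# Hypothetical patch inside the recipe's save_checkpoint() method:
# forcing the intermediate-checkpoint flag to False skips gathering and writing the
# optimizer/recipe state (roughly 2x the model size), so only the model weights are
# saved and the blocking collective is far less likely to hit the NCCL watchdog timeout.
intermediate_checkpoint = False  # original code derives this from epoch / total_epochs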

felipemello1 avatar Nov 30 '24 17:11 felipemello1

Thanks @felipemello1, your suggestions helped me resolve the issue. Here are the TL;DR and details:

TLDR

In a recipe copied from this, I edited the init_process_group(...) line near the end of the file to read:

from datetime import timedelta
...
timeout_long_nccl = timedelta(seconds=6000)  # 100 minutes
init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl", timeout=timeout_long_nccl)

which extends the NCCL timeout 10x - to 100 minutes from the default 10 minutes

...
INFO:torchtune.utils._logging:Saving checkpoint took 965.61 secs
...

it takes approximately 16 minutes to save the checkpoint for Llama 3.1 70B, since the LoRA adapters are merged into the full weights. The times reported are on 8xV100 (32GB) GPUs, in addition to plenty of CPU cores and RAM

===

Details

  • I copied this to a local file, train_lora_distr.py, at the same level as my custom config
  • I copied the 70B LoRA config via tune cp llama3_1/70B_lora ./llama70b-instruct-config.yaml as my custom config
  • I followed the model download instructions from the config and set up the desired flags, model, tokenizer, directories, epochs, etc. in ./llama70b-instruct-config.yaml
  • Eventually I ran tune run --nproc_per_node 8 train_lora_distr --config ./llama70b-instruct-config.yaml, which completed successfully

Note that tune run ... doesn't accept the ./train_lora_distr.py syntax; it expects just the module name, train_lora_distr

Cheers

albertbn avatar Dec 01 '24 11:12 albertbn

Nice! I am glad that you figured it out. Thanks for the detailed steps. I will check if we need to fix those in the upcoming distributed ckpt PR.

fyi, you may need to hack your ckpt a bit if you are going to use it with HF and vllm. We have a PR that should land in the next couple of days that will fix it. More info here if you need the fix now: https://github.com/pytorch/torchtune/issues/2048

@albertbn

felipemello1 avatar Dec 02 '24 02:12 felipemello1

cc: @joecummings fyi

felipemello1 avatar Dec 02 '24 02:12 felipemello1

Note that tune run ... doesn't accept the ./train_lora_distr.py syntax; it expects just the module name, train_lora_distr

This is supposed to work. I think there's a bug with the way we process the path. If you do "train_lora_distr.py" instead it should work, omitting the "./". We support "/" but the "./" pattern is causing the break. We'll fix that going forward.

pbontrager avatar Dec 02 '24 16:12 pbontrager

Nice! I am glad that you figured it out. Thanks for the detailed steps. I will check if we need to fix those in the upcoming distributed ckpt PR.

fyi, you may need to hack your ckpt a bit if you are going to use it with HF and vllm. We have a PR that should land in the next couple of days that will fix it. More info here if you need the fix now: #2048

@albertbn

I've been having a frustrating time trying to use my trained model for inference. Skipping right to the question: I really like the ideas implemented in torchtune's generate.py script, and I've used it successfully with an 8B Llama tuned version on a single GPU. For the 70B I obviously need distributed inference. The gpt-fast repo mentioned in the torchtune examples seems to have low maintenance activity, which is a pity since I'd really prefer native torch inference, following their blogs. gpt-fast fails with a model params mismatch; I've opened an issue there. Do you have any leads on how to modify torchtune's generate.py to support distributed inference? Speed optimization is less of an issue for me currently; I just want to generate some text to evaluate the quality of the tuned model.

thanks

albertbn avatar Dec 04 '24 16:12 albertbn

@albertbn , would it be worth trying to run your inference with vllm? We will have some more documentation on it soon. I don't think that hacking generate.py to run distributed will be a low lift. @joecummings , please correct me if i am wrong.

felipemello1 avatar Dec 04 '24 16:12 felipemello1

I tried, not very hard. I haven't explored their code, but the first error I got was when trying to provide a local path for a checkpoint and they expected only a Hugging Face path, which is ridiculous.

albertbn avatar Dec 04 '24 16:12 albertbn

@albertbn I tried it locally and it worked, but I didn't try distributed. Take a look at this PR that I am trying to land soon: https://github.com/pytorch/torchtune/issues/2048

from vllm import LLM, SamplingParams


def print_outputs(outputs):
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    print("-" * 80)


# llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")

llm = LLM(
    model="/tmp/llama_3_2_1b/lora_single_device/base_model",
    load_format="safetensors",
    kv_cache_dtype="auto",
)
sampling_params = SamplingParams(max_tokens=16, temperature=0.5)
# In this script, we demonstrate how to pass input to the chat method:

conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]
outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)
print_outputs(outputs)

# You can run batch inference with llm.chat API
conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]
conversations = [conversation for _ in range(3)]

# We turn on tqdm progress bar to verify it's indeed running batch inference
outputs = llm.chat(
    messages=conversations, sampling_params=sampling_params, use_tqdm=True
)
print_outputs(outputs)

felipemello1 avatar Dec 04 '24 17:12 felipemello1

hey, thanks for the suggestion and sorry for the late reply,

I've made some progress with the distributed inference, but still haven't been able to generate text:

  • As a reminder - I've used torchtune successfully to LoRA-tune a Llama 70B Instruct (by solving the timeout-on-save issue from above)

  • I then used a manual script to convert the .pt weight files to .safetensors. If needed I can include the script

  • I also copied some JSON config files from the original Llama model directory on HF. I had to change the dtype in config.json from bfloat16 to float16, since vLLM complains that bfloat16 is not supported on an older GPU (compute capability 7.0) - I have access to a machine with 8xV100 32GB GPUs (a sketch of this edit follows the list)

  • I then ran the vLLM script below - the model loads but then fails at generation, possibly because of the older GPUs (V100) that vLLM doesn't support - I really have no idea...
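
A minimal sketch of the config.json dtype edit mentioned in the third bullet (the path is a placeholder; the relevant Hugging Face key is torch_dtype). The vLLM script that was actually run follows after it:

import json

cfg_path = "/path/to/trained/llama_31_70b_instruct/config.json"  # placeholder path

with open(cfg_path) as f:
    cfg = json.load(f)

# V100s (compute capability 7.0) lack bf16 support in vLLM's kernels, so fall back to fp16
cfg["torch_dtype"] = "float16"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)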

from vllm import LLM, SamplingParams

PATH = "/home/ubuntu/projects/models"
checkpoint_dir = f'{PATH}/trained/llama_31_70b_instruct'

tensor_parallel_size = 8

llm = LLM(
    model=checkpoint_dir,
    load_format="safetensors",
    kv_cache_dtype="auto",
    tensor_parallel_size=tensor_parallel_size,
    disable_custom_all_reduce=True
)

conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

sampling_params = SamplingParams(max_tokens=16, temperature=0.6)

outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
/home/ubuntu/.pyenv/versions/3.12.4/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Aborted (core dumped)

albertbn avatar Dec 11 '24 14:12 albertbn

@albertbn From a quick search, it looks like you might need to turn off chunked prefill because of an issue on the vLLM side: https://github.com/vllm-project/vllm/issues/6723.

joecummings avatar Dec 11 '24 14:12 joecummings

Confirming that @joecummings's suggestion works.

I had to add an additional flag, max_model_len (in addition to enable_chunked_prefill).

Including the working script for the LoRA-tuned Llama 70B Instruct from torchtune, with the HF .pt weights converted to .safetensors. Also enclosing the conversion script (written with ChatGPT), which should be run before the vLLM one.

Thanks again for all the help,

ps. If I am able to get vLLM to run with bfloat16 instead of float16 on V100 GPUs, I'll report back

conversion to .safetensors script

import os
import json
import shutil
import torch
from safetensors.torch import save_file

PATH = os.path.dirname(os.path.abspath(__file__))
checkpoint_dir = 'path/to/trained/model/with/.pt/weights/from/torchtune'
original_llama_weights_path = f'{PATH}/llama_31_70b_instruct'

# List of files to copy
files_to_copy = [
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    # "config.json",
    "generation_config.json",  # Optional but recommended
    "README.md",               # Optional but useful
    "LICENSE"                  # Optional but useful
]

# Output index file!
index_file = os.path.join(checkpoint_dir, "model.safetensors.index.json")

# Initialize the output dictionary
output_dict = {"weight_map": {}, "metadata": {"total_size": 0}}

# Iterate over the checkpoint shards (30 .pt files for the 70B HF checkpoint) and convert each to .safetensors
total_size = 0
for i in range(1, 31):
    # Convert from .pt to .safetensors
    pt_file = f"hf_model_{str(i).zfill(4)}_0.pt"
    safetensor_file = f"hf_model_{str(i).zfill(4)}_0.safetensors"
    pt_path = os.path.join(checkpoint_dir, pt_file)
    safetensor_path = os.path.join(checkpoint_dir, safetensor_file)  # in same dir - hope error shouting here is bearable

    print(f"Converting {pt_path} to {safetensor_path}...")

    # Load the state dictionary from the .pt file
    state_dict = torch.load(pt_path, map_location="cpu", weights_only=True)

    # Save the state dictionary to the .safetensors file
    save_file(state_dict, safetensor_path)

    # Add key mappings to the weight_map
    for key in state_dict.keys():
        output_dict["weight_map"][key] = safetensor_file

    # Calculate the file size and add it to the total
    total_size += os.path.getsize(safetensor_path)

# Add the total size to the metadata
output_dict["metadata"]["total_size"] = total_size
print(f"Total size: {total_size} bytes")

# Save the index.json file
with open(index_file, "w") as f:
    json.dump(output_dict, f, indent=2)

print(f"Index file created: {index_file}")

# ---
# DONE - cp all config stuff from models/
# Copy the files
for file_name in files_to_copy:
    src_path = os.path.join(original_llama_weights_path, file_name)
    dest_path = os.path.join(checkpoint_dir, file_name)

    if os.path.exists(src_path):
        shutil.copy(src_path, dest_path)
        print(f"Copied: {src_path} -> {dest_path}")
    else:
        print(f"File not found: {src_path}")

print("File copy operation completed.")

vllm distributed inference

from vllm import LLM, SamplingParams

PATH = "/home/ubuntu/projects/models"
checkpoint_dir = f'{PATH}/trained/llama_31_70b_instruct'

def print_outputs(outputs):
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    print("-" * 80)

tensor_parallel_size = 8

llm = LLM(
    model=checkpoint_dir,
    load_format="safetensors",
    kv_cache_dtype="auto",
    tensor_parallel_size=tensor_parallel_size,
    disable_custom_all_reduce=True,
    enable_chunked_prefill=False,
    max_model_len=2**14
)

conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

sampling_params = SamplingParams(max_tokens=512, temperature=0.6)

outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)
print_outputs(outputs)

albertbn avatar Dec 11 '24 15:12 albertbn

@albertbn, this is super nice! :)

To be clear, if you rebase to main or install the nightlies, you shouldn't have to convert anything from .pt to .safetensors anymore. The adapters/model are automatically saved as .safetensors.

felipemello1 avatar Dec 11 '24 15:12 felipemello1

hey, any leads on how to load a Meta Llama 3.1 405B (the original one from HF) with vLLM?

I am getting GPU out of memory. The data type in the original config is bfloat16. I have an upgraded machine with 8xH100 GPUs, each with 80GB of memory. The training works fine, loading the model and saving only the adapter, which I guess I should use later as vLLM recommends here
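
For reference, the adapter-on-top-of-a-base-model pattern from the vLLM docs looks roughly like the sketch below; the model name, adapter path, and tensor-parallel size are placeholders rather than values from this thread, and whether a 405B base fits alongside LoRA on a given machine is a separate question:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model served by vLLM with LoRA support enabled (paths/names are hypothetical)
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct-FP8",
    enable_lora=True,
    tensor_parallel_size=8,
    # max_lora_rank may need to be raised if the saved adapter uses a larger rank
)

sampling_params = SamplingParams(max_tokens=64, temperature=0.6)

# The saved adapter directory is attached per request: (name, unique int id, local path)
outputs = llm.generate(
    ["Write an essay about the importance of higher education."],
    sampling_params,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/saved/adapter"),
)
print(outputs[0].outputs[0].text)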

thanks in advance

albertbn avatar Dec 23 '24 12:12 albertbn

Update: following this issue, I was able to load the fp8 meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on 8xH100 GPUs with vLLM.

So it's not quite clear to me what the recipe example implies: it uses the regular meta-llama/Meta-Llama-3.1-405B-Instruct weights, which I've failed to load on that machine with vLLM, and I haven't found any solution that could do it without CPU offloading/mapping.

Can I adapt the above recipe to QLoRA-tune meta-llama/Meta-Llama-3.1-405B-Instruct-FP8?

thanks

albertbn avatar Dec 23 '24 15:12 albertbn

hey @albertbn , I am having a bit of trouble understanding exactly what you are asking. Let me repeat back to you what I got:

  1. You trained 405B QLoRA using torchtune without problems. Your adapter is in bf16, the base model was quantized to nf4 during training, and you saved only the adapters
  2. You were able to successfully load the 405B as fp8, and you are also able to load the adapter
  3. The question is whether it's OK to run the model in fp8

Is that what you are asking?

If so, I believe there are many articles showing that, for inference, fp8 causes no performance degradation, AFAIK. But maybe you could evaluate it and see if the performance is what you expect.

felipemello1 avatar Dec 23 '24 15:12 felipemello1

Hey, sorry for the confusion. I trained the 405B with QLoRA using the full 405B Instruct model from Meta and saved just the adapters. Exactly as the 405B config provided by torchtune, no changes.

Then, for inference with vLLM, I was unable to load the original Meta 405B Instruct model, receiving OOM. The way I understand it, I should load the original weights with vLLM and then add the adapter as shown in their documentation. I also tried loading the original Meta weights as fp8, which failed as well.

The only model I managed to load with vLLM was the 405B Instruct FP8 original version by Meta, downloaded separately from Hugging Face.

So what I am asking is: can I change the config to QLoRA-tune Meta's 405B Instruct FP8 instead of their 405B Instruct as shown in the torchtune config?

albertbn avatar Dec 23 '24 18:12 albertbn

To avoid any confusion, here is a brief clarification:

  • meta-llama/Llama-3.1-405B-Instruct - fine-tuned with QLoRA, saving adapters only, as per the original torchtune 405B config. Unable to load the weights with vLLM in any way - hitting OOM
  • meta-llama/Llama-3.1-405B-Instruct-FP8 - able to load with vLLM. Question: can I change the 405B config to tune with QLoRA on that one?

albertbn avatar Dec 23 '24 19:12 albertbn

@albertbn, I am not sure. @ebsmothers , do you know if we can easily replace the quantize_base in QLoRA with float8 instead of nf4?

I have my doubts, though, about how much extra performance you can get from it. It might not be worth it. Did you find an article showing someone fine-tuning with an fp8 base and it being better?

felipemello1 avatar Dec 24 '24 15:12 felipemello1

@albertbn , related: https://github.com/pytorch/torchtune/issues/2201 . Let's maybe keep the convo there and close the issue here, since it's not about NCCL anymore. What do you think?

felipemello1 avatar Dec 24 '24 15:12 felipemello1

@felipemello1

do you know if we can easily replace the quantize_base in QLoRA with float8 instead of nf4?

From what I know, existing solutions in torchao do not support this use case. torchao.float8 does not support keeping only the FP8 weight. There is float8_dynamic_activation_float8_weight, which creates an AffineQuantizedTensor subclass, but it doesn't define back-prop, so we can't train with it.

It should be straightforward to implement; no one has created it in torchao yet.
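
For illustration, here is an inference-only sketch of the torchao API mentioned above, assuming a recent torchao release and fp8-capable hardware (e.g. H100); the toy module is a stand-in, and as noted this path defines no backward pass, so it cannot serve as a QLoRA base the way nf4 does:

import torch
import torch.nn as nn
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

# Toy module standing in for a real model; fp8 matmuls require recent GPUs (H100-class)
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

# Replace Linear weights with the fp8 AffineQuantizedTensor representation (dynamic fp8 activations)
quantize_(model, float8_dynamic_activation_float8_weight())

# Inference only: the quantized tensor subclass does not define back-prop
with torch.no_grad():
    out = model(torch.randn(2, 1024, dtype=torch.bfloat16, device="cuda"))
print(out.shape)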

gau-nernst avatar Dec 25 '24 01:12 gau-nernst

I am still seeing this issue with distributed LoRA on Llama-3.1-70B. I tried following https://github.com/pytorch/torchtune/issues/2093#issuecomment-2509733176 to increase the timeout. I do have sufficient CPU RAM to save the model checkpoints. I am unsure how to debug this. I'd appreciate any help.

Here is the traceback of the error.


Traceback (most recent call last):
  File "/home/pc/.conda/envs/testenv/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
    parser.run(args)
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
    args.func(args)
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
    run(args)
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/run.py", line 880, in run
    elastic_launch(
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_lora_distributed FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-04_08:47:49
  host      : dldev01.host.com
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 216824)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 216824

pranathichunduru avatar Feb 04 '25 18:02 pranathichunduru

@pranathichunduru , can you please share a longer log?

felipemello1 avatar Feb 04 '25 19:02 felipemello1

@pranathichunduru , can you please share a longer log?

This is from the start of the process:


Setting manual seed to local seed 2241663802. Local seed is seed + rank = 2241663802 + 0
Writing logs to /home/pc/pc/src/scratch/NLME-Darwin/torchtune/llama3_1_70B/lora/logs/log_1738697935.txt
FSDP is enabled. Instantiating model and loading checkpoint on Rank 0 ...
Compiling model layers with torch.compile...
Instantiating model and loading checkpoint took 165.87 secs
Memory stats after model init:
        GPU peak memory allocation: 34.92 GiB
        GPU peak memory reserved: 35.61 GiB
        GPU peak memory active: 34.92 GiB
Optimizer is initialized.
Compiling loss with torch.compile...
Loss is initialized.
Dataset and Sampler are initialized.
Learning rate scheduler is initialized.
 Profiling disabled.
 Profiler config after instantiation: {'enabled': False}
  0%|                                                                                                        | 0/161 [00:00<?, ?it/s]/home/pc/.conda/envs/pydarwin/lib/python3.10/site-packages/torchtune/training/_activation_offloading.py:113: UserWarning: ***** WARNING: curr_pct=65.6% > self.virtual_memory_safe_pct=60% of virtual memory used
  warn(
/home/pc/.conda/envs/pydarwin/lib/python3.10/site-packages/torchtune/training/_activation_offloading.py:113: UserWarning: ***** WARNING: curr_pct=65.6% > self.virtual_memory_safe_pct=60% of virtual memory used
  warn(
/home/pc/.conda/envs/pydarwin/lib/python3.10/site-packages/torchtune/training/_activation_offloading.py:113: UserWarning: ***** WARNING: curr_pct=65.6% > self.virtual_memory_safe_pct=60% of virtual memory used
  warn(
/home/pc/.conda/envs/pydarwin/lib/python3.10/site-packages/torchtune/training/_activation_offloading.py:113: UserWarning: ***** WARNING: curr_pct=65.7% > self.virtual_memory_safe_pct=60% of virtual memory used
  warn(
  ...
  1|157|Loss: 0.0007009985274635255:  98%|███████████████████████████████████████████████████████▉ | 158/161 [2:55:17<03:
  1|158|Loss: 0.0060004680417478085:  98%|███████████████████████████████████████████████████████▉ | 158/161 [2:55:17<03:
  1|158|Loss: 0.0060004680417478085:  99%|████████████████████████████████████████████████████████▎| 159/161 [2:56:23<02:
Saving checkpoint. This may take some time. Retrieving full model state dict...
Traceback (most recent call last):
  File "/home/pc/.conda/envs/testenv/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
    parser.run(args)
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
    args.func(args)
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
    run(args)
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/run.py", line 880, in run
    elastic_launch(
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_lora_distributed FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-04_08:47:49
  host      : dldev01.host.com
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 1194347)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1194347

Also, this is from running dmesg | tail -n 50 | grep -i "oom" to check the OOM error for the PID:

[73601.522898] Out of memory: Killed process 1194347 (python3.1) total-vm:201694896kB, anon-rss:111854380kB, file-rss:75048kB, shmem-rss:9628272kB, UID:1008 pgtables:240660kB oom_score_adj:0
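
For anyone else hitting this -9/SIGKILL: a quick back-of-envelope check (a rough sketch; the parameter count and dtype are my assumptions) of how much host RAM rank 0 needs just to materialize the gathered full state dict:

# Rough host-RAM requirement for the rank-0 "Retrieving full model state dict" step.
n_params = 70.6e9          # approx. parameter count of Llama-3.1-70B
bytes_per_param = 2        # bf16
print(f"~{n_params * bytes_per_param / 2**30:.0f} GiB")   # ~132 GiB on rank 0's CPU,
                                                          # on top of what the training
                                                          # processes already hold

If free RAM plus swap on the node is below that, the kernel OOM killer sends SIGKILL (-9) to rank 0, which matches the dmesg output above.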

I am trying to fine-tune Llama-3.1-70B with LoRA on 4 A100s (80GB VRAM each). I was able to fine-tune the smaller Llama-3.1-8B with LoRA using
tune run lora_finetune_single_device --config configs/llama_3.1_8B_lora.yaml and that ran without any errors. I am only having trouble with distributed training on the 70B model.

This is my config file:

# Config for multi-device LoRA in lora_finetune_distributed.py
# using a Llama3.1 70B model
#
# This config assumes that you've run the following command before launching
# this run:
#   tune download meta-llama/Meta-Llama-3.1-70B-Instruct --output-dir /tmp/Meta-Llama-3.1-70B-Instruct --ignore-patterns "original/consolidated*"
#
# This config needs 8 GPUs to run
#   tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_1/70B_lora

output_dir: /home/pc/pc/src/scratch/NLME-Darwin/torchtune/llama3_1_70B/lora # /tmp may be deleted by your system. Change it to your preference.

# Model Arguments
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_70b
  lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: False
  lora_rank: 8  # higher increases accuracy and memory
  lora_alpha: 16  # usually alpha=2*rank
  lora_dropout: 0.0

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /home/pc/.cache/huggingface/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b/original/tokenizer.model
  max_seq_len: null

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /home/pc/.cache/huggingface/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b/
  checkpoint_files: [
    model-00001-of-00030.safetensors,
    model-00002-of-00030.safetensors,
    model-00003-of-00030.safetensors,
    model-00004-of-00030.safetensors,
    model-00005-of-00030.safetensors,
    model-00006-of-00030.safetensors,
    model-00007-of-00030.safetensors,
    model-00008-of-00030.safetensors,
    model-00009-of-00030.safetensors,
    model-00010-of-00030.safetensors,
    model-00011-of-00030.safetensors,
    model-00012-of-00030.safetensors,
    model-00013-of-00030.safetensors,
    model-00014-of-00030.safetensors,
    model-00015-of-00030.safetensors,
    model-00016-of-00030.safetensors,
    model-00017-of-00030.safetensors,
    model-00018-of-00030.safetensors,
    model-00019-of-00030.safetensors,
    model-00020-of-00030.safetensors,
    model-00021-of-00030.safetensors,
    model-00022-of-00030.safetensors,
    model-00023-of-00030.safetensors,
    model-00024-of-00030.safetensors,
    model-00025-of-00030.safetensors,
    model-00026-of-00030.safetensors,
    model-00027-of-00030.safetensors,
    model-00028-of-00030.safetensors,
    model-00029-of-00030.safetensors,
    model-00030-of-00030.safetensors,
  ]
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: LLAMA3

resume_from_checkpoint: False
save_adapter_weights_only: True # Set to false to save the whole model + adapter merged

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: /home/pc/pc/src/scratch/NLME-Darwin/data.jsonl
  packed: False  # True increases speed
  conversation_column: messages
  conversation_style: openai

seed: null
shuffle: True
batch_size: 2

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  fused: True
  weight_decay: 0.01
  lr: 3e-4
#  offload_optimizer: True  # Moves optimizer states to CPU before saving
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss

# Training
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 1  # Use to increase effective batch size
compile: True  # torch.compile the model + loss, True increases speed + decreases memory

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: True

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: True  # True reduces memory
enable_activation_offloading: True  # True reduces memory
# custom_sharded_layers: ['tok_embeddings', 'output']  # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.
#offload_activations: True  # More aggressive offloading
#offload_parameters: True  # If running out of GPU RAM


# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False

  #Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs

  #`torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True

  #trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False

  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1

Appreciate your help with this !

pranathichunduru avatar Feb 04 '25 23:02 pranathichunduru

@joecummings, is distributed checkpointing ready for LoRA distributed?

felipemello1 avatar Feb 04 '25 23:02 felipemello1

I am using full fine-tuning and I still get this error:

[rank0]:[E227 23:07:54.322446753 ProcessGroupNCCL.cpp:629] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619725, OpType=_ALLGATHER_BASE, NumelIn=54528000, NumelOut=218112000, Timeout(ms)=1800000) ran for 1800034 milliseconds before timing out.
[rank0]:[E227 23:07:54.322713545 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 0]  failure detected by watchdog at work sequence id: 619725 PG status: last enqueued work: 619732, last completed work: 619724
[rank0]:[E227 23:07:54.322728963 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E227 23:07:54.323599416 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619727, OpType=_REDUCE_SCATTER_BASE, NumelIn=4096, NumelOut=1024, Timeout(ms)=1800000) ran for 1800060 milliseconds before timing out.
[rank3]:[E227 23:07:54.323727781 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 3]  failure detected by watchdog at work sequence id: 619727 PG status: last enqueued work: 619730, last completed work: 619726
[rank3]:[E227 23:07:54.323739223 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E227 23:07:54.330858646 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619725, OpType=_ALLGATHER_BASE, NumelIn=54528000, NumelOut=218112000, Timeout(ms)=1800000) ran for 1800061 milliseconds before timing out.
[rank2]:[E227 23:07:54.331014928 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2]  failure detected by watchdog at work sequence id: 619725 PG status: last enqueued work: 619732, last completed work: 619724
[rank2]:[E227 23:07:54.331025732 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E227 23:07:54.334413980 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619725, OpType=_ALLGATHER_BASE, NumelIn=54528000, NumelOut=218112000, Timeout(ms)=1800000) ran for 1800064 milliseconds before timing out.
[rank1]:[E227 23:07:54.334525713 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 619725 PG status: last enqueued work: 619732, last completed work: 619724
[rank1]:[E227 23:07:54.334558759 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E227 23:07:54.990792231 ProcessGroupNCCL.cpp:681] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E227 23:07:54.990834934 ProcessGroupNCCL.cpp:695] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E227 23:07:54.992160664 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619727, OpType=_REDUCE_SCATTER_BASE, NumelIn=4096, NumelOut=1024, Timeout(ms)=1800000) ran for 1800060 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x154e2830d1b6 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x154e29656c74 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x154e296587d0 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x154e296596ed in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x154e726dc5c0 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89c02 (0x154e7bbaec02 in /lib64/libc.so.6)
frame #6: <unknown function> + 0x10ec40 (0x154e7bc33c40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619727, OpType=_REDUCE_SCATTER_BASE, NumelIn=4096, NumelOut=1024, Timeout(ms)=1800000) ran for 1800060 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x154e2830d1b6 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x154e29656c74 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x154e296587d0 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x154e296596ed in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x154e726dc5c0 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89c02 (0x154e7bbaec02 in /lib64/libc.so.6)
frame #6: <unknown function> + 0x10ec40 (0x154e7bc33c40 in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x154e2830d1b6 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x154e292b46fc in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x154e726dc5c0 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x89c02 (0x154e7bbaec02 in /lib64/libc.so.6)
frame #4: <unknown function> + 0x10ec40 (0x154e7bc33c40 in /lib64/libc.so.6)

W0227 23:07:55.697000 249042 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 249078 closing signal SIGTERM
W0227 23:07:55.717000 249042 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 249079 closing signal SIGTERM
W0227 23:07:55.718000 249042 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 249080 closing signal SIGTERM
E0227 23:07:56.813000 249042 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 249081) of binary: /home/mantrik/.conda/envs/hf/bin/python3.10
Running with torchrun...
Traceback (most recent call last):
  File "/home/mantrik/.conda/envs/hf/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)
  File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
    run(args)
  File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/recipes/full_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-27_23:07:55
  host      : h001.gautschi.rcac.purdue.edu
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 249081)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 249081
============================================================

ipsitmantri avatar Feb 28 '25 16:02 ipsitmantri

same

2019211753 avatar May 03 '25 05:05 2019211753

I encountered the same WorkNCCL timeout during checkpoint saving and wanted to share my investigation process and solutions that might help others.

Problem Description

[rank13]:[E801 17:37:10.950994062 ProcessGroupNCCL.cpp:633] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3194, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.

Investigation Process

Initial Analysis:

  • Timeout occurred after all model checkpoint files were successfully saved (model-00001-of-00016.safetensors to model-00016-of-00016.safetensors)
  • Missing expected log: "Saving checkpoint took X.XX secs"

Key Code Flow Analysis: The timeout happens in the _save_checkpoint_sync() method:

# In torchtune/training/checkpointing/_checkpoint_client.py
def _save_checkpoint_sync(self, ...):
    def _save_checkpoint_helper():
        # Save model files (✅ completed successfully)
        self._get_checkpointer().save_checkpoint(...)
        
        # This log was never reached ❌
        log.info(f"Saving checkpoint took {time.perf_counter() - cp_start:.2f} secs")
    
    if is_not_distributed_checkpointer and not single_device:
        if self._is_rank_zero:
            _save_checkpoint_helper()  # Rank 0 gets stuck here
        
        torch.distributed.barrier()  # Other ranks timeout waiting here

Root Cause Identified: The bottleneck is saving recipe_state.pt, which contains:

  • Optimizer state dict (~120GB for 30B model with AdamW)
  • Training progress metadata
  • Dataloader state

For large models, serializing and writing the optimizer states can take 5-10+ minutes, causing the other ranks to time out at the barrier.
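
As a rough sanity check on that figure (a sketch; the parameter count and the state dtype are assumptions on my part), AdamW keeps two moment tensors per parameter:

# Back-of-envelope size of AdamW optimizer state for a ~30B-parameter model.
n_params = 30e9
states_per_param = 2       # exp_avg and exp_avg_sq
bytes_per_state = 2        # assuming bf16 moments; fp32 moments would double this
print(f"~{n_params * states_per_param * bytes_per_state / 1e9:.0f} GB")   # ~120 GB

Writing that much data to disk from rank 0 while the other ranks wait at the barrier can easily exceed the default collective timeout.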

Solutions

Properly extend the collective timeout

from datetime import timedelta

from torch.distributed import init_process_group

# Raise the NCCL collective timeout (the "Watchdog caught collective operation timeout" limit) to 1 hour
init_process_group(backend=self.distributed_backend, timeout=timedelta(seconds=3600))

You need to explicitly set the timeout parameter in full_finetune_distributed.py when calling torch.distributed.init_process_group.

This way, PyTorch will pass the 1-hour timeout down to the C++ layer, and NCCL collective operations will wait up to 1 hour before timing out.

Hope this helps others encountering similar issues! The key is identifying whether the timeout is during actual distributed communication or just waiting for slow I/O operations.

pengyanai avatar Aug 01 '25 10:08 pengyanai