Multi-GPU timeout on checkpoint save (WorkNCCL, Watchdog, timeout)
Hey,
thanks for providing the torchtune framework.
I have an issue with a timeout when saving a checkpoint while fine-tuning Llama 3.1 70B with LoRA on multiple GPUs.
I am tuning on an AWS EC2 instance with 8x V100 GPUs, each with 32 GB of memory.
Let me know if you need additional error traces or info.
Thanks in advance.
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
[rank4]:[E1129 11:56:57.204156533 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank5]:[E1129 11:56:57.205862232 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
[rank5]:[E1129 11:56:57.205930931 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank7]:[E1129 11:56:57.218043510 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
[rank7]:[E1129 11:56:57.218122951 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank2]:[E1129 11:56:57.234345411 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
[rank2]:[E1129 11:56:57.234404509 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank3]:[E1129 11:56:57.239953363 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
[rank3]:[E1129 11:56:57.240043594 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank6]:[E1129 11:56:57.241972126 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
[rank6]:[E1129 11:56:57.242056510 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank1]:[E1129 11:56:57.243483105 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600062 milliseconds before timing out.
[rank1]:[E1129 11:56:57.243538286 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank3]:[E1129 11:56:57.379910257 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank3]:[E1129 11:56:57.379954199 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1129 11:56:57.379960087 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[E1129 11:56:57.382595394 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7af8b44ee446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7af8697cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7af8697d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7af8697d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7af8b46555c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7af8b509ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7af8b5129c3c in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7af8b44ee446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7af8697cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7af8697d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7af8697d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7af8b46555c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7af8b509ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7af8b5129c3c in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7af8b44ee446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7af86944271b in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7af8b46555c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x9ca94 (0x7af8b509ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7af8b5129c3c in /lib/x86_64-linux-gnu/libc.so.6)
[rank5]:[E1129 11:56:57.450086063 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 5] Timeout at NCCL work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank5]:[E1129 11:56:57.450113740 ProcessGroupNCCL.cpp:630] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E1129 11:56:57.450119892 ProcessGroupNCCL.cpp:636] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[E1129 11:56:57.451711521 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7478f478d446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7478a9bcc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7478a9bd3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7478a9bd561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7478f48f45c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7478f529ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7478f5329c3c in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7478f478d446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7478a9bcc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7478a9bd3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7478a9bd561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7478f48f45c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7478f529ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7478f5329c3c in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7478f478d446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7478a984271b in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7478f48f45c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x9ca94 (0x7478f529ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7478f5329c3c in /lib/x86_64-linux-gnu/libc.so.6)
[rank6]:[E1129 11:56:57.506027320 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 6] Timeout at NCCL work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank6]:[E1129 11:56:57.506049967 ProcessGroupNCCL.cpp:630] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E1129 11:56:57.506056263 ProcessGroupNCCL.cpp:636] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[E1129 11:56:57.507603862 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e4f26b12446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7e4edbdcc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e4edbdd3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e4edbdd561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7e4f26c795c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7e4f2769ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7e4f27729c3c in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600060 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e4f26b12446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7e4edbdcc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e4edbdd3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e4edbdd561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7e4f26c795c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7e4f2769ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7e4f27729c3c in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e4f26b12446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7e4edba4271b in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7e4f26c795c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x9ca94 (0x7e4f2769ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7e4f27729c3c in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E1129 11:56:57.531282495 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank2]:[E1129 11:56:57.531302843 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1129 11:56:57.531308394 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[E1129 11:56:57.532790938 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ef41631a446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ef3cb5cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ef3cb5d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ef3cb5d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7ef4164815c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7ef416e9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7ef416f29c3c in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ef41631a446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ef3cb5cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ef3cb5d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ef3cb5d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7ef4164815c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x7ef416e9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x7ef416f29c3c in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ef41631a446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7ef3cb24271b in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7ef4164815c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x9ca94 (0x7ef416e9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7ef416f29c3c in /lib/x86_64-linux-gnu/libc.so.6)
[rank7]:[E1129 11:56:57.642513817 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 7] Timeout at NCCL work: 28956, last enqueued NCCL work: 28956, last completed NCCL work: 28955.
[rank7]:[E1129 11:56:57.642537731 ProcessGroupNCCL.cpp:630] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E1129 11:56:57.642560031 ProcessGroupNCCL.cpp:636] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[E1129 11:56:57.644117982 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x744df9fca446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x744daf3cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x744daf3d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x744daf3d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x744dfa1315c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x744dfaa9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x744dfab29c3c in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28956, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x744df9fca446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x744daf3cc772 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x744daf3d3bb3 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x744daf3d561d in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x744dfa1315c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x9ca94 (0x744dfaa9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x129c3c (0x744dfab29c3c in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x744df9fca446 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x744daf04271b in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x744dfa1315c0 in /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x9ca94 (0x744dfaa9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x744dfab29c3c in /lib/x86_64-linux-gnu/libc.so.6)
W1129 11:56:59.150000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29574 closing signal SIGTERM
W1129 11:56:59.151000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29575 closing signal SIGTERM
W1129 11:56:59.152000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29576 closing signal SIGTERM
W1129 11:56:59.153000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29578 closing signal SIGTERM
W1129 11:56:59.154000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29579 closing signal SIGTERM
W1129 11:56:59.155000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29580 closing signal SIGTERM
W1129 11:56:59.156000 29471 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 29581 closing signal SIGTERM
E1129 11:57:07.249000 29471 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 29577) of binary: /home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/bin/python
Traceback (most recent call last):
File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/bin/tune", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torchtune/_cli/tune.py", line 49, in main
parser.run(args)
File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torchtune/_cli/tune.py", line 43, in run
args.func(args)
File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torchtune/_cli/run.py", line 206, in _run_cmd
self._run_distributed(args, is_builtin=is_builtin)
File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torchtune/_cli/run.py", line 95, in _run_distributed
run(args)
File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ubuntu/projects/frameai_nlp/lib/frameai_nlp/Labs/ad_professor/models/venv3/lib/python3.12/site-packages/recipes/lora_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-29_11:56:59
host : ip-172-31-12-154.us-west-2.compute.internal
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 29577)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 29577
@albertbn, sorry that you hit this issue. We are working on implementing distributed async checkpointing, which should avoid problems like this in the future. Meanwhile, one thing you can do is modify the recipe so it does not save the recipe state.
The recipe state is about 2x larger than the model, so it can sometimes take a while to save.
TL;DR: set this to false https://github.com/pytorch/torchtune/blob/32e265d5749fd592711a03247486eafa6c898d94/recipes/full_finetune_distributed.py#L702
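(For context, the bulk of that recipe state is typically optimizer state: AdamW keeps two extra tensors, exp_avg and exp_avg_sq, per trainable parameter, which is roughly where the 2x comes from. A minimal, self-contained sketch of the ratio - toy model, not the recipe's actual code:)
import torch
from torch import nn

# Tiny stand-in model; the ratio, not the absolute size, is the point.
model = nn.Linear(4096, 4096)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(2, 4096)).sum()
loss.backward()
opt.step()  # populates exp_avg / exp_avg_sq for every trainable parameter

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
state_bytes = sum(
    t.numel() * t.element_size()
    for state in opt.state.values()
    for t in state.values()
    if torch.is_tensor(t)
)
print(f"params: {param_bytes / 2**20:.1f} MiB, optimizer state: {state_bytes / 2**20:.1f} MiB")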
Thanks @felipemello1, your suggestions helped me resolve the issue. Here are the TL;DR and details:
TL;DR
In a local copy of the recipe (from this) I edited the init_process_group(...) call near the end of the file to read:
from datetime import timedelta
...
timeout_long_nccl = timedelta(seconds=6000)  # 100 minutes
init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl", timeout=timeout_long_nccl)
which extends the NCCL timeout 10x - to 100 minutes instead of the default 10 minutes.
...
INFO:torchtune.utils._logging:Saving checkpoint took 965.61 secs
...
It takes approximately 16 minutes to save the checkpoint for Llama 3.1 70B, since the LoRA adapters are merged into the full weights. The times reported are on 8x V100 (32 GB) GPUs, with plenty of CPU cores and RAM.
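(For intuition on why the save is slow: at checkpoint time each LoRA-adapted linear layer is merged back into the frozen base weight, roughly W' = W + (alpha / r) * B @ A, over all 70B parameters. A minimal single-layer sketch with illustrative dimensions, not torchtune's actual code:)
import torch

# Illustrative dimensions only; r and alpha are the usual LoRA hyperparameters.
out_dim, in_dim, r, alpha = 8192, 8192, 16, 32

W = torch.randn(out_dim, in_dim, dtype=torch.bfloat16)  # frozen base weight
A = torch.randn(r, in_dim, dtype=torch.bfloat16)         # LoRA "down" matrix
B = torch.zeros(out_dim, r, dtype=torch.bfloat16)        # LoRA "up" matrix

# Merged weight that ends up in the saved checkpoint.
W_merged = W + (alpha / r) * (B @ A)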
===
Details
- I copied this to a local file train_lora_distr.py at the same level as my custom config
- I copied the 70B LoRA config with tune cp llama3_1/70B_lora ./llama70b-instruct-config.yaml as my custom config
- I followed the instructions for downloading the model from the config instructions and set up the desired flags, model, tokenizer, directories, epochs, etc. in ./llama70b-instruct-config.yaml
- Eventually I ran tune run --nproc_per_node 8 train_lora_distr --config ./llama70b-instruct-config.yaml which completed successfully
Note that tune run ... doesn't support the ./train_lora_distr.py syntax but expects just the module name, train_lora_distr.
Cheers
Nice! I am glad that you figured it out. Thanks for the detailed steps. I will check if we need to fix those in the upcoming distributed ckpt PR.
fyi, you may need to hack your ckpt a bit if you are going to use it with HF and vllm. We have a PR that should land in the next couple of days that will fix it. More info here if you need the fix now: https://github.com/pytorch/torchtune/issues/2048
@albertbn
cc: @joecummings fyi
Note that tune run ... doesn't support the ./train_lora_distr.py syntax but expects just the module name, train_lora_distr.
This is supposed to work. I think there's a bug with the way we process the path. If you do "train_lora_distr.py" instead it should work, omitting the "./". We support "/" but the "./" pattern is causing the break. We'll fix that going forward.
I've been having a frustrating time trying to use my trained model for inference. Skipping right to the question: I really liked the ideas implemented in torchtune's generate.py script. I've used it successfully with a tuned 8B Llama on a single GPU. For the 70B I obviously need distributed inference. The gpt-fast repo mentioned in the torchtune examples seems to have low maintenance activity, which is a pity since I'd really prefer native torch inference, following their blogs. gpt-fast fails with a model params mismatch - I've opened an issue there.
Do you have any leads on how to modify torchtune's generate.py to support distributed inference? Speed optimization is less of an issue for me currently - I just want to generate some text to evaluate the quality of the tuned model.
Thanks
@albertbn, would it be worth trying to run your inference with vLLM? We will have some more documentation on it soon. I don't think that hacking generate.py to run distributed will be a low lift. @joecummings, please correct me if I am wrong.
I tried, not very hard. I haven’t explored their code but the first error I got was when trying to provide a local path for a checkpoint and they expected only a Huggingface path, which is ridiculous
@albertbn I tried it locally and it worked, but I didn't try distributed. Take a look at this PR that I am trying to land soon: https://github.com/pytorch/torchtune/issues/2048
from vllm import LLM, SamplingParams
def print_outputs(outputs):
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
        print("-" * 80)
# llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
llm = LLM(
model="/tmp/llama_3_2_1b/lora_single_device/base_model",
load_format="safetensors",
kv_cache_dtype="auto",
)
sampling_params = SamplingParams(max_tokens=16, temperature=0.5)
# In this script, we demonstrate how to pass input to the chat method:
conversation = [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I assist you today?"},
{
"role": "user",
"content": "Write an essay about the importance of higher education.",
},
]
outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)
print_outputs(outputs)
# You can run batch inference with llm.chat API
conversation = [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I assist you today?"},
{
"role": "user",
"content": "Write an essay about the importance of higher education.",
},
]
conversations = [conversation for _ in range(3)]
# We turn on tqdm progress bar to verify it's indeed running batch inference
outputs = llm.chat(
messages=conversations, sampling_params=sampling_params, use_tqdm=True
)
print_outputs(outputs)
Hey, thanks for the suggestion and sorry for the late reply.
I've made some progress with the distributed inference, but still haven't been able to generate text:
- As a reminder, I've used torchtune successfully to LoRA-tune Llama 70B instruct (by solving the timeout issue on save from above).
- I've then used a manual script to convert the .pt weight files to .safetensors. If needed I can include the script.
- I've also copied some JSON config files from the original Llama model directory on HF. I had to change the dtype in config.json from bfloat16 to float16, since vLLM complains that bfloat16 is not supported on an older GPU (compute capability 7; I have access to a machine with 8x V100 32 GB GPUs). A minimal sketch of this config edit follows the list.
- I then ran the vLLM script further below; the model loads but then fails at generation - could be due to using the older V100 GPUs, which vLLM doesn't fully support - I really have no idea...
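(As an aside, the config.json dtype change mentioned above is a one-key edit; a minimal sketch, with a hypothetical path - adjust to your checkpoint directory:)
import json

cfg_path = "/path/to/trained/llama_31_70b_instruct/config.json"  # hypothetical path

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["torch_dtype"] = "float16"  # V100s (compute capability 7.0) lack bfloat16 support

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)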
from vllm import LLM, SamplingParams
PATH = "/home/ubuntu/projects/models"
checkpoint_dir = f'{PATH}/trained/llama_31_70b_instruct'
tensor_parallel_size = 8
llm = LLM(
model=checkpoint_dir,
load_format="safetensors",
kv_cache_dtype="auto",
tensor_parallel_size=tensor_parallel_size,
disable_custom_all_reduce=True
)
conversation = [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I assist you today?"},
{
"role": "user",
"content": "Write an essay about the importance of higher education.",
},
]
sampling_params = SamplingParams(max_tokens=16, temperature=0.6)
outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
python: /project/lib/Analysis/Allocation.cpp:47: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
/home/ubuntu/.pyenv/versions/3.12.4/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Aborted (core dumped)
@albertbn From a quick search, it looks like you might need to turn off prefill chunking because of an issue on the vLLM side: https://github.com/vllm-project/vllm/issues/6723.
Confirming that @joecummings's suggestion works.
I had to add an additional flag, max_model_len (in addition to enable_chunked_prefill).
Including the working script for the LoRA-tuned Llama 70B instruct from torchtune, with the HF .pt weights converted to .safetensors. Also enclosing the conversion script (written with ChatGPT's help), which should be run before the vLLM one.
Thanks again for all the help.
PS: If I manage to get vLLM to run with bfloat16 instead of float16 on V100 GPUs, I'll report back.
conversion to .safetensors script
import os
import json
import shutil
import torch
from safetensors.torch import save_file
PATH = os.path.dirname(os.path.abspath(__file__))
checkpoint_dir = 'path/to/trained/model/with/.pt/weights/from/torchtune'
original_llama_weights_path = f'{PATH}/llama_31_70b_instruct'
# List of files to copy
files_to_copy = [
"special_tokens_map.json",
"tokenizer.json",
"tokenizer_config.json",
# "config.json",
"generation_config.json", # Optional but recommended
"README.md", # Optional but useful
"LICENSE" # Optional but useful
]
# Output index file!
index_file = os.path.join(checkpoint_dir, "model.safetensors.index.json")
# Initialize the output dictionary
output_dict = {"weight_map": {}, "metadata": {"total_size": 0}}
# Iterate over all safetensors files
total_size = 0
for i in range(1, 31):
    # Convert from .pt to .safetensors
    pt_file = f"hf_model_{str(i).zfill(4)}_0.pt"
    safetensor_file = f"hf_model_{str(i).zfill(4)}_0.safetensors"
    pt_path = os.path.join(checkpoint_dir, pt_file)
    safetensor_path = os.path.join(checkpoint_dir, safetensor_file)  # in same dir - hope error shouting here is bearable
    print(f"Converting {pt_path} to {safetensor_path}...")
    # Load the state dictionary from the .pt file
    state_dict = torch.load(pt_path, map_location="cpu", weights_only=True)
    # Save the state dictionary to the .safetensors file
    save_file(state_dict, safetensor_path)
    # Add key mappings to the weight_map
    for key in state_dict.keys():
        output_dict["weight_map"][key] = safetensor_file
    # Calculate the file size and add it to the total
    total_size += os.path.getsize(safetensor_path)

# Add the total size to the metadata
output_dict["metadata"]["total_size"] = total_size
print(f"Total size: {total_size} bytes")

# Save the index.json file
with open(index_file, "w") as f:
    json.dump(output_dict, f, indent=2)
print(f"Index file created: {index_file}")

# ---
# DONE - cp all config stuff from models/
# Copy the files
for file_name in files_to_copy:
    src_path = os.path.join(original_llama_weights_path, file_name)
    dest_path = os.path.join(checkpoint_dir, file_name)
    if os.path.exists(src_path):
        shutil.copy(src_path, dest_path)
        print(f"Copied: {src_path} -> {dest_path}")
    else:
        print(f"File not found: {src_path}")

print("File copy operation completed.")
vllm distributed inference
from vllm import LLM, SamplingParams
PATH = "/home/ubuntu/projects/models"
checkpoint_dir = f'{PATH}/trained/llama_31_70b_instruct'
def print_outputs(outputs):
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
        print("-" * 80)
tensor_parallel_size = 8
llm = LLM(
model=checkpoint_dir,
load_format="safetensors",
kv_cache_dtype="auto",
tensor_parallel_size=tensor_parallel_size,
disable_custom_all_reduce=True,
enable_chunked_prefill=False,
max_model_len=2**14
)
conversation = [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I assist you today?"},
{
"role": "user",
"content": "Write an essay about the importance of higher education.",
},
]
sampling_params = SamplingParams(max_tokens=512, temperature=0.6)
outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)
print_outputs(outputs)
@albertbn, this is super nice! :)
To be clear, if you rebase to main or install the nightlies, you shouldn't have to convert anything from .pt to .safetensors anymore. The adapters/model are automatically saved as .safetensors.
Hey, any leads on how to load Meta Llama 3.1 405B (the original one from HF) with vLLM?
I am getting GPU out-of-memory errors. The data type in the original config is bfloat16. I have an upgraded machine with 8x H100 GPUs, each with 80 GB of memory. The training works fine, loading the model and saving only the adapter, which I guess I should use later as vLLM recommends here.
thanks in advance
Update: following this issue, I was able to load the FP8 meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on 8x H100 GPUs with vLLM.
So it's not quite clear to me what the recipe example implies - it is for the regular meta-llama/Meta-Llama-3.1-405B-Instruct weights, which I've failed to load on said machine with vLLM, and I haven't found any solution that could do it without CPU offloading/mapping.
Can I adapt the above recipe to QLoRA-tune meta-llama/Meta-Llama-3.1-405B-Instruct-FP8?
thanks
Hey @albertbn, I am having a bit of trouble understanding exactly what you are asking. Let me repeat back what I got:
- You trained 405B QLoRA using torchtune without problems. Your adapter is in bf16, and the base model was quantized to nf4 during training. You saved only the adapters.
- You were able to successfully load the 405B as fp8, and you are also able to load the adapter.
- The question is whether it's OK to run the model in fp8.
Is that what you are asking?
If so, I believe there are many articles showing that, for inference, fp8 shows no performance degradation, AFAIK. But maybe you could evaluate it and see if the performance is what you expected.
Hey, sorry for the confusion. I trained 405B QLoRA using the full 405B instruct model from Meta and saved just the adapters. Exactly as in the 405B config provided by torchtune, no changes.
Then for inference with vLLM, I was unable to load the original Meta 405B instruct model, receiving OOM. The way I understand it, I should load the original weights with vLLM and then add the adapter as shown in their documentation. I've also tried loading the original Meta weights as fp8, which failed as well.
The only model I managed to load with vLLM was a 405B instruct FP8 original version by Meta, downloaded separately from Hugging Face.
So what I am asking is: can I change the config to QLoRA-tune the 405B instruct FP8 from Meta instead of their 405B instruct as shown in the torchtune config?
To avoid any confusion, I am further clarifying in brief:
- meta-llama/Llama-3.1-405B-Instruct - fine-tuned with QLoRA saving adapters only, as per the original torchtune 405B config. Unable to load the weights with vLLM in any way - hitting OOM.
- meta-llama/Llama-3.1-405B-Instruct-FP8 - able to load with vLLM. Question: can I change the 405B config to tune with QLoRA on that one?
@albertbn, I am not sure. @ebsmothers, do you know if we can easily replace the quantize_base in QLoRA with float8 instead of nf4?
I have my doubts, though, about how much extra performance you can get from it. It might not be worth it. Did you find an article showing someone fine-tuning with an fp8 base and it being better?
@albertbn, related: https://github.com/pytorch/torchtune/issues/2201. Let's maybe keep the conversation there and close this issue, since it's not about NCCL anymore. What do you think?
@felipemello1
do you know if we can easily replace the quantize_base in QLoRA with float8 instead of nf4?
From what I know, existing solutions in torchao do not support this use case. torchao.float8 does not support keeping only the FP8 weight. There is float8_dynamic_activation_float8_weight, which creates an AffineQuantizedTensor subclass, but it doesn't define backprop, so we can't train with it.
It should be straightforward to implement; it's just that no one has created it in torchao yet.
I am still seeing this issue with distributed LoRA on Llama-3.1-70B. I tried following https://github.com/pytorch/torchtune/issues/2093#issuecomment-2509733176 to increase the timeout. I do have sufficient CPU RAM to save the model checkpoints. I am unsure how to debug this and would appreciate any help.
Here is the traceback of the error.
Traceback (most recent call last):
File "/home/pc/.conda/envs/testenv/bin/tune", line 8, in <module>
sys.exit(main())
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
parser.run(args)
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
args.func(args)
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
self._run_distributed(args, is_builtin=is_builtin)
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
run(args)
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/run.py", line 880, in run
elastic_launch(
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_lora_distributed FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-04_08:47:49
host : dldev01.host.com
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 216824)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 216824
@pranathichunduru , can you please share a longer log?
This is from the start of the process:
Setting manual seed to local seed 2241663802. Local seed is seed + rank = 2241663802 + 0
Writing logs to /home/pc/pc/src/scratch/NLME-Darwin/torchtune/llama3_1_70B/lora/logs/log_1738697935.txt
FSDP is enabled. Instantiating model and loading checkpoint on Rank 0 ...
Compiling model layers with torch.compile...
Instantiating model and loading checkpoint took 165.87 secs
Memory stats after model init:
GPU peak memory allocation: 34.92 GiB
GPU peak memory reserved: 35.61 GiB
GPU peak memory active: 34.92 GiB
Optimizer is initialized.
Compiling loss with torch.compile...
Loss is initialized.
Dataset and Sampler are initialized.
Learning rate scheduler is initialized.
Profiling disabled.
Profiler config after instantiation: {'enabled': False}
0%| | 0/161 [00:00<?, ?it/s]/home/pc/.conda/envs/pydarwin/lib/python3.10/site-packages/torchtune/training/_activation_offloading.py:113: UserWarning: ***** WARNING: curr_pct=65.6% > self.virtual_memory_safe_pct=60% of virtual memory used
warn(
/home/pc/.conda/envs/pydarwin/lib/python3.10/site-packages/torchtune/training/_activation_offloading.py:113: UserWarning: ***** WARNING: curr_pct=65.6% > self.virtual_memory_safe_pct=60% of virtual memory used
warn(
/home/pc/.conda/envs/pydarwin/lib/python3.10/site-packages/torchtune/training/_activation_offloading.py:113: UserWarning: ***** WARNING: curr_pct=65.6% > self.virtual_memory_safe_pct=60% of virtual memory used
warn(
/home/pc/.conda/envs/pydarwin/lib/python3.10/site-packages/torchtune/training/_activation_offloading.py:113: UserWarning: ***** WARNING: curr_pct=65.7% > self.virtual_memory_safe_pct=60% of virtual memory used
warn(
...
1|157|Loss: 0.0007009985274635255:  98%| 158/161 [2:55:17<03:…]
1|158|Loss: 0.0060004680417478085:  98%| 158/161 [2:55:17<03:…]
1|158|Loss: 0.0060004680417478085:  99%| 159/161 [2:56:23<02:…]
Saving checkpoint. This may take some time. Retrieving full model state dict...
Traceback (most recent call last):
File "/home/pc/.conda/envs/testenv/bin/tune", line 8, in <module>
sys.exit(main())
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
parser.run(args)
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
args.func(args)
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
self._run_distributed(args, is_builtin=is_builtin)
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
run(args)
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/run.py", line 880, in run
elastic_launch(
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/pc/.conda/envs/testenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_lora_distributed FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-04_08:47:49
host : dldev01.host.com
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1194347)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1194347
Also, this is the output of running dmesg | tail -n 50 | grep -i "oom" to check for an OOM kill of that PID:
[73601.522898] Out of memory: Killed process 1194347 (python3.1) total-vm:201694896kB, anon-rss:111854380kB, file-rss:75048kB, shmem-rss:9628272kB, UID:1008 pgtables:240660kB oom_score_adj:0
I am trying to finetune the Llama-3.1-70B model with LoRA on 4 A100s (80 GB VRAM each). I was able to finetune the smaller Llama-3.1-8B with LoRA using
tune run lora_finetune_single_device --config configs/llama_3.1_8B_lora.yaml
and that ran fine without any errors. I am having trouble with distributed training of the 70B model.
This is my config file:
# Config for multi-device LoRA in lora_finetune_distributed.py
# using a Llama3.1 70B model
#
# This config assumes that you've run the following command before launching
# this run:
# tune download meta-llama/Meta-Llama-3.1-70B-Instruct --output-dir /tmp/Meta-Llama-3.1-70B-Instruct --ignore-patterns "original/consolidated*"
#
# This config needs 8 GPUs to run
# tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_1/70B_lora
output_dir: /home/pc/pc/src/scratch/NLME-Darwin/torchtune/llama3_1_70B/lora # /tmp may be deleted by your system. Change it to your preference.
# Model Arguments
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_70b
  lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: False
  lora_rank: 8  # higher increases accuracy and memory
  lora_alpha: 16  # usually alpha=2*rank
  lora_dropout: 0.0

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /home/pc/.cache/huggingface/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b/original/tokenizer.model
  max_seq_len: null

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /home/pc/.cache/huggingface/hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/1605565b47bb9346c5515c34102e054115b4f98b/
  checkpoint_files: [
    model-00001-of-00030.safetensors,
    model-00002-of-00030.safetensors,
    model-00003-of-00030.safetensors,
    model-00004-of-00030.safetensors,
    model-00005-of-00030.safetensors,
    model-00006-of-00030.safetensors,
    model-00007-of-00030.safetensors,
    model-00008-of-00030.safetensors,
    model-00009-of-00030.safetensors,
    model-00010-of-00030.safetensors,
    model-00011-of-00030.safetensors,
    model-00012-of-00030.safetensors,
    model-00013-of-00030.safetensors,
    model-00014-of-00030.safetensors,
    model-00015-of-00030.safetensors,
    model-00016-of-00030.safetensors,
    model-00017-of-00030.safetensors,
    model-00018-of-00030.safetensors,
    model-00019-of-00030.safetensors,
    model-00020-of-00030.safetensors,
    model-00021-of-00030.safetensors,
    model-00022-of-00030.safetensors,
    model-00023-of-00030.safetensors,
    model-00024-of-00030.safetensors,
    model-00025-of-00030.safetensors,
    model-00026-of-00030.safetensors,
    model-00027-of-00030.safetensors,
    model-00028-of-00030.safetensors,
    model-00029-of-00030.safetensors,
    model-00030-of-00030.safetensors,
  ]
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: LLAMA3
resume_from_checkpoint: False
save_adapter_weights_only: True  # Set to false to save the whole model + adapter merged

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: /home/pc/pc/src/scratch/NLME-Darwin/data.jsonl
  packed: False  # True increases speed
  conversation_column: messages
  conversation_style: openai
seed: null
shuffle: True
batch_size: 2

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  fused: True
  weight_decay: 0.01
  lr: 3e-4
  # offload_optimizer: True  # Moves optimizer states to CPU before saving
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss

# Training
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 1  # Use to increase effective batch size
compile: True  # torch.compile the model + loss, True increases speed + decreases memory

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: True

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: True  # True reduces memory
enable_activation_offloading: True  # True reduces memory
# custom_sharded_layers: ['tok_embeddings', 'output']  # Layers to shard separately (useful for large vocab size models). Lower memory, but lower speed.
# offload_activations: True  # More aggressive offloading
# offload_parameters: True  # If running out of GPU RAM

# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False
  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs
  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True
  # Trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False
  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1
Appreciate your help with this!
@joecummings, is distributed checkpointing ready for the LoRA distributed recipe?
I am using full fine-tuning and I still get this error:
[rank0]:[E227 23:07:54.322446753 ProcessGroupNCCL.cpp:629] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619725, OpType=_ALLGATHER_BASE, NumelIn=54528000, NumelOut=218112000, Timeout(ms)=1800000) ran for 1800034 milliseconds before timing out.
[rank0]:[E227 23:07:54.322713545 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 0] failure detected by watchdog at work sequence id: 619725 PG status: last enqueued work: 619732, last completed work: 619724
[rank0]:[E227 23:07:54.322728963 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E227 23:07:54.323599416 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619727, OpType=_REDUCE_SCATTER_BASE, NumelIn=4096, NumelOut=1024, Timeout(ms)=1800000) ran for 1800060 milliseconds before timing out.
[rank3]:[E227 23:07:54.323727781 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 619727 PG status: last enqueued work: 619730, last completed work: 619726
[rank3]:[E227 23:07:54.323739223 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E227 23:07:54.330858646 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619725, OpType=_ALLGATHER_BASE, NumelIn=54528000, NumelOut=218112000, Timeout(ms)=1800000) ran for 1800061 milliseconds before timing out.
[rank2]:[E227 23:07:54.331014928 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2] failure detected by watchdog at work sequence id: 619725 PG status: last enqueued work: 619732, last completed work: 619724
[rank2]:[E227 23:07:54.331025732 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E227 23:07:54.334413980 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619725, OpType=_ALLGATHER_BASE, NumelIn=54528000, NumelOut=218112000, Timeout(ms)=1800000) ran for 1800064 milliseconds before timing out.
[rank1]:[E227 23:07:54.334525713 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 619725 PG status: last enqueued work: 619732, last completed work: 619724
[rank1]:[E227 23:07:54.334558759 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E227 23:07:54.990792231 ProcessGroupNCCL.cpp:681] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E227 23:07:54.990834934 ProcessGroupNCCL.cpp:695] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E227 23:07:54.992160664 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619727, OpType=_REDUCE_SCATTER_BASE, NumelIn=4096, NumelOut=1024, Timeout(ms)=1800000) ran for 1800060 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x154e2830d1b6 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x154e29656c74 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x154e296587d0 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x154e296596ed in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x154e726dc5c0 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89c02 (0x154e7bbaec02 in /lib64/libc.so.6)
frame #6: <unknown function> + 0x10ec40 (0x154e7bc33c40 in /lib64/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=619727, OpType=_REDUCE_SCATTER_BASE, NumelIn=4096, NumelOut=1024, Timeout(ms)=1800000) ran for 1800060 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x154e2830d1b6 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x154e29656c74 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x154e296587d0 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x154e296596ed in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x154e726dc5c0 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x89c02 (0x154e7bbaec02 in /lib64/libc.so.6)
frame #6: <unknown function> + 0x10ec40 (0x154e7bc33c40 in /lib64/libc.so.6)
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x154e2830d1b6 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x154e292b46fc in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x154e726dc5c0 in /home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x89c02 (0x154e7bbaec02 in /lib64/libc.so.6)
frame #4: <unknown function> + 0x10ec40 (0x154e7bc33c40 in /lib64/libc.so.6)
W0227 23:07:55.697000 249042 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 249078 closing signal SIGTERM
W0227 23:07:55.717000 249042 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 249079 closing signal SIGTERM
W0227 23:07:55.718000 249042 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 249080 closing signal SIGTERM
E0227 23:07:56.813000 249042 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 249081) of binary: /home/mantrik/.conda/envs/hf/bin/python3.10
Running with torchrun...
Traceback (most recent call last):
File "/home/mantrik/.conda/envs/hf/bin/tune", line 8, in <module>
sys.exit(main())
File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 49, in main
parser.run(args)
File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 43, in run
args.func(args)
File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
self._run_distributed(args, is_builtin=is_builtin)
File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
run(args)
File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/mantrik/.conda/envs/hf/lib/python3.10/site-packages/recipes/full_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-27_23:07:55
host : h001.gautschi.rcac.purdue.edu
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 249081)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 249081
============================================================
Same issue here.
I encountered the same WorkNCCL timeout during checkpoint saving and wanted to share my investigation process and solutions that might help others.
Problem Description
[rank13]:[E801 17:37:10.950994062 ProcessGroupNCCL.cpp:633] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3194, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
Investigation Process
Initial Analysis:
- Timeout occurred after all model checkpoint files were successfully saved (model-00001-of-00016.safetensors to model-00016-of-00016.safetensors)
- Missing expected log: "Saving checkpoint took X.XX secs"
Key Code Flow Analysis:
The timeout happens in the _save_checkpoint_sync() method:
# In torchtune/training/checkpointing/_checkpoint_client.py
def _save_checkpoint_sync(self, ...):
    def _save_checkpoint_helper():
        # Save model files (✅ completed successfully)
        self._get_checkpointer().save_checkpoint(...)
        # This log was never reached ❌
        log.info(f"Saving checkpoint took {time.perf_counter() - cp_start:.2f} secs")

    if is_not_distributed_checkpointer and not single_device:
        if self._is_rank_zero:
            _save_checkpoint_helper()  # Rank 0 gets stuck here
        torch.distributed.barrier()  # Other ranks time out waiting here
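The pattern can be reproduced in isolation. Below is a minimal sketch (not torchtune code; it assumes at least 2 GPUs, a torchrun launch, and an illustrative 30-second timeout) where rank 0 simulates a slow save while the other rank blocks on the barrier until the NCCL watchdog fires:
# Minimal reproducer of the stuck-barrier pattern (illustrative only).
# Launch with: torchrun --nproc_per_node 2 repro_barrier_timeout.py  (needs >= 2 GPUs)
import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist

def main() -> None:
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # Deliberately short timeout so the watchdog fires quickly in this demo.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=30))

    if rank == 0:
        time.sleep(120)  # stands in for slow checkpoint serialization on rank 0
    dist.barrier()       # the non-zero rank hits the watchdog timeout here
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
The barrier is implemented over a 1-element allreduce, which is why the original logs show ALLREDUCE with NumelIn=1, NumelOut=1 timing out.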
Root Cause Identified:
The bottleneck is saving recipe_state.pt, which contains:
- Optimizer state dict (~120 GB for a 30B model with AdamW)
- Training progress metadata
- Dataloader state
For large models, serializing and writing the optimizer states can take 5-10+ minutes, causing the other ranks to time out at the barrier.
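For intuition on why recipe_state.pt gets that large, a rough back-of-the-envelope check (assuming the AdamW states are stored in bf16; fp32 states would roughly double it):
# Rough size check for the ~120 GB figure above (bf16 optimizer states assumed).
n_params = 30e9        # 30B-parameter model
states_per_param = 2   # AdamW keeps exp_avg and exp_avg_sq per parameter
bytes_per_value = 2    # bf16
print(f"{n_params * states_per_param * bytes_per_value / 1e9:.0f} GB")  # -> 120 GB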
Solutions
Properly extend the collective timeout
from datetime import timedelta
from torch.distributed import init_process_group

# Raise the ProcessGroupNCCL watchdog's collective-operation timeout to 1 hour
init_process_group(backend=self.distributed_backend, timeout=timedelta(seconds=3600))
You need to explicitly set the timeout parameter in full_finetune_distributed.py when calling torch.distributed.init_process_group. PyTorch then passes the 1-hour timeout down to the C++ layer, and NCCL collective operations will wait up to an hour before the watchdog fires.
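If you prefer not to hard-code the value, the timeout could also be threaded through the recipe config; a rough sketch (the nccl_timeout_s key and the helper below are hypothetical, not an existing torchtune option):
from datetime import timedelta

import torch.distributed as dist

def init_distributed_with_timeout(cfg) -> None:
    # Hypothetical config key; falls back to 1 hour when not set.
    timeout_s = cfg.get("nccl_timeout_s", 3600)
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=timeout_s))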
Hope this helps others encountering similar issues! The key is identifying whether the timeout is during actual distributed communication or just waiting for slow I/O operations.