
Watchdog caught collective operation timeout: WorkNCCL

jinxixiang opened this issue 3 years ago · 3 comments

Thank you for your contribution to this seminal work!

I tried to pretrain a model from scratch, and I prepared the dataset following the examples you provided. Specifically, I collected the CC12M dataset and split it into multiple tsv files. Each tsv file is iterated over in one epoch.

When I run it on a single machine (an 8×V100 server), everything goes just fine. However, I run into a timeout issue when I switch to 8 nodes of 8×V100 servers with FSDP. The logs are attached below.

As you can see, the code gets stuck at epoch 2. Considering that the tsv files work well on a single-node server, I can hardly figure out what's wrong with the 8-node training.

Any help would be appreciated!

`{"@timestamp":"2022-10-14 21:12:36.194","@message":"2022-10-14 21:12:36 - trainer.py[line:703] - INFO: begin training epoch 2"} {"@timestamp":"2022-10-14 21:12:36.195","@message":"2022-10-14 21:12:36 - train.py[line:310] - INFO: Start iterating over samples"} {"@timestamp":"2022-10-14 21:15:52.756","@message":"2022-10-14 21:15:52 - progress_bar.py[line:272] - INFO: epoch 002: 8 / 139 loss=5.45, loss_v1=5.256, loss_v2=15.13, nll_loss=5.451, ntokens=137705, nsentences=5120, sample_size=22600.1, sample_size_v1=4501.9, sample_size_v2=15719.2, ppl=43.73, wps=4704.7, ups=0.03, wpb=137705, bsz=5120, num_updates=140, lr=2.0438e-05, gnorm=1.775, clip=0, loss_scale=4, train_wall=249, gb_free=11.6, wall=3925"}

{"@timestamp":"2022-10-14 21:20:01.004","@message":"2022-10-14 21:20:01 - progress_bar.py[line:272] - INFO: epoch 002: 18 / 139 loss=5.437, loss_v1=0, loss_v2=0, nll_loss=5.436, ntokens=140527, nsentences=5120, sample_size=22531.9, sample_size_v1=0, sample_size_v2=0, ppl=43.31, wps=5660.7, ups=0.04, wpb=140527, bsz=5120, num_updates=150, lr=2.18978e-05, gnorm=1.625, clip=0, loss_scale=4, train_wall=248, gb_free=11.5, wall=4173"}

{"@timestamp":"2022-10-14 21:24:09.351","@message":"2022-10-14 21:24:09 - progress_bar.py[line:272] - INFO: epoch 002: 28 / 139 loss=5.412, loss_v1=0, loss_v2=0, nll_loss=5.412, ntokens=140526, nsentences=5120, sample_size=22575.4, sample_size_v1=0, sample_size_v2=0, ppl=42.58, wps=5658.5, ups=0.04, wpb=140526, bsz=5120, num_updates=160, lr=2.33577e-05, gnorm=1.542, clip=0, loss_scale=4, train_wall=248, gb_free=11.3, wall=4422"}

{"@timestamp":"2022-10-14 21:28:20.877","@message":"2022-10-14 21:28:20 - progress_bar.py[line:272] - INFO: epoch 002: 38 / 139 loss=5.421, loss_v1=0, loss_v2=0, nll_loss=5.421, ntokens=138168, nsentences=5120, sample_size=22479.5, sample_size_v1=0, sample_size_v2=0, ppl=42.85, wps=5493.2, ups=0.04, wpb=138168, bsz=5120, num_updates=170, lr=2.48175e-05, gnorm=1.693, clip=0, loss_scale=4, train_wall=251, gb_free=10.7, wall=4673"}

{"@timestamp":"2022-10-14 21:32:31.029","@message":"2022-10-14 21:32:31 - progress_bar.py[line:272] - INFO: epoch 002: 48 / 139 loss=5.376, loss_v1=0, loss_v2=0, nll_loss=5.375, ntokens=138904, nsentences=5120, sample_size=22302, sample_size_v1=0, sample_size_v2=0, ppl=41.51, wps=5552.8, ups=0.04, wpb=138904, bsz=5120, num_updates=180, lr=2.62774e-05, gnorm=1.704, clip=0, loss_scale=8, train_wall=250, gb_free=11, wall=4923"}

{"@timestamp":"2022-10-14 21:36:35.982","@message":"2022-10-14 21:36:35 - progress_bar.py[line:272] - INFO: epoch 002: 58 / 139 loss=5.351, loss_v1=0, loss_v2=0, nll_loss=5.351, ntokens=138445, nsentences=5120, sample_size=22308.6, sample_size_v1=0, sample_size_v2=0, ppl=40.82, wps=5651.9, ups=0.04, wpb=138445, bsz=5120, num_updates=190, lr=2.77372e-05, gnorm=1.589, clip=0, loss_scale=8, train_wall=245, gb_free=12.4, wall=5168"}

{"@timestamp":"2022-10-14 21:40:41.163","@message":"2022-10-14 21:40:41 - progress_bar.py[line:272] - INFO: epoch 002: 68 / 139 loss=5.381, loss_v1=0, loss_v2=0, nll_loss=5.381, ntokens=139566, nsentences=5120, sample_size=22367, sample_size_v1=0, sample_size_v2=0, ppl=41.66, wps=5692.4, ups=0.04, wpb=139566, bsz=5120, num_updates=200, lr=2.91971e-05, gnorm=1.687, clip=0, loss_scale=8, train_wall=245, gb_free=11.1, wall=5414"}

{"@timestamp":"2022-10-14 22:12:50.466","@message":"[E ProcessGroupNCCL.cpp:719] [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9473, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808669 milliseconds before timing out."}

{"@timestamp":"2022-10-14 22:12:50.466","@message":"[E ProcessGroupNCCL.cpp:719] [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9473, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808671 milliseconds before timing out."}`

Env: torch 1.11.0+cu113, fairscale 0.4.10, Python 3.8.13
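For reference, the 30-minute limit in `Timeout(ms)=1800000` is the default NCCL process-group timeout. One way to check whether a rank is merely slow (e.g. a long data-loading stall at the epoch boundary) rather than truly hung is to raise that limit and turn on NCCL's error/debug reporting. A minimal sketch, not the OFA/fairseq code path (fairseq wraps distributed init, so the exact hook point may differ); the values are illustrative:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Fail fast with a traceback instead of hanging until the watchdog kills the run.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
# Verbose NCCL logging; helps identify which rank or network interface stalls.
os.environ["NCCL_DEBUG"] = "INFO"

# The default NCCL timeout is 30 minutes, i.e. the Timeout(ms)=1800000 in the logs
# above; raising it only helps if a rank is slow rather than genuinely stuck.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```

If the run still hits the larger timeout, some rank is genuinely stuck rather than slow.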

jinxixiang avatar Oct 18 '22 12:10 jinxixiang

I encountered the same problem when training on a single 8×V100 node with our own datasets. The code and environment are unchanged and worked fine until yesterday. However, the problem now occurs after 6 to 9 hours of training (after tens of epochs, at a random point).

```
[E ProcessGroupNCCL.cpp:737] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=469508, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802212 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=469508, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802211 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=469508, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802212 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=469508, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802210 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=469508, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802220 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=469508, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802220 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=469508, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802209 milliseconds before timing out.
```
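In case it helps narrow this down: when the watchdog fires, the interesting rank is usually the one *missing* from the ALLREDUCE (often a worker stuck in data loading or checkpoint I/O), and the only reliable way to find it is to look at its stacks. A hypothetical helper, not part of OFA, that lets any rank dump the stacks of all its Python threads on demand (e.g. `kill -USR1 <pid>` from another shell):

```python
import faulthandler
import os
import signal

# Keep the file object alive for the lifetime of the process; faulthandler writes
# to its file descriptor whenever the registered signal arrives.
_trace_file = open(f"/tmp/stacks_rank{os.environ.get('RANK', '0')}.log", "a")
faulthandler.register(signal.SIGUSR1, file=_trace_file, all_threads=True)
```

`py-spy dump --pid <pid>` gives a similar picture without modifying the training code.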

JamesZhutheThird avatar Oct 21 '22 03:10 JamesZhutheThird

Did you make any progress? @JamesZhutheThird

jinxixiang avatar Oct 22 '22 12:10 jinxixiang

Well, I ran it on another node for a whole day, and it was just fine. So I think the original node is to blame. Rebooting that node might work, but I haven't tried it yet.
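If you suspect a flaky node, a quick communication smoke test can confirm it before committing to another long run. A standalone sketch (not from the OFA repo; the filename is made up), launched with `torchrun --nproc_per_node=8 allreduce_check.py`, that simply hammers the same ALLREDUCE collective that timed out:

```python
# allreduce_check.py -- hypothetical sanity check for a node's GPUs and interconnect.
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))

    for _ in range(1000):
        x = torch.ones(1024, 1024, device="cuda")
        dist.all_reduce(x)          # same collective type as in the watchdog logs
        torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print("all_reduce loop finished; node looks healthy")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this loop also hangs or times out on the suspect node but passes elsewhere, that points to the hardware/driver on that node rather than the training code.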

JamesZhutheThird avatar Oct 22 '22 12:10 JamesZhutheThird