LaVIN
Questions about training related to NCCL
Thanks for your excellent work. I ran into the following error when executing train.py:
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15927, OpType=ALLREDUCE, Timeout(ms)=5400000) ran for 5403515 milliseconds before timing out.
This happens after the log message "begin synchronizing" is printed. Do you know how to solve this problem?
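In case it helps with diagnosis, here is a minimal sketch that pulls the fields out of the watchdog line above. `parse_watchdog_line` is a hypothetical helper written only for illustration; it is not part of LaVIN or PyTorch, and the regular expression is an assumption based on this one log line. It makes the numbers easier to read: the configured timeout of 5400000 ms is 90 minutes, and the ALLREDUCE on rank 0 ran just past that before the watchdog fired.

```python
import re
from datetime import timedelta

# Watchdog line copied verbatim from the error above.
LOG_LINE = (
    "[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation "
    "timeout: WorkNCCL(SeqNum=15927, OpType=ALLREDUCE, Timeout(ms)=5400000) "
    "ran for 5403515 milliseconds before timing out."
)

# Assumed pattern, derived only from the single line above.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?"
    r"SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
    r" ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Hypothetical helper: extract rank, op type and timings from one log line."""
    match = PATTERN.search(line)
    if match is None:
        return None  # line does not look like the watchdog message above
    return {
        "rank": int(match["rank"]),
        "seq": int(match["seq"]),
        "op": match["op"],
        # Convert milliseconds into timedelta for readable printing.
        "timeout": timedelta(milliseconds=int(match["timeout_ms"])),
        "elapsed": timedelta(milliseconds=int(match["elapsed_ms"])),
    }


info = parse_watchdog_line(LOG_LINE)
print("rank:", info["rank"], "op:", info["op"])
print("timeout:", info["timeout"], "elapsed:", info["elapsed"])
```

So the collective did not fail quickly; it waited for the full timeout window after "begin synchronizing" before the process group gave up.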