Questions about training related to NCCL

Open air-tea opened this issue 1 year ago • 3 comments

Thanks for your excellent work. I run into the following error when I execute train.py.

[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15927, OpType=ALLREDUCE, Timeout(ms)=5400000) ran for 5403515 milliseconds before timing out.

This happens after the log message "begin synchronizing" is printed. Do you know how to solve this problem?

air-tea Aug 23 '23 07:08