Vincent Haines

2 comments by Vincent Haines

I got a similar error while fine-tuning Mixtral: [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=8378, OpType=REDUCE, NumelIn=100700160, NumelOut=100700160, Timeout(ms)=1800000) ran for 1800848 milliseconds before timing...

There is no error before that. Basically, it would train fine for up to 40 steps and then say it timed out. I think it's something to do with communication between...
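
To make the numbers in that watchdog message easier to read, here is a minimal sketch that pulls the fields out of the log line and converts the timeout to minutes. It is not from the original comment: the helper name `parse_watchdog` and the regular expression are assumptions written against this one message, and other PyTorch versions may format the line differently.

```python
import re

# Watchdog line as quoted in the comment (cut off where the original is cut off).
LOG_LINE = (
    "[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=8378, OpType=REDUCE, NumelIn=100700160, NumelOut=100700160, "
    "Timeout(ms)=1800000) ran for 1800848 milliseconds before timing"
)

# Hypothetical pattern matching the message format shown above.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?"
    r"SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<ran_ms>\d+) milliseconds"
)


def parse_watchdog(line):
    """Return the fields of a watchdog timeout line as a dict, or None if no match."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    # Every captured field is an integer except the operation type.
    info = {k: (v if k == "op" else int(v)) for k, v in match.groupdict().items()}
    info["timeout_min"] = info["timeout_ms"] / 60000
    # How far past the configured timeout the operation ran before being flagged.
    info["overrun_ms"] = info["ran_ms"] - info["timeout_ms"]
    return info


info = parse_watchdog(LOG_LINE)
print(info)
```

Read this way, the REDUCE on rank 1 was flagged 848 ms after a 30-minute timeout. That would be consistent with a collective that never completed (for example, one rank not joining it) rather than one that was merely slow, though the message alone does not prove this. If a longer limit is needed while investigating, `torch.distributed.init_process_group` accepts a `timeout` argument as a `datetime.timedelta`, and setting the `NCCL_DEBUG=INFO` environment variable makes NCCL log more detail.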