Comments by socraties (1 result)
> > Roughly how long was it stuck? This codebase has not been optimized for MoE, so training really is slower. The 30B MoE runs at about the same speed as the 38B model, so it is not necessarily hung.
> >
> > [@Weiyun1025](https://github.com/Weiyun1025) After 30 minutes, NCCL timed out:
> >
> > [rank3]:[E902 12:24:58.761850774 ProcessGroupNCCL.cpp:632] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=77419, OpType=_ALLGATHER_BASE, NumelIn=98304, NumelOut=1572864, Timeout(ms)=1800000) ran for 1800024 milliseconds...
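
For anyone reading the watchdog line above, here is a minimal sketch that pulls the fields out of it and checks that `Timeout(ms)=1800000` is the 30 minutes mentioned in the reply. The helper name `parse_watchdog_timeout` is hypothetical and not part of PyTorch or NCCL; it assumes the log line has the shape quoted in this thread, and the line is cut off where the original is truncated.

```python
import re

# Log line copied from the comment above (the truncated tail is left out).
LOG_LINE = (
    "[rank3]:[E902 12:24:58.761850774 ProcessGroupNCCL.cpp:632] [Rank 3] "
    "Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=77419, OpType=_ALLGATHER_BASE, NumelIn=98304, "
    "NumelOut=1572864, Timeout(ms)=1800000) ran for 1800024 milliseconds"
)


def parse_watchdog_timeout(line):
    """Hypothetical helper: extract key=value fields from a watchdog timeout line."""
    # Collect pairs such as SeqNum=77419 or Timeout(ms)=1800000.
    fields = dict(re.findall(r"(\w+(?:\(ms\))?)=(\w+)", line))
    ran = re.search(r"ran for (\d+) milliseconds", line)
    timeout_ms = int(fields["Timeout(ms)"])
    ran_ms = int(ran.group(1))
    return {
        "op_type": fields["OpType"],
        "seq_num": int(fields["SeqNum"]),
        "timeout_ms": timeout_ms,
        "timeout_minutes": timeout_ms / 60_000,
        # How far past the configured timeout the collective ran.
        "overrun_ms": ran_ms - timeout_ms,
    }


info = parse_watchdog_timeout(LOG_LINE)
print(info["op_type"], info["timeout_minutes"], info["overrun_ms"])
```

The parsed values only confirm that the all-gather waited for the full configured timeout before the watchdog fired. They do not by themselves show whether the job was hung or just slow.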