wangdaw2023

Results: 4 comments of wangdaw2023

Looking at other projects' expert-parallel (EP) codebases such as DeepEP, Megatron, and SGLang: different GPUs load different expert weights, the forward/backward pass needs a dispatch step that splits the activations, each selected expert runs its computation on its own GPU, and a combine step merges the results at the end (see the sketch below). This code is much more complex than minimind's implementation.
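
A minimal dispatch/compute/combine sketch of top-1 expert parallelism in PyTorch, assuming one expert per rank, an already-initialized process group (tensors on GPU for the NCCL backend), and illustrative names (`hidden`, `router_logits`, `expert_mlp`); real EP stacks like DeepEP/Megatron/SGLang add top-k routing, gating weights, capacity limits, and fused kernels:

```python
import torch
import torch.distributed as dist

def ep_forward(hidden, router_logits, expert_mlp):
    """hidden: [tokens, dim]; router_logits: [tokens, world_size], one expert per rank."""
    world_size = dist.get_world_size()

    # 1. Route: top-1 expert choice per token (real MoE uses top-k + gating weights).
    expert_idx = router_logits.argmax(dim=-1)  # [tokens]

    # 2. Dispatch: group tokens by target expert, exchange per-rank counts,
    #    then all-to-all so each rank gets the tokens routed to its local expert.
    order = expert_idx.argsort()
    sorted_tokens = hidden[order].contiguous()
    send_counts = torch.bincount(expert_idx, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)
    in_splits, out_splits = send_counts.tolist(), recv_counts.tolist()
    recv_buf = sorted_tokens.new_empty(sum(out_splits), hidden.size(1))
    dist.all_to_all_single(recv_buf, sorted_tokens,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits)

    # 3. Compute: run the local expert on whatever tokens arrived.
    expert_out = expert_mlp(recv_buf)

    # 4. Combine: reverse the all-to-all, then undo the sort.
    combined = torch.empty_like(sorted_tokens)
    dist.all_to_all_single(combined, expert_out,
                           output_split_sizes=in_splits,
                           input_split_sizes=out_splits)
    out = torch.empty_like(combined)
    out[order] = combined
    return out
```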

If NaN shows up sporadically during training, you can check whether the loss is NaN and skip that step, leaving the weights unchanged (sketch below).
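
A minimal sketch of that skip logic, assuming hypothetical `compute_loss`, `model`, `optimizer`, and `batch` names. Under DDP every rank must make the same skip decision or later collectives desynchronize, so the NaN flag is all-reduced first:

```python
import torch
import torch.distributed as dist

loss = compute_loss(model, batch)

# Flag NaN/Inf; take the max across ranks so all ranks agree on skipping.
bad = (~torch.isfinite(loss)).float()
if dist.is_initialized():
    dist.all_reduce(bad, op=dist.ReduceOp.MAX)

if bad.item() > 0:
    optimizer.zero_grad(set_to_none=True)  # skip: no backward, no step
else:
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```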

@FrankLeeeee @teadross, we encountered this issue with 4 A800 nodes. The root cause is slow weight loading, which triggers the torch DDP c10d watchdog timeout. You need to update the sglang code...
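
A general workaround sketch (not necessarily the exact sglang change the comment alludes to): initialize the process group with a longer timeout so the c10d/NCCL watchdog does not fire while slow weight loading blocks the first collective. The NCCL default is on the order of 10 minutes:

```python
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=1),  # raise above the watchdog default
)
```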

Try --tp 16 and set --dist-init-addr to the same value on both nodes. Apart from --node-rank, all other parameters should be identical across the two nodes (see the example below).
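
For reference, a two-node launch along these lines, following sglang's documented multi-node flags; the model path and address are placeholders:

```bash
# node 0
python -m sglang.launch_server --model-path <model> \
    --tp 16 --nnodes 2 --node-rank 0 --dist-init-addr 10.0.0.1:5000

# node 1: identical command except --node-rank
python -m sglang.launch_server --model-path <model> \
    --tp 16 --nnodes 2 --node-rank 1 --dist-init-addr 10.0.0.1:5000
```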