wangdaw2023
For expert parallelism (EP), refer to codebases from other projects such as DeepEP, Megatron, and SGLang: different GPUs load different expert weights, the forward/backward pass has to dispatch (split) the activations, each selected expert runs its computation on its own GPU, and the results are finally combined. The code is much more complex than minimind's implementation; see the sketch below.
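A minimal sketch of the dispatch / expert-compute / combine pattern (not the actual DeepEP/Megatron/SGLang implementation), assuming one expert per rank, top-1 routing, and an illustrative function name `ep_forward`:

```python
import torch
import torch.distributed as dist

def ep_forward(tokens, router_logits, expert):
    """tokens: [n, d]; router_logits: [n, world_size]; expert: this rank's FFN."""
    world = dist.get_world_size()
    dest = router_logits.argmax(dim=-1)          # top-1 expert (= rank) per token

    # --- dispatch: group tokens by destination rank and exchange them ---
    order = dest.argsort()
    sorted_tokens, sorted_dest = tokens[order], dest[order]
    send = [sorted_tokens[sorted_dest == r] for r in range(world)]
    counts = torch.tensor([len(s) for s in send], device=tokens.device)
    recv_counts = torch.empty_like(counts)
    dist.all_to_all_single(recv_counts, counts)  # how many tokens arrive from each rank
    recv = [torch.empty(int(c), tokens.size(1), dtype=tokens.dtype,
                        device=tokens.device) for c in recv_counts]
    dist.all_to_all(recv, send)                  # exchange activations

    # --- local expert computation on the tokens this rank received ---
    out = [expert(x) for x in recv]

    # --- combine: send results back to their source ranks, restore order ---
    back = [torch.empty_like(s) for s in send]
    dist.all_to_all(back, out)
    return torch.cat(back)[order.argsort()]
```

Real implementations additionally wrap the communication in autograd-aware ops (so the backward pass performs the mirrored combine/dispatch), handle top-k routing with gating weights, and fuse the exchanges into custom kernels.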
If NaN occurs sporadically during training, you can check whether the loss is NaN and skip that step, leaving the weights unchanged.
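A minimal sketch of that skip logic, with `model`, `optimizer`, and `batch` as placeholders:

```python
import torch

def train_step(model, optimizer, batch):
    optimizer.zero_grad(set_to_none=True)
    loss = model(batch)
    if torch.isnan(loss):   # occasional NaN: skip this step, keep old weights
        return None
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that under DDP all ranks must agree to skip together (e.g. by all-reducing the NaN flag first), otherwise the ranks desynchronize.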
@FrankLeeeee @teadross , we encountered this issue with 4 A800 nodes. The cause is slow weight loading, which triggers the torch DDP c10d watchdog timeout. You need to update the sglang code...
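One common mitigation (an assumption on my part, not necessarily the code change referred to above) is to raise the collective timeout so the c10d watchdog tolerates the slow load; the default is on the order of minutes:

```python
from datetime import timedelta
import torch.distributed as dist

# A longer timeout (1 hour is an arbitrary example) keeps the watchdog
# from firing while weights are still loading.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))
```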
Try --tp 16 and set --dist-init-addr to the same value on both nodes. Apart from --node-rank (0 on one node, 1 on the other), all other parameters should be identical.
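For example (hypothetical model path and address; node 1's command is identical except for --node-rank):

```bash
# node 0
python -m sglang.launch_server --model-path <model> --tp 16 \
    --nnodes 2 --node-rank 0 --dist-init-addr 10.0.0.1:5000
# node 1: same command with --node-rank 1
```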