wangdaw2023
For expert parallelism (EP), refer to codebases from other projects such as DeepEP, Megatron, and SGLang: different GPUs load different expert weights, the forward/backward pass has to dispatch (split) the activations, each selected expert runs its computation on its own GPU, and the results are finally combined. The code is much more complex than minimind's implementation; see the sketch below.
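A minimal sketch of the dispatch / expert-compute / combine pattern (not the actual DeepEP/Megatron/SGLang implementation), assuming one expert per rank, top-1 routing, and an illustrative function name `ep_forward`:

```python
import torch
import torch.distributed as dist

def ep_forward(tokens, router_logits, expert):
    """tokens: [n, d]; router_logits: [n, world_size]; expert: this rank's FFN."""
    world = dist.get_world_size()
    dest = router_logits.argmax(dim=-1)          # top-1 expert (= rank) per token

    # --- dispatch: group tokens by destination rank and exchange them ---
    order = dest.argsort()
    sorted_tokens, sorted_dest = tokens[order], dest[order]
    send = [sorted_tokens[sorted_dest == r] for r in range(world)]
    counts = torch.tensor([len(s) for s in send], device=tokens.device)
    recv_counts = torch.empty_like(counts)
    dist.all_to_all_single(recv_counts, counts)  # how many tokens arrive from each rank
    recv = [torch.empty(int(c), tokens.size(1), dtype=tokens.dtype,
                        device=tokens.device) for c in recv_counts]
    dist.all_to_all(recv, send)                  # exchange activations

    # --- local expert computation on the tokens this rank received ---
    out = [expert(x) for x in recv]

    # --- combine: send results back to their source ranks, restore order ---
    back = [torch.empty_like(s) for s in send]
    dist.all_to_all(back, out)
    return torch.cat(back)[order.argsort()]
```

Real implementations additionally wrap the communication in autograd-aware ops (so the backward pass performs the mirrored combine/dispatch), handle top-k routing with gating weights, and fuse the exchanges into custom kernels.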
If NaN occurs sporadically during training, you can check whether the loss is NaN and skip that step, leaving the weights unchanged.
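A minimal sketch of that skip logic, with `model`, `optimizer`, and `batch` as placeholders:

```python
import torch

def train_step(model, optimizer, batch):
    optimizer.zero_grad(set_to_none=True)
    loss = model(batch)
    if torch.isnan(loss):   # occasional NaN: skip this step, keep old weights
        return None
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that under DDP all ranks must agree to skip together (e.g. by all-reducing the NaN flag first), otherwise the ranks desynchronize.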
@FrankLeeeee @teadross , we encountered this issue with 4 A800 nodes. The cause is slow weight loading, which triggers the torch DDP c10d watchdog timeout. You need to update the sglang code...
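One common mitigation (an assumption on my part, not necessarily the code change referred to above) is to raise the collective timeout so the c10d watchdog tolerates the slow load; the default is on the order of minutes:

```python
from datetime import timedelta
import torch.distributed as dist

# A longer timeout (1 hour is an arbitrary example) keeps the watchdog
# from firing while weights are still loading.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))
```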
Try --tp 16 and set --dist-init-addr to the same value on both nodes. Apart from --node-rank (0 on one node, 1 on the other), all other parameters should be identical.
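For example (hypothetical model path and address; node 1's command is identical except for --node-rank):

```bash
# node 0
python -m sglang.launch_server --model-path <model> --tp 16 \
    --nnodes 2 --node-rank 0 --dist-init-addr 10.0.0.1:5000
# node 1: same command with --node-rank 1
```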