[WIP][Distributed] FlashEP: A Flexible and General Strategy for Deep Communication-Computation Overlap for Mixture-of-Experts
PR Category
Distributed Strategy
PR Types
New features
Description
In large-scale, highly sparse Mixture-of-Experts (MoE) training, cross-machine communication accounts for a significant portion of end-to-end training time, which degrades overall performance. To address this, we propose FlashEP, a flexible and general strategy for deep communication-computation overlap in MoE, built on DeepEP and tailored to large-scale, highly sparse MoE training scenarios. The core idea is to split the full all-to-all communication at expert granularity, decoupling the data dependency between all-to-all communication and expert computation as much as possible so that expert computation can overlap with all-to-all communication. FlashEP is a flexible, non-intrusive communication solution: users do not need to modify their GEMM kernels, and only need to wrap expert computation into forward and backward interfaces and hand them to FlashEP (see the sketch below). In addition, FlashEP uses an asymmetric communication pattern so that no redundant traffic is sent over the network links during the dispatch and combine phases, preserving communication throughput.
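To make the interface requirement concrete, here is a minimal sketch of how expert computation could be packaged into forward and backward callables. The wrapping pattern is the point; the SwiGLU-style FFN, the use of `paddle.grad` for the local backward, and the assumption that FlashEP consumes such a callable pair are illustrative assumptions, not the actual FlashEP API.

```python
# Minimal sketch, assuming FlashEP consumes a pair of user-supplied
# forward/backward callables per expert. The SwiGLU-style FFN below and
# the use of paddle.grad for the local backward are illustrative only;
# they are not the actual FlashEP interface.
import paddle
import paddle.nn.functional as F


def make_expert_fns(w_gate, w_up, w_down):
    """Build the forward/backward pair for one expert's FFN."""

    def expert_forward(x):
        # Detach so the local expert graph is self-contained.
        x = x.detach()
        x.stop_gradient = False
        h = F.silu(paddle.matmul(x, w_gate)) * paddle.matmul(x, w_up)
        out = paddle.matmul(h, w_down)
        # Return the output plus whatever the backward pass needs.
        return out, (x, out)

    def expert_backward(grad_out, ctx):
        x, out = ctx
        # Reuse autograd on the cached local graph to get the input
        # gradient that would feed the backward combine.
        # (Weight gradients are omitted here for brevity.)
        (grad_x,) = paddle.grad(
            outputs=[out], inputs=[x], grad_outputs=[grad_out]
        )
        return grad_x

    return expert_forward, expert_backward
```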
In our tests, FlashEP delivers up to a 20% speedup at the operator level and up to a 10% improvement in end-to-end training performance.
Fine-grained all2all overlap: all-to-all communication is split at expert granularity so that expert computation runs concurrently with the remaining dispatch and combine traffic (see the sketch below).
Asymmetric token communication: FlashEP uses an asymmetric communication pattern for the dispatch and combine phases.
Zero redundancy in communication links: the asymmetric pattern ensures no redundant data travels over the network links during dispatch and combine, preserving communication throughput.
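The fine-grained overlap can be pictured as a per-expert pipeline: while expert i is being computed, the dispatch for expert i+1 and the combine for expert i-1 are already in flight. The sketch below illustrates this scheduling with hypothetical asynchronous `dispatch_async`/`combine_async` primitives and wait()-able handles; the real DeepEP/FlashEP kernels differ in detail.

```python
# Conceptual sketch of expert-granularity overlap. The asynchronous
# primitives (dispatch_async, combine_async) and their wait()-able
# handles are hypothetical stand-ins for the real communication kernels.
def overlapped_moe_forward(tokens_per_expert, expert_forwards,
                           dispatch_async, combine_async):
    num_experts = len(expert_forwards)

    # Kick off the first expert's dispatch before any computation.
    dispatch_handles = [dispatch_async(tokens_per_expert[0], expert_id=0)]
    combine_handles = []

    for i in range(num_experts):
        # Pre-issue the next expert's dispatch so its communication
        # overlaps with the current expert's computation.
        if i + 1 < num_experts:
            dispatch_handles.append(
                dispatch_async(tokens_per_expert[i + 1], expert_id=i + 1)
            )

        local_tokens = dispatch_handles[i].wait()   # tokens routed to this rank
        expert_out, _ = expert_forwards[i](local_tokens)

        # Send results back asynchronously; later experts keep the links busy.
        combine_handles.append(combine_async(expert_out, expert_id=i))

    # Gather the combined outputs once all experts have finished.
    return [h.wait() for h in combine_handles]
```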
Dev PR: https://github.com/PaddlePaddle/Paddle/pull/76497
/re-run all-failed