[WIP][Distributed] FlashEP: A Flexible and General Strategy for Deep Communication-Computation Overlap for Mixture-of-Experts
PR Category
Distributed Strategy
PR Types
New features
Description
In large-scale, highly sparse Mixture-of-Experts (MoE) training, cross-machine communication accounts for a significant portion of end-to-end training time, which degrades overall performance. To address this, we propose FlashEP, a flexible and general strategy for deep communication-computation overlap in MoE, built on DeepEP and tailored to large-scale, highly sparse MoE training scenarios. The core idea is to split the full all-to-all communication at expert granularity, decoupling the data dependency between all-to-all communication and expert computation as much as possible so that expert computation can overlap with all-to-all communication. FlashEP is a flexible, non-intrusive communication solution: users do not need to modify their GEMM kernels, and only need to wrap expert computation into forward and backward interfaces and hand them to FlashEP (see the sketch below). In addition, FlashEP uses an asymmetric communication pattern so that no redundant traffic is sent over the network links during the dispatch and combine phases, preserving communication throughput.
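To make the interface requirement concrete, here is a minimal sketch of how expert computation could be packaged into forward and backward callables. The wrapping pattern is the point; the SwiGLU-style FFN, the use of `paddle.grad` for the local backward, and the assumption that FlashEP consumes such a callable pair are illustrative assumptions, not the actual FlashEP API.

```python
# Minimal sketch, assuming FlashEP consumes a pair of user-supplied
# forward/backward callables per expert. The SwiGLU-style FFN below and
# the use of paddle.grad for the local backward are illustrative only;
# they are not the actual FlashEP interface.
import paddle
import paddle.nn.functional as F


def make_expert_fns(w_gate, w_up, w_down):
    """Build the forward/backward pair for one expert's FFN."""

    def expert_forward(x):
        # Detach so the local expert graph is self-contained.
        x = x.detach()
        x.stop_gradient = False
        h = F.silu(paddle.matmul(x, w_gate)) * paddle.matmul(x, w_up)
        out = paddle.matmul(h, w_down)
        # Return the output plus whatever the backward pass needs.
        return out, (x, out)

    def expert_backward(grad_out, ctx):
        x, out = ctx
        # Reuse autograd on the cached local graph to get the input
        # gradient that would feed the backward combine.
        # (Weight gradients are omitted here for brevity.)
        (grad_x,) = paddle.grad(
            outputs=[out], inputs=[x], grad_outputs=[grad_out]
        )
        return grad_x

    return expert_forward, expert_backward
```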
In our tests, FlashEP delivers up to a 20% speedup at the operator level and up to a 10% improvement in end-to-end training performance.
Fine-grained all2all overlap: all-to-all communication is split at expert granularity so that expert computation runs concurrently with the remaining dispatch and combine traffic (see the sketch below).
Asymmetric token communication: FlashEP uses an asymmetric communication pattern for the dispatch and combine phases.
Zero redundancy in communication links: the asymmetric pattern ensures no redundant data travels over the network links during dispatch and combine, preserving communication throughput.
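The fine-grained overlap can be pictured as a per-expert pipeline: while expert i is being computed, the dispatch for expert i+1 and the combine for expert i-1 are already in flight. The sketch below illustrates this scheduling with hypothetical asynchronous `dispatch_async`/`combine_async` primitives and wait()-able handles; the real DeepEP/FlashEP kernels differ in detail.

```python
# Conceptual sketch of expert-granularity overlap. The asynchronous
# primitives (dispatch_async, combine_async) and their wait()-able
# handles are hypothetical stand-ins for the real communication kernels.
def overlapped_moe_forward(tokens_per_expert, expert_forwards,
                           dispatch_async, combine_async):
    num_experts = len(expert_forwards)

    # Kick off the first expert's dispatch before any computation.
    dispatch_handles = [dispatch_async(tokens_per_expert[0], expert_id=0)]
    combine_handles = []

    for i in range(num_experts):
        # Pre-issue the next expert's dispatch so its communication
        # overlaps with the current expert's computation.
        if i + 1 < num_experts:
            dispatch_handles.append(
                dispatch_async(tokens_per_expert[i + 1], expert_id=i + 1)
            )

        local_tokens = dispatch_handles[i].wait()   # tokens routed to this rank
        expert_out, _ = expert_forwards[i](local_tokens)

        # Send results back asynchronously; later experts keep the links busy.
        combine_handles.append(combine_async(expert_out, expert_id=i))

    # Gather the combined outputs once all experts have finished.
    return [h.wait() for h in combine_handles]
```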
Dev PR: https://github.com/PaddlePaddle/Paddle/pull/76497
/re-run all-failed