
[Feature] Per Expert Overlap (PEO)

k-ling3 opened this issue 2 months ago · 2 comments

Background

The key insight of this work is to group the experts so that the communication of some experts can overlap with the computation of others. We call this approach Per Expert Overlap (PEO). Compared to existing methods, our approach has the following advantages:

1. Performance:

  • Compared to Non-overlap, PEO performs better at all batch sizes.

    • For the DPSK model:
      • At batch size 4, PEO achieves an 11% improvement.
      • At batch size 128, PEO achieves a 31% improvement. The larger the batch size, the more significant the gain.
    • For the QWEN model, PEO achieves up to a 51% improvement.
  • Compared to PR 390 (https://github.com/deepseek-ai/DeepEP/pull/390), PEO also performs better.

2. Usability

  • Compared to PR 390, PEO only modifies DeepEP and does not change DeepGEMM, making it easier to adopt.

In short, during the dispatch phase we change the order of communication (by modifying DeepEP) so that some experts receive their tokens first. During the GEMM phase we change the order of computation (by modifying how the inference engine calls DeepGEMM) so that those experts are computed first. In the combine phase, those experts send their results back first. Overall, this lets the communication of some experts overlap with the computation of others.
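To make the reordering concrete, here is a minimal sketch of the per-round pipeline. The helper names (dispatch_send, dispatch_recv, grouped_gemm, combine_send, combine_recv) are hypothetical placeholders, not the actual DeepEP/DeepGEMM API, and the sketch assumes sends are asynchronous so the next round's communication can proceed while the current round computes.

```python
# Hypothetical sketch of the PEO pipeline; none of these helpers are real
# DeepEP/DeepGEMM entry points. Sends are assumed to be asynchronous.
def peo_forward(tokens, expert_groups):
    outputs = []
    dispatch_send(tokens, expert_groups[0])      # kick off round 0's communication
    for r, group in enumerate(expert_groups):
        if r + 1 < len(expert_groups):
            # Issue the next round's send before computing this round, so
            # its communication overlaps with this round's GEMMs.
            dispatch_send(tokens, expert_groups[r + 1])
        recv = dispatch_recv(group)              # wait for this group's tokens
        out = grouped_gemm(recv, group)          # UP + DOWN GEMMs for this group
        combine_send(out, group)                 # return results without waiting
        outputs.append(out)
    return combine_recv(outputs)                 # gather all rounds' results
```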


Design

In the original DeepEP, each communication step covers all experts at once. During the dispatch phase, each rank sends tokens to all num_experts experts; during the combine phase, each rank sends tokens from its num_local_experts local experts to all num_ranks ranks.

This solution modifies DeepEP to divide the experts into num_rounds groups, splitting the communication into num_rounds rounds.

  • During the dispatch phase, in each round, each rank sends tokens to num_experts // num_rounds experts.
  • During the combine phase, in each round, each rank sends tokens from num_local_experts // num_rounds local experts to num_ranks ranks.
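As a worked example (with illustrative numbers, not ones taken from the issue), the round split is just integer division over the expert indices:

```python
# Illustrative numbers only: 64 experts, 4 local experts per rank, 2 rounds.
num_experts, num_local_experts, num_rounds = 64, 4, 2

experts_per_round = num_experts // num_rounds              # 32 experts per dispatch round
local_experts_per_round = num_local_experts // num_rounds  # 2 local experts per combine round

# Round r covers the contiguous expert slice [r * 32, (r + 1) * 32).
rounds = [range(r * experts_per_round, (r + 1) * experts_per_round)
          for r in range(num_rounds)]
```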

The process is shown in the following diagram:

[process diagram]

Due to differences in model parameters, deployment scale, and batch size, this solution allows the following adjustable parameters to achieve the best overlap effect in different scenarios:

Parameters for Overlap (a hypothetical configuration combining these is sketched after this list):

  • Overlap method

    • We tested different overlap methods and found that they have different effects. Consider the following options:
      • overlap-1: Perform dispatch recv and GEMM only after all dispatch sends have completed.
        [overlap-1 diagram]
      • overlap-2: Immediately after each dispatch send, perform recv + GEMM.
        [overlap-2 diagram]
      • overlap-3: Immediately after each dispatch send, perform recv + GEMM, and additionally let DeepEP's send and recv overlap with each other.
        [overlap-3 diagram]
      • overlap-4: No overlap between dispatch and GEMM.
        [overlap-4 diagram]
  • num_rounds: Number of rounds for splitting dispatch/combine.

  • deepep_send_num_sms: Number of SMs used for dispatch/combine send.

  • deepep_recv_num_sms: Number of SMs used for dispatch/combine recv.

  • up_deepgemm_num_sms: Number of SMs used for UP GEMM.

  • down_deepgemm_num_sms: Number of SMs used for DOWN GEMM.
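Collecting the knobs above, a configuration might look like the following sketch. The parameter names mirror the list; the values are placeholders, not tuned recommendations.

```python
# Hypothetical PEO configuration; values are placeholders, not recommendations.
peo_config = {
    "overlap_method": "overlap-2",   # one of overlap-1 .. overlap-4 above
    "num_rounds": 2,                 # rounds for splitting dispatch/combine
    "deepep_send_num_sms": 8,        # SMs for dispatch/combine send
    "deepep_recv_num_sms": 8,        # SMs for dispatch/combine recv
    "up_deepgemm_num_sms": 96,       # SMs for the UP GEMM
    "down_deepgemm_num_sms": 96,     # SMs for the DOWN GEMM
}
```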


Performance

Configuration:

  • H20, EP16, QWEN, DPSK

Comparison Methods:

  • non-overlap
  • PR 390 (https://github.com/deepseek-ai/DeepEP/pull/390) (SBO)

Conclusion:

For both DPSK and QWEN, PEO delivers the best performance at almost all batch sizes. For DPSK, PEO achieves a maximum improvement of 31% at batch size 128; for QWEN, a maximum improvement of 50% at batch size 16.

— k-ling3, Nov 13 '25

Hi, to make use of the overlap parameters, do we need to modify the SGLang forward pass and launch GEMM multiple times (to compute the different expert groups)?

— rubbberrabbit, Nov 17 '25

> Hi, to make use of the overlap parameters, do we need to modify the SGLang forward pass and launch GEMM multiple times (to compute the different expert groups)?

Yes. We need to modify the inference engine to launch multiple GEMM operations for different expert groups. You can refer to SGLang's PR for details: https://github.com/sgl-project/sglang/pull/13442

— k-ling3, Nov 18 '25
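As a rough illustration of the engine-side change described in the answer above: instead of one GEMM launch covering all local experts, the forward pass loops over expert groups and launches a GEMM per group as soon as that group's tokens have arrived. Both helpers here (wait_dispatch_recv, gemm_for_group) are hypothetical stand-ins for the engine's own DeepEP synchronization and DeepGEMM call; see the linked SGLang PR for the real change.

```python
# Hypothetical stand-ins, not real SGLang/DeepGEMM functions.
def moe_gemm_per_round(hidden_states, expert_groups):
    outputs = []
    for group in expert_groups:
        wait_dispatch_recv(group)                             # block until this round's tokens arrive
        outputs.append(gemm_for_group(hidden_states, group))  # GEMM for just this group
    return outputs
```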