DeepEP
Can DeepEP run correctly in cudaGraph mode?
Two questions:
- Can DeepEP run normally in cudaGraph mode?
- Does DeepEP perform "dropless" MoE dispatch (i.e. no tokens are discarded even if tokens are heavily routed to a small number of experts)?
- Yes, but only for the normal kernels;
- Yes. If you want to drop tokens, you should do it at the gate (masking some `topk_idx` entries to `-1`); DeepEP supports ignoring `-1` expert selections (nothing is sent in such cases).
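To illustrate the gate-side masking described above, here is a minimal sketch in plain Python. The capacity-based drop policy, the function name `mask_over_capacity`, and the list-of-lists layout are assumptions made for illustration; they are not part of DeepEP, and in a real model `topk_idx` would be a tensor produced by the gate. The only behavior taken from the thread is that entries set to `-1` are ignored by dispatch.

```python
from collections import defaultdict


def mask_over_capacity(topk_idx, capacity):
    """Return a copy of topk_idx where selections beyond an expert's
    capacity are replaced with -1 (hypothetical drop policy)."""
    load = defaultdict(int)  # tokens accepted so far, per expert
    masked = []
    for token_experts in topk_idx:
        row = []
        for expert in token_experts:
            if expert >= 0 and load[expert] < capacity:
                load[expert] += 1
                row.append(expert)
            else:
                # Over capacity (or already masked): mark as "no send".
                row.append(-1)
        masked.append(row)
    return masked


# Four tokens, top-2 routing, every token picks expert 0 first.
topk_idx = [[0, 1], [0, 2], [0, 1], [0, 3]]
masked = mask_over_capacity(topk_idx, capacity=2)
print(masked)
```

With a capacity of 2, only the first two tokens keep expert 0; the later selections of expert 0 become `-1`, so nothing would be sent for them.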
@LyricZhao Thank you. For dropless dispatch, will the utilization still be that fast when gating selection is imbalanced (e.g. all tokens routed to the same GPU)?
> will the utilization still be that fast when gating selection is imbalanced
The overall performance will be bounded by the imbalanced rank. As for the imbalanced rank itself, its utilization should be full.
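A toy model can make this concrete. It assumes, purely for illustration, that each rank's work is proportional to the number of tokens it receives and that a step finishes only when the busiest rank finishes. The function name `rank_utilization` and the token counts are hypothetical, and this is not a measurement of DeepEP.

```python
def rank_utilization(tokens_per_rank):
    """Fraction of the step each rank spends busy, under the toy
    assumption that step time is set by the busiest rank."""
    busiest = max(tokens_per_rank)
    if busiest == 0:
        return [0.0 for _ in tokens_per_rank]
    return [n / busiest for n in tokens_per_rank]


# Same total of 1024 tokens, spread evenly vs. sent to one rank.
balanced = rank_utilization([256, 256, 256, 256])
skewed = rank_utilization([1024, 0, 0, 0])
print(balanced, skewed)
```

In the skewed case the overloaded rank is fully utilized while the others sit idle, and under this model the step takes four times as long as in the balanced case.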
> - Yes, but only for the normal kernels;
In sglang, the low-latency kernels can run normally in cudaGraph mode!
Why only for the normal kernels? @LyricZhao