DeepEP Can DeepEP run correctly in cudaGraph mode?

Two questions:

1. Can DeepEP run normally in cudaGraph mode?
2. Does DeepEP perform "dropless" MoE dispatch (i.e. no tokens are discarded even if tokens are heavily routed to a limited number of experts)?

May 17 '25 13:05 ghostplant

1. Yes, but only for the normal kernels.
2. Yes. If you want to drop tokens, do it at the gate by masking some topk_idx entries to -1. DeepEP supports ignoring an expert selection of -1 (nothing is sent in such cases).
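
To illustrate the masking idea above, here is a minimal sketch of capacity-based token dropping at the gate. It is not DeepEP code: the function name `mask_over_capacity`, the `capacity` parameter, and the first-come-first-served policy are all assumptions made for illustration, and plain Python lists stand in for what would normally be a `topk_idx` tensor.

```python
def mask_over_capacity(topk_idx, num_experts, capacity):
    """Hypothetical helper: set expert selections beyond `capacity` to -1.

    topk_idx: list of per-token lists of selected expert ids (-1 = no selection).
    Returns (masked_topk_idx, per_expert_load).
    """
    load = [0] * num_experts  # tokens accepted by each expert so far
    masked = []
    for token_choices in topk_idx:
        row = []
        for expert in token_choices:
            if expert >= 0 and load[expert] < capacity:
                load[expert] += 1
                row.append(expert)
            else:
                # Over capacity (or already masked): mark as "do not send".
                row.append(-1)
        masked.append(row)
    return masked, load


# Four tokens, top-2 routing, all of them choosing expert 0 first.
topk_idx = [[0, 1], [0, 2], [0, 1], [0, 3]]
masked, load = mask_over_capacity(topk_idx, num_experts=4, capacity=2)
print(masked)  # [[0, 1], [0, 2], [-1, 1], [-1, 3]]
print(load)    # [2, 2, 1, 1]
```

The masked result would then be passed to dispatch in place of the original `topk_idx`. Without such masking, every selection is kept and sent, which is the dropless behavior described in the answer.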

May 22 '25 09:05 LyricZhao

@LyricZhao Thank you. For dropless dispatch, will the utilization still be that fast when gating selection is imbalanced (e.g. all tokens routed to the same GPU)?

May 22 '25 14:05 ghostplant

will the utilization still be that fast when gating selection is imbalanced

The overall performance will be bounded by the imbalanced rank. As for the imbalanced rank itself, its utilization should be full.
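
A rough way to see why the most loaded rank bounds the step is to count how many tokens each rank receives. The sketch below is an illustration only, not DeepEP code. It assumes that experts are laid out contiguously (expert `e` lives on rank `e // experts_per_rank`) and that a token is sent to a rank at most once even when it selects several experts there. The names `rank_loads` and `imbalance_factor` are hypothetical.

```python
def rank_loads(topk_idx, num_ranks, experts_per_rank):
    """Count tokens received per rank (assumed contiguous expert layout)."""
    loads = [0] * num_ranks
    for token_choices in topk_idx:
        # Deduplicate destination ranks; -1 selections are ignored.
        dest_ranks = {e // experts_per_rank for e in token_choices if e >= 0}
        for rank in dest_ranks:
            loads[rank] += 1
    return loads


def imbalance_factor(loads):
    """Ratio of the busiest rank's load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 0.0


# 4 ranks with 2 experts each; almost every token picks experts on rank 0.
topk_idx = [[0, 1], [0, 1], [1, 0], [0, 5]]
loads = rank_loads(topk_idx, num_ranks=4, experts_per_rank=2)
print(loads)                    # [4, 0, 1, 0]
print(imbalance_factor(loads))  # 3.2
```

In this toy case rank 0 receives 4 of the 5 token copies, so the step takes roughly as long as rank 0 needs, while the other ranks mostly wait.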

May 23 '25 02:05 LyricZhao

Yes, but only for the normal kernels;

In sglang, the low-latency kernels can run normally in cudaGraph mode!

Why only for the normal kernels? @LyricZhao

Jun 24 '25 03:06 alpha-baby