
[QUESTION] DeepEP vs original all-to-all

Open zhanjiqing opened this issue 9 months ago • 3 comments

@yanring @ko3n1g I'm very grateful to the NVIDIA experts for quickly integrating the DeepEP code. However, in my single-machine tests, training is actually slower than with the traditional all-to-all method. Do you have any test results I could refer to? My config is PP=2, EP=4 with the following MoE settings:

HIDDEN_SIZE=2048 #7168 #7168
NUM_ATTN_HEADS=128
NUM_LAYERS=8 #61 #61
INTERMEDIATE_SIZE=18432
MOE_INTERMEDIATE_SIZE=2048
MAX_POSITION_EMBEDDINGS=163840

EXTRA_VOCAB_SIZE=1280

Q_LORA_RANK=1536
KV_LORA_RANK=512
QK_NOPE_HEAD_DIM=128
QK_ROPE_HEAD_DIM=64
V_HEAD_DIM=128
ROPE_THETA=10000
SCALE_FACTOR=40
NUM_EXPERTS=32 #256 #64 #256
ROUTER_TOPK=8
NUM_SHARED_EXPERTS=1
MOE_LAYER_FREQ=0 #3

zhanjiqing • Mar 07 '25
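For context, here is a minimal sketch of how a config like the one above is typically mapped onto Megatron-LM launch arguments. The flag names below (e.g. --expert-model-parallel-size, --moe-token-dispatcher-type) reflect recent Megatron-LM releases and should be verified against the version in use; MLA, RoPE, and data arguments are omitted.

```bash
# Hypothetical launch fragment for the PP=2, EP=4 setup above
# (flag names may differ between Megatron-LM releases).
torchrun --nproc_per_node 8 pretrain_gpt.py \
    --num-layers ${NUM_LAYERS} \
    --hidden-size ${HIDDEN_SIZE} \
    --num-attention-heads ${NUM_ATTN_HEADS} \
    --ffn-hidden-size ${INTERMEDIATE_SIZE} \
    --max-position-embeddings ${MAX_POSITION_EMBEDDINGS} \
    --pipeline-model-parallel-size 2 \
    --expert-model-parallel-size 4 \
    --num-experts ${NUM_EXPERTS} \
    --moe-router-topk ${ROUTER_TOPK} \
    --moe-ffn-hidden-size ${MOE_INTERMEDIATE_SIZE} \
    --moe-token-dispatcher-type alltoall
    # ... MLA / RoPE / tokenizer / data arguments omitted
```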

DeepEP is optimized for large-top-k, cross-node EP (EP>8) scenarios. In our experience, the allgather or alltoall dispatchers are recommended for EP<=8.

yanring • Mar 07 '25
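To illustrate the recommendation above: in recent Megatron-LM releases the dispatcher is selected with a single argument, and DeepEP is exposed through the flex dispatcher. The exact flag names (--moe-token-dispatcher-type, --moe-enable-deepep) may differ by version, so treat this as a sketch rather than a definitive setting.

```bash
# EP <= 8 (e.g. single node): prefer the allgather or alltoall dispatcher.
MOE_DISPATCH_ARGS="--moe-token-dispatcher-type alltoall"

# EP > 8 (cross-node, large top-k): DeepEP via the flex dispatcher
# (check the flag names against your Megatron-LM version).
# MOE_DISPATCH_ARGS="--moe-token-dispatcher-type flex --moe-enable-deepep"
```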

> DeepEP is optimized for large-top-k, cross-node EP (EP>8) scenarios. In our experience, the allgather or alltoall dispatchers are recommended for EP<=8.

Thanks for your response. I will test it next on a larger cluster with EP>8.

zhanjiqing • Mar 09 '25

Marking as stale. No activity in 60 days.

github-actions[bot] • May 08 '25

Based on the feedback, it seems the initial question has been addressed with recommendations for DeepEP usage. I’ll close this issue now, but please feel free to reopen it if further assistance is needed.

sbhavani • Jul 18 '25