Profiling with unbalanced batch sizes across ranks
Thank you very much for your contributions to DeepEP! I have a question about latency when the batch size per rank is unbalanced across ranks.
The figure above is my theoretical estimate of how different per-rank batch sizes affect dispatch + combine latency in the decode stage. According to this estimate, a difference in batch size across ranks (i.e., batch size imbalance) has only a small effect on end-to-end decode latency, about (220 - 210) / 210 ≈ 4.8%. Is this theoretical estimate correct? It seems a bit counterintuitive. @sphish
There are two points you haven't considered:
- There is a significant difference in the recv bandwidth between NVLink and RDMA.
- The unbalanced sending caused by unbalanced batch sizes can greatly reduce the fabric utilization.
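To make the first point concrete, here is a toy recv-time model (illustrative only; the token size and the NVLink/RDMA bandwidths below are placeholders that should be replaced with values measured on your own hardware), using the machine-2 token counts from the unbalanced (384/64 per rank) vs balanced (224 per rank) scenario discussed in this thread:

```python
# Toy recv-time model (illustrative only): per-rank receive time splits into an
# NVLink part (intra-node peers) and an RDMA part (inter-node peers), and the
# slower of the two paths dominates.
TOKEN_BYTES = 7168 * 2   # placeholder: one BF16 hidden state per token
BW_NVLINK = 150e9        # placeholder NVLink recv bandwidth, bytes/s
BW_RDMA = 45e9           # placeholder RDMA recv bandwidth, bytes/s

def recv_time_us(nvlink_tokens, rdma_tokens):
    nvl = nvlink_tokens * TOKEN_BYTES / BW_NVLINK
    rdma = rdma_tokens * TOKEN_BYTES / BW_RDMA
    return max(nvl, rdma) * 1e6

# Machine-2 rank, unbalanced (384/64 per rank) vs balanced (224 per rank):
print(recv_time_us(nvlink_tokens=28, rdma_tokens=192))   # ~61 us, RDMA-bound
print(recv_time_us(nvlink_tokens=98, rdma_tokens=112))   # ~36 us, RDMA-bound
# Counting total tokens received (220 vs 210) hides that most of the extra
# traffic lands on the much slower RDMA path.
```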
By the way, may I ask why you have been focusing on the case of unbalanced batch size?
Thank you very much for your detailed explanation! My supervisor has assigned me a research task: optimizing the inference service for large-scale EP. I want to find a sweet spot between balanced batch sizes and balanced KV cache. One of the scenarios I'm optimizing is KV cache balancing with unbalanced batch sizes. However, I don't have a good understanding of how unbalanced batch sizes affect dispatch/combine, so I'm asking for your help as an expert. I hope you can help me~
A relatively simple theoretical calculation method is to consider the number of tokens each GPU receives from the RDMA network, as this is usually the bottleneck. In the unbalanced scenarios you mentioned, each GPU on Machine 2 needs to receive 192 tokens from RDMA (actually, this should be multiplied by top_k), while in the balanced case, each GPU only needs to receive 112 tokens from RDMA. Therefore, the difference between them should be 192 / 112.
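A minimal sketch of that counting (assuming uniform routing, 2 machines x 8 GPUs, and the 384/64 vs 224 per-rank batch sizes from your scenario; illustrative only):

```python
# Count the tokens each GPU receives over RDMA (illustrative sketch only).
# Assumptions: 2 machines x 8 GPUs, experts spread uniformly so each source
# rank sends batch/16 tokens to every destination rank, and only cross-machine
# traffic goes over RDMA. Multiply by top_k for real token counts; the ratio
# between the two cases is unchanged.
NUM_MACHINES, GPUS_PER_MACHINE = 2, 8
NUM_RANKS = NUM_MACHINES * GPUS_PER_MACHINE

def rdma_recv_tokens(batch_per_rank, dst_rank):
    dst_machine = dst_rank // GPUS_PER_MACHINE
    return sum(batch_per_rank[src] / NUM_RANKS
               for src in range(NUM_RANKS)
               if src // GPUS_PER_MACHINE != dst_machine)

unbalanced = [384] * 8 + [64] * 8   # machine 1 heavy, machine 2 light
balanced = [224] * 16               # same total number of tokens

print(rdma_recv_tokens(unbalanced, 8))   # (384/16)*8 = 192 tokens
print(rdma_recv_tokens(balanced, 8))     # (224/16)*8 = 112 tokens
print(192 / 112)                         # ~1.71x
```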
Expert, thank you very much for your detailed answer! However, there is still one thing I don't understand: should the tokens received by a rank on machine 2 be estimated in this way?
(1) In the unbalanced case, a rank on machine 2 has to receive tokens from all ranks of machine 1, which is (384/16)*8 = 192 tokens, and also receive tokens from the other ranks of machine 2, which is (64/16)*7 = 28 tokens, so it should receive 220 tokens in total.
(2) In the balanced case, a rank on machine 2 should receive (224/16)*15 = 210 tokens from machine 1 and machine 2 combined. However, my actual evaluation results are as follows; they are close to your estimate, but the performance loss is much larger than what I estimated. (The version I used is 9fe9021, IBGDA only, i.e., the version that uses only RDMA and does not use NVLink.)
Unbalanced:
[rank 0] Dispatch + combine bandwidth: 11.53 GB/s, avg_t=947.00 us, min_t=850.88 us, max_t=1053.18 us
Balanced:
[rank 7] Dispatch + combine bandwidth: 48.17 GB/s, avg_t=573.28 us, min_t=558.05 us, max_t=590.14 us
Finally, I also want to ask how to quantitatively estimate the impact of "the unbalanced sending caused by unbalanced batch sizes can greatly reduce fabric utilization" on the latency of a rank.
Thanks!!!!!
If it’s a pure RDMA version, your estimation method is correct. However, the bottleneck also exists on the sender side. Although I previously mentioned that the send operation is asynchronous, the NIC still needs time to actually transmit the data. In the imbalanced scenario, each GPU on machine 1 needs to send 192 tokens, while in the balanced scenario, it only needs to send 112 tokens. That’s the reason for the time difference.
Notice: I am only referring to theoretical calculations. In actual tests, there are many more factors to consider, such as network routing and congestion control. Also, there is no synchronization in the current benchmark scripts. However, if you want to accurately measure timing in imbalanced scenarios, you need to add extra synchronization between the dispatch and combine phases.
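A rough way to fold the sender side into the same counting sketch (an illustrative model only, not what the kernels actually do): take per-rank time as roughly proportional to the larger of the RDMA token counts sent and received, and look at the worst rank.

```python
# Illustrative bottleneck model (not the actual kernel behavior): per-rank
# time ~ max(RDMA tokens sent, RDMA tokens received); the worst rank dominates.
NUM_MACHINES, GPUS_PER_MACHINE = 2, 8
NUM_RANKS = NUM_MACHINES * GPUS_PER_MACHINE

def rdma_sent_recv(batch_per_rank, rank):
    machine = rank // GPUS_PER_MACHINE
    peers = [r for r in range(NUM_RANKS) if r // GPUS_PER_MACHINE != machine]
    sent = batch_per_rank[rank] / NUM_RANKS * len(peers)
    recv = sum(batch_per_rank[r] / NUM_RANKS for r in peers)
    return sent, recv

unbalanced = [384] * 8 + [64] * 8
balanced = [224] * 16

for name, batches in (("unbalanced", unbalanced), ("balanced", balanced)):
    worst = max(max(rdma_sent_recv(batches, r)) for r in range(NUM_RANKS))
    print(name, worst)
# unbalanced: 192 (machine-1 send / machine-2 recv), balanced: 112.
# Predicted slowdown ~192/112 ~ 1.71, in the same ballpark as the measured
# 947 us / 573 us ~ 1.65 above, rather than 220/210 ~ 1.05.
```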
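As for the synchronization point, here is a minimal sketch of what extra synchronization between the phases could look like in a benchmark loop (the `run_dispatch`/`run_combine` callables are hypothetical stand-ins for whatever your test script actually calls; only the barrier/event pattern matters):

```python
import torch
import torch.distributed as dist

def timed_phase(run_phase, iters=20):
    """Time one phase with all ranks aligned at the start.

    run_phase is a hypothetical callable that wraps your actual DeepEP call
    (e.g. the low-latency dispatch or combine used by your test script).
    """
    torch.cuda.synchronize()
    dist.barrier()                               # align all ranks before timing
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        run_phase()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters       # ms per iteration on this rank

# Usage sketch: time dispatch and combine separately, with a barrier between
# them, so a lightly loaded rank cannot start combine early and hide the skew.
# dispatch_ms = timed_phase(run_dispatch)   # run_dispatch: your own wrapper
# dist.barrier()
# combine_ms = timed_phase(run_combine)     # run_combine: your own wrapper
```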
Thank you very much for your detailed explanation! I ran DeepEP on 4 H800 machines (8 GPUs each), and here are my performance test results (with overlap disabled). I'm not sure whether my settings and the resulting outputs are correct. Could you help me check them? (In the figure, RDMA means setting allow_nvlink_for_low_latency_mode=False and allow_mnnvl=False; NVLINK means both parameters are set to True.) The horizontal axis represents the number of tokens per rank. @sphish
Thanks very much!!!
Thanks both for the discussion. In R1 inference, I also noticed that there can be a significant difference in the DeepEP stage when there is an imbalance in tokens (with/without MTP).
Specifically, without enabling TBO, for one forward pass of the target model:
- Case 1: with bs=128 (without MTP), dispatch takes 41 ms and combine takes 12 ms.
- Case 2: with bs=64 and MTP=1, dispatch takes 72 ms and combine takes 18 ms.
The time was measured during a stable stage of the inference run; that is, in Case 1, bs is a stable 128. However, in Case 2:
- Since MTP is enabled and the acceptance rate varies, some requests may stop early, so the bs on each rank might not be the same (64*2).
- For the draft-extend stage, the acceptance rate may cause an even larger difference in bs across ranks (8, 14, 5, 6, 12, 137, 5 for rank 0-8); see the rough estimate after this comment.
- I also wonder whether the tensor shapes (bs, 1) vs (bs/2, 2) have an impact on performance.
The difference in time between the two cases is quite large. I wonder if anyone has encountered a similar issue? @sphish
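Here is the rough back-of-the-envelope check referenced above (illustrative assumptions: uniform expert routing, every rank sends batch/num_ranks tokens to every other rank, dispatch time bounded by the busiest rank's sent/received token count), using the draft-extend batch sizes quoted in the list:

```python
# Back-of-the-envelope check (illustrative assumptions: uniform expert routing,
# every rank sends batch/num_ranks tokens to every other rank, and the busiest
# rank's max(sent, recv) token count bounds the dispatch time).
batches = [8, 14, 5, 6, 12, 137, 5]   # draft-extend bs per rank, quoted above
n = len(batches)

def sent(r):
    return batches[r] / n * (n - 1)   # tokens rank r sends to the other ranks

def recv(r):
    return sum(batches[s] for s in range(n) if s != r) / n

worst = max(max(sent(r), recv(r)) for r in range(n))
avg = sum(batches) / n
balanced_worst = avg / n * (n - 1)    # same total tokens, evenly spread

print(worst, balanced_worst, worst / balanced_worst)
# ~117 vs ~23 tokens: the single hot rank (bs=137) forces the slowest rank to
# move roughly 5x more data than a perfectly balanced split of the same total.
```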