
Benchmark test over RoCE network

Open yanminjia opened this issue 9 months ago • 18 comments

We ran test_internode.py over a RoCE network with 4 H800 servers, each with 8 GPUs. But the result is quite poor compared with the same 4 H800 servers on an IB network.

Case #1: 4 H800 servers on IB network [image: tuning results]

Case #2: 4 H800 servers on RoCE network

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 8, RDMA chunk 8: 29.92 GB/s (RDMA), 60.35 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 12, RDMA chunk 4: 29.54 GB/s (RDMA), 59.58 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 16: 13.59 GB/s (RDMA), 27.41 GB/s (NVL)

I am not sure whether there are official benchmark results for DeepEP on a RoCE network. Any comments would be highly appreciated.

Many thanks.

yanminjia avatar Mar 19 '25 06:03 yanminjia

  • If you didn't enable adaptive routing in your RoCE network environment, there may be some routing issue. You can check the perf counters on the NIC to see if there is congestion (see the sketch after this list).
  • If you enabled adaptive routing, this might be related to performance issues with RDMA atomics on RoCE networks.
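
Checking the counters might look like the following minimal sketch. It assumes Mellanox/mlx5 NICs; the device names and the exact set of counters exposed under hw_counters depend on the NIC model, driver, and firmware, so adjust the paths for your environment.

```python
# Minimal sketch (assumption: mlx5 NICs): read a few RoCE congestion-related
# hardware counters (ECN marks, CNPs sent/handled, out-of-sequence packets).
from pathlib import Path

COUNTERS = ["np_ecn_marked_roce_packets", "np_cnp_sent",
            "rp_cnp_handled", "out_of_sequence", "out_of_buffer"]

base = Path("/sys/class/infiniband")
devices = sorted(base.glob("mlx5_*")) if base.is_dir() else []
for dev in devices:
    hw = dev / "ports" / "1" / "hw_counters"
    values = {name: int((hw / name).read_text())
              for name in COUNTERS if (hw / name).exists()}
    print(dev.name, values)
```

Sampling these before and after a test_internode.py run and looking at the deltas is usually enough to tell whether ECN marking or retransmissions are happening.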

sphish avatar Mar 19 '25 06:03 sphish

Adaptive routing is not enabled on our RoCE network; in fact, adaptive routing is not always available on RoCE networks. Additionally, as I understand it, adaptive routing improves traffic balance via per-packet hashing on the switch side, and I don't think it can resolve the congestion problem. After all, given DeepEP's traffic behavior, traffic from multiple rdma_ranks will converge on the same target rdma_rank even if adaptive routing is enabled.

yanminjia avatar Mar 19 '25 07:03 yanminjia

Even though traffic from multiple rdma_ranks flows into the same destination rdma_rank, congestion won't occur as long as the traffic is distributed evenly. In DeepEP, the main causes of uneven traffic distribution are:

  1. Uneven routing
  2. Uneven workload
  3. Uneven sending behavior

The latter two factors aren't significant in our test cases. In our experiments, we found that once adaptive routing is enabled to address the uneven routing issue, we observe almost no congestion when running test_internode.py. However, ECMP (Equal-Cost Multi-Path) routing almost always results in routing collisions.
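
To illustrate why static ECMP so often collides here, consider a back-of-the-envelope simulation: with only a few dozen concurrent flows hashed onto a handful of equal-cost uplinks, the most-loaded uplink routinely ends up with roughly twice its fair share, which caps the achievable all-to-all bandwidth. The numbers below (8 uplinks, 24 flows) are illustrative assumptions, and the random bucket choice merely stands in for a switch's static 5-tuple hash:

```python
# Illustrative sketch: static ECMP modeled as a random, fixed flow-to-uplink mapping.
import random
from collections import Counter

UPLINKS = 8       # equal-cost leaf-to-spine paths (assumption)
FLOWS = 24        # e.g. 3 remote nodes x 8 NIC pairs sending to one node
TRIALS = 10_000

worst = 0
for _ in range(TRIALS):
    load = Counter(random.randrange(UPLINKS) for _ in range(FLOWS))
    worst += max(load.values())

print(f"ideal flows per uplink: {FLOWS / UPLINKS:.1f}")
print(f"average most-loaded uplink: {worst / TRIALS:.1f} flows")
```

Adaptive routing (or per-packet spraying) avoids this by not pinning a flow to a single path.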

sphish avatar Mar 19 '25 08:03 sphish

We tested an early version of DeepEP (without atomic operations) on a RoCE network. When using the default ECMP routing, we also observed huge performance degradation. However, after switching to static routing, the performance became nearly identical to that of InfiniBand.

sphish avatar Mar 19 '25 08:03 sphish

I think congestion would happen if traffic from multiple rdma_ranks goes into the same target rdma_rank at the same time, unless some kind of traffic scheduling mechanism were implemented in DeepEP. Therefore, maybe the congestion is mitigated by the credit-based link-level flow control on the IB network.

yanminjia avatar Mar 19 '25 08:03 yanminjia

We tested an early version of DeepEP (without atomic operations) on a RoCE network. When using the default ECMP routing, we also observed huge performance degradation. However, after switching to static routing, the performance became nearly identical to that of InfiniBand.

We can test this case with the 4 GPU servers within the same L2 network block if such an environment is available, so the traffic will not be ECMP-routed. Based on my understanding, the RDMA communication in DeepEP is restricted to GPUs with the same index on different GPU servers.

yanminjia avatar Mar 19 '25 08:03 yanminjia

Hello,

We also tested test_internode.py over a RoCE network with 4 H800 servers, each equipped with 8 GPUs and 8 dual-port CX7 RNICs. We got similar throughput, about half of the throughput on IB:

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 8, RDMA chunk 12: 31.28 GB/s (RDMA), 62.80 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 20, RDMA chunk 12: 31.83 GB/s (RDMA), 63.90 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 8: 31.52 GB/s (RDMA), 63.29 GB/s (NVL)

However, we have not yet modified IBRC in nvshmem to support dual port CX7, as mentioned in https://github.com/deepseek-ai/DeepEP/issues/74. @yanminjia Could you confirm whether the proposed modification to IBRC to enable dual-port utilization is valid and expected to improve performance? Additionally, if there are any specific considerations or potential risks associated with this change, please advise.

Thank you for your guidance!

VoidStardust avatar Mar 25 '25 02:03 VoidStardust

Hello,

We also tested test_internode.py over a RoCE network with 4 H800 servers, each equipped with 8 GPUs and 8 dual-port CX7 RNICs. We got similar throughput, about half of the throughput on IB:

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 8, RDMA chunk 12: 31.28 GB/s (RDMA), 62.80 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 20, RDMA chunk 12: 31.83 GB/s (RDMA), 63.90 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 8: 31.52 GB/s (RDMA), 63.29 GB/s (NVL)

However, we have not yet modified IBRC in nvshmem to support dual port CX7, as mentioned in #74. @yanminjia Could you confirm whether the proposed modification to IBRC to enable dual-port utilization is valid and expected to improve performance? Additionally, if there are any specific considerations or potential risks associated with this change, please advise.

Thank you for your guidance!

In theory, it should improve performance. After all, only half of the bandwidth is utilized with the current nvshmem code in the case of dual-port CX7. As #74 mentioned, you should ensure that the RMA traffic and the associated AMO operations of the same channel go through the same QP; otherwise, DeepEP would hang.

yanminjia avatar Mar 26 '25 00:03 yanminjia

We tested an early version of DeepEP (without atomic operations) on a RoCE network. When using the default ECMP routing, we also observed huge performance degradation. However, after switching to static routing, the performance became nearly identical to that of InfiniBand.

We can test this case with the 4 GPU servers within the same L2 network block if such an environment is available, so the traffic will not be ECMP-routed. Based on my understanding, the RDMA communication in DeepEP is restricted to GPUs with the same index on different GPU servers.

@sphish We tested 4 H100 servers located in the same block over the RoCE network. In this case, the traffic reaches the target GPUs through the layer-2 network instead of being routed to the spine switches via the ECMP mechanism. But the performance is still poor compared with the IB network.

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 8, RDMA chunk 16: 37.93 GB/s (RDMA), 76.50 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 20, RDMA chunk 8: 40.07 GB/s (RDMA), 80.82 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 12: 41.14 GB/s (RDMA), 82.98 GB/s (NVL)

Therefore, based on my understanding, DeepEP's dispatch or combine is in essence equivalent to an AlltoAll communication, and congestion has a significant impact on DeepEP's performance. From our observations, the performance of DeepEP is basically consistent with the performance of AlltoAll over the RoCE network. For example, if we mitigate congestion by tuning the DCQCN parameters, the performance of both DeepEP and AlltoAll improves.
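
For what it's worth, a plain NCCL all-to-all micro-benchmark run on the same fabric is a useful baseline for the comparison described above. The sketch below is not part of DeepEP; the message size and launch setup (e.g. torchrun across the 4 nodes) are assumptions, and since intra-node traffic goes over NVLink, the reported figure is an algorithmic per-rank bandwidth rather than a pure RDMA number:

```python
# Hypothetical baseline: time torch.distributed.all_to_all_single and report
# per-rank bandwidth, to compare against DeepEP's dispatch/combine numbers.
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")    # launch with torchrun
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    numel = 128 * 1024 * 1024                  # 256 MB of BF16 per rank (assumed size)
    send = torch.randn(numel, dtype=torch.bfloat16, device="cuda")
    recv = torch.empty_like(send)

    for _ in range(5):                         # warm-up
        dist.all_to_all_single(recv, send)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_to_all_single(recv, send)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # Each rank ships (world - 1) / world of its buffer to other ranks.
    moved = send.numel() * send.element_size() * (world - 1) / world
    if rank == 0:
        print(f"all-to-all: {moved / elapsed / 1e9:.2f} GB/s per rank")

if __name__ == "__main__":
    main()
```

If this baseline also drops sharply relative to IB, the bottleneck is the fabric (routing and congestion control) rather than anything DeepEP-specific.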

yanminjia avatar Mar 26 '25 01:03 yanminjia

We tested an early version of DeepEP (without atomic operations) on a RoCE network. When using the default ECMP routing, we also observed huge performance degradation. However, after switching to static routing, the performance became nearly identical to that of InfiniBand.

We can test this case with the 4 GPU servers within the same L2 network block if such an environment is available, so the traffic will not be ECMP-routed. Based on my understanding, the RDMA communication in DeepEP is restricted to GPUs with the same index on different GPU servers.

@sphish We tested 4 H100 servers located in the same block over the RoCE network. In this case, the traffic reaches the target GPUs through the layer-2 network instead of being routed to the spine switches via the ECMP mechanism. But the performance is still poor compared with the IB network.

[tuning] Best dispatch (FP8): SMs 24, NVL chunk 8, RDMA chunk 16: 37.93 GB/s (RDMA), 76.50 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 20, RDMA chunk 8: 40.07 GB/s (RDMA), 80.82 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 12: 41.14 GB/s (RDMA), 82.98 GB/s (NVL)

Therefore, based on my understanding, DeepEP's dispatch or combine is in essence equivalent to an AlltoAll communication, and congestion has a significant impact on DeepEP's performance. From our observations, the performance of DeepEP is basically consistent with the performance of AlltoAll over the RoCE network. For example, if we mitigate congestion by tuning the DCQCN parameters, the performance of both DeepEP and AlltoAll improves.

Have you observed any congestion in your NIC counters?

Additionally, I suggest you try this version (e995aa22db1f5d204e5841cc8e73c670c772494b) to see if atomic operations are causing the performance degradation.

sphish avatar Mar 26 '25 01:03 sphish

Have you observed any congestion in your NIC counters?

Yes, indeed. On a RoCE network, if multiple senders transfer data to a single receiver at the same time, congestion occurs on the leaf switch at the egress port connecting to the receiver.

Additionally, I suggest you try this version (e995aa2) to see if atomic operations are causing the performance degradation.

The latest code is used.

yanminjia avatar Mar 26 '25 02:03 yanminjia

Have you observed any congestion in your NIC counters?

Yes, indeed. On a RoCE network, if multiple senders transfer data to a single receiver at the same time, congestion occurs on the leaf switch at the egress port connecting to the receiver.

I see. This does indeed differ from what we have observed.

Additionally, I suggest you try this version (e995aa2) to see if atomic operations are causing the performance degradation.

The latest code is used.

I mean you could try that earlier version.

sphish avatar Mar 26 '25 02:03 sphish

Please note that RoCE is quite different from IB with respect to flow control. I'm quite surprised that you didn't see any traffic congestion when running DeepEP over RoCE. I had assumed the DeepEP traffic sent to the experts was carefully scheduled to prevent congestion, but I didn't find such a mechanism when checking the DeepEP code. Based on our experiments, I believe it's the congestion that leads to the performance degradation. Possibly we could devise a smart traffic scheduling mechanism to fix this problem on RoCE networks.

yanminjia avatar Mar 26 '25 05:03 yanminjia

Due to the short communication time per round in DeepEP, traffic scheduling is difficult to implement. Instead, we eliminate congestion by evenly spreading traffic. DeepEP incorporates multiple mechanisms to accomplish this.

We tested an early version of DeepEP (without atomic operations) on a RoCE network. When using the default ECMP routing, we also observed huge performance degradation. However, after switching to static routing, the performance became nearly identical to that of InfiniBand.

We have indeed tested this approach on RoCE networks without observing congestion, which may also be related to the PFC watermark configuration and the buffer size of the switches.
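
As a quick way to see whether PFC is actually kicking in on the host side during a run, the per-priority pause counters reported by ethtool can be sampled. This is just a sketch; the interface names below are placeholders and the exact counter names (e.g. rx_prioX_pause / tx_prioX_pause on mlx5) vary by NIC and driver:

```python
# Sketch: dump non-zero pause-related counters for each RoCE interface.
import subprocess

IFACES = [f"eth{i}" for i in range(8)]  # placeholder interface names

def pause_counters(iface):
    try:
        out = subprocess.run(["ethtool", "-S", iface], capture_output=True,
                             text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return {}
    stats = {}
    for line in out.splitlines():
        if "pause" in line and ":" in line:
            name, value = line.split(":", 1)
            stats[name.strip()] = int(value.strip())
    return stats

for iface in IFACES:
    nonzero = {k: v for k, v in pause_counters(iface).items() if v}
    print(iface, nonzero or "no pause counters incremented")
```

Rapidly growing pause counters during dispatch/combine would point at incast congestion and at the PFC headroom/buffer settings mentioned above.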

sphish avatar Mar 26 '25 06:03 sphish

Hi @sphish, by the way, do you have test results on a RoCE network? I'm a little confused about why PFC is configured if DeepEP traffic causes no congestion.

yanminjia avatar Mar 26 '25 07:03 yanminjia

Many thanks. Added you to my WeChat contact list. :)

yanminjia avatar Mar 26 '25 07:03 yanminjia

@yanminjia @sphish We had a similar question when benchmarking DeepEP. May I ask if there have been any recent updates on this issue that you could share? Thanks.

Aleda avatar Apr 09 '25 04:04 Aleda