Under what circumstances should low_latency_mode be set to False?
When initializing the buffer, I need to pass in low_latency_mode; if it is True, low-latency mode is enabled. However, I found that in almost all cases, setting it to True performs better than setting it to False, because we can only use rank 0's IB NIC when this parameter is False.
I would like to ask: under what circumstances should this parameter be set to False?
@LyricZhao Thank you!
> I would like to ask: under what circumstances should this parameter be set to False?
For training and inference prefill, this parameter should be False to save GPU memory.
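For context, a minimal sketch of the two initialization paths as I understand them from the DeepEP README (`group` is an already-initialized torch.distributed process group; the sizes and expert counts are illustrative, not recommendations):

```python
from deep_ep import Buffer

# Training / inference prefill: normal kernels, low_latency_mode=False
# (the default), so no extra low-latency RDMA buffer is allocated.
buffer = Buffer(group, int(1e9), int(1e9))  # num_nvl_bytes, num_rdma_bytes

# Inference decode: low-latency kernels; the RDMA buffer is sized via the
# library's hint helper and is typically much larger, hence the memory cost.
num_rdma_bytes = Buffer.get_low_latency_rdma_size_hint(
    128, 7168, group.size(), 256)  # max dispatch tokens/rank, hidden, ranks, experts
buffer = Buffer(group, 0, num_rdma_bytes,
                low_latency_mode=True, num_qps_per_rank=256 // group.size())
```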
> However, I found that in almost all cases, setting it to True performs better than setting it to False.
That is not expected; they should deliver the same performance. Can you please share more information (GPU, EP size, normal or low-latency kernel)?
During the prefill phase of inference, if this parameter is set to False, all traffic is forwarded through IB NIC 0, which creates a bandwidth bottleneck on that NIC and hurts end-to-end throughput.
If it is set to True, traffic is balanced across the NICs, so I get higher bandwidth and the overall throughput is higher.
My experimental environment is 48 A800 GPUs, and each node is equipped with 4 IB NICs.
@LyricZhao
For the 8-GPU 4-NIC configuration, NVSHMEM sometimes cannot handle this topology correctly. You may need to manually set NVSHMEM_HCA_LIST and NVSHMEM_HCA_PE_MAPPING to map each GPU to the appropriate NIC.
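For example (a sketch, not verified against your topology): on an 8-GPU / 4-NIC node you could map two consecutive local PEs onto each HCA. The mlx5_* names are placeholders for whatever `ibv_devinfo` or `nvidia-smi topo -m` reports on your machine:

```python
import os

# Map 8 local PEs onto 4 HCAs, 2 PEs per NIC.
# Entry format: <hca_name>:<port>:<pe_count> (device names here are examples).
os.environ["NVSHMEM_HCA_PE_MAPPING"] = "mlx5_0:1:2,mlx5_1:1:2,mlx5_2:1:2,mlx5_3:1:2"

# This must be set before NVSHMEM initializes, i.e. before the DeepEP
# buffer is created in this process.
```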
@sphish Thanks for your reply!
I had already set these two environment variables before raising this issue, but the NIC traffic is still imbalanced in normal mode.
However, after I set low_latency_mode to True, the NIC traffic becomes balanced and higher throughput can be achieved.
So I have some doubts here.
When you set low_latency_mode = False, each rank requires a different environment variable setting. It is recommended to also set NVSHMEM_DEBUG=INFO to verify that your GPU is selecting the correct network interface.
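A sketch of what that per-rank setting could look like in the launcher, assuming a LOCAL_RANK variable (e.g., from torchrun) and the example mlx5_* device names; adapt both to your topology:

```python
import os

# One HCA per pair of local ranks (8 GPUs, 4 NICs): ranks 0/1 -> mlx5_0, etc.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
hcas = ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]
os.environ["NVSHMEM_HCA_LIST"] = hcas[local_rank // 2]

# Verify in the logs that each PE picked the intended interface.
os.environ["NVSHMEM_DEBUG"] = "INFO"

# ... create the DeepEP buffer afterwards; NVSHMEM reads these at init time.
```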
I see, so it means that when low_latency_mode = False, I need to set a different NVSHMEM_HCA_LIST for each GPU rank?
And if I do that, will the traffic across the IB NICs become balanced? @sphish Thanks!
I think so.
@pc-neo
Can you show your GPU and NIC topology? Command: nvidia-smi topo -m.
Can you also show all the environment variables you configured?
@sphish I'm also facing this same challenge when running the internode tests with 2 nodes of 8x H200. Each node has eight 400 Gb/s IB NICs. The scenarios that work with low_latency_mode = False are:
- When NVSHMEM_HCA_PE_MAPPING=1, all ranks use device_id 0 (mlx5_0): very poor performance (6 GB/s RDMA bandwidth).
- When NVSHMEM_HCA_PE_MAPPING=0 and NVSHMEM_IBGDA_ENABLE_MULTI_PORT=1, each rank uses all devices, which is not optimal: slightly better (10 GB/s RDMA bandwidth).
In comparison, when running the internode test with --test-ll-compatibility, the PE mapping works correctly with NVSHMEM_HCA_PE_MAPPING="mlx5_0:1:1,mlx5_1:1:1,mlx5_2:1:1,mlx5_3:1:1,mlx5_4:1:1,mlx5_5:1:1,mlx5_6:1:1,mlx5_7:1:1", and the performance is 50+ GB/s.
Can you explain how I should set NVSHMEM_HCA_LIST or NVSHMEM_HCA_PE_MAPPING for each GPU rank separately when not using the low-latency options? Does this mean modifying the source code that controls how devices are allocated?