Under what circumstances should low_latency_mode be set to False?
When initializing the buffer, I need to pass in low_latency_mode; if it is True, low-latency mode is enabled. However, I found that in almost all cases, setting it to True performs better than setting it to False, because we can only use rank 0's IB NIC when this parameter is False.
I would like to ask: under what circumstances should this parameter be set to False?
@LyricZhao Thank you!
> I would like to ask: under what circumstances should this parameter be set to False?
For training and inference prefill, this parameter should be False to save GPU memory.
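For context, a minimal sketch of the two initialization paths as I understand them from the DeepEP README (`group` is an already-initialized torch.distributed process group; the sizes and expert counts are illustrative, not recommendations):

```python
from deep_ep import Buffer

# Training / inference prefill: normal kernels, low_latency_mode=False
# (the default), so no extra low-latency RDMA buffer is allocated.
buffer = Buffer(group, int(1e9), int(1e9))  # num_nvl_bytes, num_rdma_bytes

# Inference decode: low-latency kernels; the RDMA buffer is sized via the
# library's hint helper and is typically much larger, hence the memory cost.
num_rdma_bytes = Buffer.get_low_latency_rdma_size_hint(
    128, 7168, group.size(), 256)  # max dispatch tokens/rank, hidden, ranks, experts
buffer = Buffer(group, 0, num_rdma_bytes,
                low_latency_mode=True, num_qps_per_rank=256 // group.size())
```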
> However, I found that in almost all cases, setting it to True performs better than setting it to False.
That is not expected; they should deliver the same performance. Can you please share more information (GPU, EP size, normal or low-latency kernel)?
During the prefill phase of inference, if this parameter is set to False, all traffic is forwarded through IB NIC 0, which creates a bandwidth bottleneck on that NIC and hurts end-to-end throughput.
If it is set to True, traffic is balanced across the NICs, so I get higher bandwidth and the overall throughput is higher.
My experimental environment is 48 A800 GPUs, and each node is equipped with 4 IB NICs.
@LyricZhao
For the 8-GPU 4-NIC configuration, NVSHMEM sometimes cannot handle this topology correctly. You may need to manually set NVSHMEM_HCA_LIST and NVSHMEM_HCA_PE_MAPPING to map each GPU to the appropriate NIC.
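For example (a sketch, not verified against your topology): on an 8-GPU / 4-NIC node you could map two consecutive local PEs onto each HCA. The mlx5_* names are placeholders for whatever `ibv_devinfo` or `nvidia-smi topo -m` reports on your machine:

```python
import os

# Map 8 local PEs onto 4 HCAs, 2 PEs per NIC.
# Entry format: <hca_name>:<port>:<pe_count> (device names here are examples).
os.environ["NVSHMEM_HCA_PE_MAPPING"] = "mlx5_0:1:2,mlx5_1:1:2,mlx5_2:1:2,mlx5_3:1:2"

# This must be set before NVSHMEM initializes, i.e. before the DeepEP
# buffer is created in this process.
```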
@sphish Thanks for your reply!
I had already set these two environment variables before raising this issue, but the NIC traffic is still imbalanced in normal mode.
However, after I set low_latency_mode to True, the NIC traffic becomes balanced and higher throughput can be achieved.
So I have some doubts here.
When you set low_latency_mode = False, each rank requires a different environment variable setting. It is recommended to also set NVSHMEM_DEBUG=INFO to verify that your GPU is selecting the correct network interface.
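A sketch of what that per-rank setting could look like in the launcher, assuming a LOCAL_RANK variable (e.g., from torchrun) and the example mlx5_* device names; adapt both to your topology:

```python
import os

# One HCA per pair of local ranks (8 GPUs, 4 NICs): ranks 0/1 -> mlx5_0, etc.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
hcas = ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]
os.environ["NVSHMEM_HCA_LIST"] = hcas[local_rank // 2]

# Verify in the logs that each PE picked the intended interface.
os.environ["NVSHMEM_DEBUG"] = "INFO"

# ... create the DeepEP buffer afterwards; NVSHMEM reads these at init time.
```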
I see, so it means that when low_latency_mode = False, I need to set a different NVSHMEM_HCA_LIST for each GPU rank?
And if I do that, will the traffic across the IB NICs become balanced? @sphish Thanks!
I think so.
@pc-neo
Can you show your GPU and NIC topology? Command: nvidia-smi topo -m.
Can you also show all the environment variables you configured?
@sphish I'm also facing this same challenge when running the internode tests with 2 nodes of 8x H200. Each node has eight 400 Gb/s IB NICs. The scenarios that work with low_latency_mode = False are:
- When NVSHMEM_HCA_PE_MAPPING=1, all ranks use device_id 0 (mlx5_0): very poor performance (6 GB/s RDMA bandwidth).
- When NVSHMEM_HCA_PE_MAPPING=0 and NVSHMEM_IBGDA_ENABLE_MULTI_PORT=1, each rank uses all devices, which is not optimal: slightly better (10 GB/s RDMA bandwidth).
In comparison, when running the internode test with --test-ll-compatibility, the PE mapping works correctly with NVSHMEM_HCA_PE_MAPPING="mlx5_0:1:1,mlx5_1:1:1,mlx5_2:1:1,mlx5_3:1:1,mlx5_4:1:1,mlx5_5:1:1,mlx5_6:1:1,mlx5_7:1:1", and the performance is 50+ GB/s.
Can you explain how I should set NVSHMEM_HCA_LIST or NVSHMEM_HCA_PE_MAPPING for each GPU rank separately when not using the low-latency options? Does this mean modifying the source code that controls how devices are allocated?