DeepEP

When running the two-machine test_low_latency.py (EP16), there is a significant difference in the test results between the two machines

Open nannaer opened this issue 6 months ago • 18 comments

When running the two-machine test_low_latency.py (EP16), there is a significant difference in the dispatch and combine test results between the two machines. My version is 9fe9021, and I haven't modified any code; screenshots are below. I'm not sure whether my machines' network is malfunctioning. @sphish

Machine 1 (8 GPUs)

Image

Machine 2 (8 GPUs)

Image

Version

Image

nannaer avatar Jun 24 '25 16:06 nannaer

This seems very similar to #183. We haven’t encountered this problem, but my guess is that it might be caused by misaligned kernel launch times. You can try generating a timeline using the torch profiler to take a closer look.

sphish avatar Jun 25 '25 01:06 sphish
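
For reference, a minimal per-rank timeline capture with torch.profiler — a sketch only, where dispatch_and_combine_once is a hypothetical stand-in for one dispatch+combine iteration of the test rather than a DeepEP API. Opening the traces from both hosts together in chrome://tracing or Perfetto makes misaligned kernel launch times easy to spot.

import torch
from torch.profiler import profile, ProfilerActivity

def capture_timeline(dispatch_and_combine_once, rank: int, num_iters: int = 30):
    # Record CPU launch and CUDA kernel activity for a few iterations.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(num_iters):
            dispatch_and_combine_once()
        torch.cuda.synchronize()
    # One Chrome trace per rank; compare the traces from both hosts to see
    # whether dispatch/combine launch times line up across machines.
    prof.export_chrome_trace(f"timeline_rank{rank}.json")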

We have also encountered this performance issue and identified that it seems to stem from behavioral differences between IB and RoCE. After applying the following configuration, performance normalized.

mlxreg -d mlx5_0 --reg_name ROCE_ACCL --set "adaptive_routing_forced_en=0x1" -y

liuhe-spec avatar Jun 25 '25 02:06 liuhe-spec

I discussed this issue with @liuhe-spec on WeChat, and we strongly suspect it is likely related to RoCE network congestion control.

If possible, you can ask your cluster's network administrators to try the following actions:

  1. Enable adaptive routing
  2. Adjust the DCQCN parameters
  3. Disable congestion control

sphish avatar Jun 25 '25 05:06 sphish

I discussed this issue with @liuhe-spec on WeChat, and we strongly suspect it is likely related to RoCE network congestion control.

If possible, you can ask your cluster's network administrators to try the following actions:

  1. Enable adaptive routing
  2. Adjust the DCQCN parameters
  3. Disable congestion control

Thanks! I will try it!

nannaer avatar Jun 25 '25 05:06 nannaer

@nannaer I'm also experiencing this issue. Are there any updates? Thanks!

polarstormx avatar Jul 02 '25 07:07 polarstormx

We encountered the same issue, but in the opposite direction: the dispatch bandwidth on the master node is low, while the combine bandwidth is high. The problem persists even after swapping the master and worker roles. The issue was resolved after adding dist.barrier().

Image

Can you determine whether the issue stems from network problems or the inference code itself?

sunmac avatar Jul 03 '25 06:07 sunmac

I discussed this issue with @liuhe-spec on WeChat, and we strongly suspect it is likely related to RoCE network congestion control.

If possible, you can ask your cluster's network administrators to try the following actions:

  1. Enable adaptive routing
  2. Adjust the DCQCN parameters
  3. Disable congestion control

We encountered the same issue, but in our RoCE network there is no ECN, which means DCQCN does not work. We have adaptive routing disabled in our environment. Can you explain in more detail the reasoning behind the AR-related actions? Thanks!

rubbberrabbit avatar Jul 03 '25 07:07 rubbberrabbit

We have also encountered this performance issue and identified that it seems to stem from behavioral differences between IB and RoCE. After applying the following configuration, performance normalized.

mlxreg -d mlx5_0 --reg_name ROCE_ACCL --set "adaptive_routing_forced_en=0x1" -y

Does that mean you did indeed encounter congestion while running dispatch and combine, and then got it resolved by enabling AR?

Enjia avatar Jul 03 '25 07:07 Enjia

This seems very similar to #183. We haven’t encountered this problem, but my guess is that it might be caused by misaligned kernel launch times. You can try generating a timeline using the torch profiler to take a closer look.

I have profiled and found the following:

  1. The nsys report (or Chrome JSON trace) showed a serious mismatch between the dispatch and combine kernel times
  2. Synchronizing within a node is easy, so if all GPUs on one host launch dispatch earlier than the other host, they send all their tokens and then wait
  3. The host that launches dispatch later doesn't need to wait in the receive stage, so both its send and its receive finish faster than on the earlier host
  4. In the next combine kernel the roles swap: the late dispatcher waits in the combine's receive stage, while the early dispatcher launches combine later and then runs fast because it doesn't need to wait
  5. After a few iterations, one host always dispatches fast and combines slowly, while the other dispatches slowly and combines fast

In the end, dispatch and combine show a huge performance gap!

elevenxiang avatar Jul 03 '25 09:07 elevenxiang

I discussed this issue with @liuhe-spec on WeChat, and we strongly suspect it is likely related to RoCE network congestion control. If possible, you can ask your cluster's network administrators to try the following actions:

  1. Enable adaptive routing
  2. Adjust the DCQCN parameters
  3. Disable congestion control

Thanks! I will try it!

Hi nannaer,

Did you make any progress? Thanks

elevenxiang avatar Jul 03 '25 11:07 elevenxiang

This seems very similar to #183. We haven’t encountered this problem, but my guess is that it might be caused by misaligned kernel launch times. You can try generating a timeline using the torch profiler to take a closer look.

I have profiled and found the following:

  1. The nsys report (or Chrome JSON trace) showed a serious mismatch between the dispatch and combine kernel times
  2. Synchronizing within a node is easy, so if all GPUs on one host launch dispatch earlier than the other host, they send all their tokens and then wait
  3. The host that launches dispatch later doesn't need to wait in the receive stage, so both its send and its receive finish faster than on the earlier host
  4. In the next combine kernel the roles swap: the late dispatcher waits in the combine's receive stage, while the early dispatcher launches combine later and then runs fast because it doesn't need to wait
  5. After a few iterations, one host always dispatches fast and combines slowly, while the other dispatches slowly and combines fast

In the end, dispatch and combine show a huge performance gap!

You can try adding a dist.barrier() before each all-to-all operation

sphish avatar Jul 03 '25 13:07 sphish
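
As an illustration of that suggestion, a minimal sketch — run_dispatch and run_combine are hypothetical placeholders for the test's low-latency dispatch and combine calls (not the actual DeepEP API names), and group is the process group the test uses:

import torch
import torch.distributed as dist

def one_iteration(run_dispatch, run_combine, group=None):
    # Align the CPU launch point across ranks so one host's dispatch cannot
    # start while the other host is still in the previous combine.
    dist.barrier(group)
    recv = run_dispatch()
    dist.barrier(group)
    out = run_combine(recv)
    # Wait for local kernels so per-phase timings are comparable across ranks.
    torch.cuda.synchronize()
    return out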

This seems very similar to #183. We haven’t encountered this problem, but my guess is that it might be caused by misaligned kernel launch times. You can try generating a timeline using the torch profiler to take a closer look.

I have profiled and found the following:

  1. The nsys report (or Chrome JSON trace) showed a serious mismatch between the dispatch and combine kernel times
  2. Synchronizing within a node is easy, so if all GPUs on one host launch dispatch earlier than the other host, they send all their tokens and then wait
  3. The host that launches dispatch later doesn't need to wait in the receive stage, so both its send and its receive finish faster than on the earlier host
  4. In the next combine kernel the roles swap: the late dispatcher waits in the combine's receive stage, while the early dispatcher launches combine later and then runs fast because it doesn't need to wait
  5. After a few iterations, one host always dispatches fast and combines slowly, while the other dispatches slowly and combines fast

In the end, dispatch and combine show a huge performance gap!

You can try adding a dist.barrier() before each all-to-all operation

Thanks! That works in the pure DeepEP dispatch+combine test, and I see the same send/recv time mismatch in SGLang when it runs cross-node inference, because SGLang does not insert a barrier when calling dispatch or combine

Enjia avatar Jul 03 '25 14:07 Enjia

I discussed this issue with @liuhe-spec on WeChat, and we strongly suspect it is likely related to RoCE network congestion control.

If possible, you can ask your cluster's network administrators to try the following actions:

  1. Enable adaptive routing
  2. Adjust the DCQCN parameters
  3. Disable congestion control

@sphish Hi, I am facing the same issue (the master node has low combine bandwidth but high dispatch bandwidth, and vice versa for the other node) on an IB-based cluster. Do these actions apply to IB as well? (As far as I understand, IB does not have DCQCN or this kind of congestion control.)

Image

I've added group.barrier(), but it doesn't help. It is still far behind.

Image

Kevin-XiongC avatar Jul 30 '25 03:07 Kevin-XiongC

This seems very similar to #183. We haven’t encountered this problem, but my guess is that it might be caused by misaligned kernel launch times. You can try generating a timeline using the torch profiler to take a closer look.

I have profiled and found the following:

  1. The nsys report (or Chrome JSON trace) showed a serious mismatch between the dispatch and combine kernel times
  2. Synchronizing within a node is easy, so if all GPUs on one host launch dispatch earlier than the other host, they send all their tokens and then wait
  3. The host that launches dispatch later doesn't need to wait in the receive stage, so both its send and its receive finish faster than on the earlier host
  4. In the next combine kernel the roles swap: the late dispatcher waits in the combine's receive stage, while the early dispatcher launches combine later and then runs fast because it doesn't need to wait
  5. After a few iterations, one host always dispatches fast and combines slowly, while the other dispatches slowly and combines fast

In the end, dispatch and combine show a huge performance gap!

You can try adding a dist.barrier() before each all-to-all operation

This works. I just tried it with my DeepEP setup and it stabilizes the bandwidth numbers on each rank. For anyone who is still having this problem :)

viralbhadeshiya avatar Oct 01 '25 19:10 viralbhadeshiya

@viralbhadeshiya

def bench_kineto(fn, kernel_names: Union[str, tuple], num_tests: int = 30, suppress_kineto_output: bool = False,
                 trace_path: Optional[str] = None, barrier_comm_profiling: bool = False, num_kernels_per_period: int = 1):
    # Profile
    suppress = suppress_stdout_stderr if suppress_kineto_output else empty_suppress
    with suppress():
        schedule = torch.profiler.schedule(wait=1, warmup=0, active=1, repeat=1)
        with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA], schedule=schedule) as prof:
            for i in range(2):
                # NOTES: use a large kernel and a barrier to eliminate the unbalanced CPU launch overhead
                if barrier_comm_profiling:
                    lhs = torch.randn((8192, 8192), dtype=torch.float, device='cuda')
                    rhs = torch.randn((8192, 8192), dtype=torch.float, device='cuda')
                    lhs @ rhs
                    dist.all_reduce(torch.ones(1, dtype=torch.float, device='cuda'))
                for _ in range(num_tests):
                    fn()
                torch.cuda.synchronize()
                prof.step()

I'm encountering the same situation here, but upon reviewing the code, I see that synchronisation has already been implemented prior to executing this function. Why then is synchronisation being added again before test_func?

kzlxd avatar Nov 02 '25 03:11 kzlxd

@viralbhadeshiya [quoting the bench_kineto code above] I'm encountering the same situation here, but upon reviewing the code, I see that synchronisation has already been implemented prior to executing this function. Why then is synchronisation being added again before test_func?

@kzlxd The synchronization inside bench_kineto() ensures that GPU work launched within the profiled window has completed before advancing the profiler step; it only aligns CUDA kernel completion with the profiler timeline. However, it does not guarantee inter-rank phase alignment. In test_low_latency.py, each rank can enter the dispatch or combine phase at slightly different times due to async collectives and kernel scheduling. This causes one rank's dispatch window to overlap with the other's combine phase, leading to asymmetric per-phase bandwidths even though total throughput is correct.

viralbhadeshiya avatar Nov 04 '25 23:11 viralbhadeshiya
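
In code terms (based only on the bench_kineto snippet quoted above), the distinction looks roughly like this — a sketch, not the upstream implementation:

import torch
import torch.distributed as dist

def align_before_phase(group=None):
    # Intra-rank only: waits for this rank's queued CUDA kernels to finish,
    # which places the profiler step boundary correctly but says nothing
    # about where the other ranks currently are.
    torch.cuda.synchronize()
    # Inter-rank: every rank blocks here until all ranks arrive, so the next
    # dispatch/combine phase starts at roughly the same time on both hosts.
    # The barrier_comm_profiling path above approximates this with a large
    # matmul followed by an all_reduce.
    dist.barrier(group)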

@elevenxiang how do you get a profile of DeepEP? I passed in trace_path = f'{rank}fp8_dispatch_combine.json', but received the error: RuntimeError: Trace is already saved.

kzlxd avatar Nov 11 '25 14:11 kzlxd

@viralbhadeshiya thanks for the explanation

kzlxd avatar Nov 11 '25 14:11 kzlxd