When running the two-machine test_low_latency.py (EP16), there is a significant difference in the test results between the two machines
When running the two-machine test_low_latency.py (EP16), the dispatch and combine test results differ significantly between the two machines. My version is commit 9fe9021, and I haven't modified any code. Results are shown below. I'm not sure whether my machines' network is malfunctioning. @sphish
[Screenshots: test output for Machine 1 (8 GPUs), Machine 2 (8 GPUs), and version info]
This seems very similar to #183. We haven’t encountered this problem, but my guess is that it might be caused by misaligned kernel launch times. You can try generating a timeline using the torch profiler to take a closer look.
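A minimal sketch of capturing such a timeline (the `rank` variable and the `run_dispatch_and_combine` callable are placeholders for whatever your test loop actually runs):

```python
import torch

def capture_timeline(rank, run_dispatch_and_combine, num_iters=20):
    # Record CUDA kernel activity so the dispatch/combine launch times of each
    # rank can be compared side by side in chrome://tracing or Perfetto.
    with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU,
                        torch.profiler.ProfilerActivity.CUDA]) as prof:
        for _ in range(num_iters):
            run_dispatch_and_combine()
        torch.cuda.synchronize()
    # One trace file per rank; open the files from both hosts in the same viewer
    # to see whether the kernels on one host consistently start earlier.
    prof.export_chrome_trace(f'rank{rank}_timeline.json')
```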
We have also encountered this performance issue and traced it to what appears to be a behavioral difference between IB and RoCE. After applying the following configuration, performance returned to normal.
mlxreg -d mlx5_0 --reg_name ROCE_ACCL --set "adaptive_routing_forced_en=0x1" -y
I discussed this issue with @liuhe-spec on WeChat, and we strongly suspect it is likely related to RoCE network congestion control.
If possible, you can ask your cluster's network administrators to try the following actions:
- Enable adaptive routing
- Adjust the DCQCN parameters
- Disable congestion control
I discussed this issue with @liuhe-spec on WeChat, and we strongly suspect it is likely related to RoCE network congestion control.
If possible, you can ask your cluster's network administrators to try the following actions:
- Enable adaptive routing
- Adjust the DCQCN parameters
- Disable congestion control
Thanks! I will try it!
@nannaer I'm also experiencing this issue. Are there any updates? Thanks!
We encountered the same issue, but in the opposite direction: the dispatch bandwidth on the master node is low, while the combine bandwidth is high. The problem persists even after swapping the master and worker roles.
The issue was resolved after adding dist.barrier().
Can you determine whether the issue stems from network problems or the inference code itself?
I discussed this issue with @liuhe-spec on WeChat, and we strongly suspect it is likely related to RoCE network congestion control.
If possible, you can ask your cluster's network administrators to try the following actions:
- Enable adaptive routing
- Adjust the DCQCN parameters
- Disable congestion control
We encounter the same issue, but in our RoCE network there is no ECN, which means DCQCN does not work. We have also disabled adaptive routing in our environment. Can you explain in more detail the reasoning behind the AR recommendation? Thanks!
We have also encountered this performance issue and traced it to what appears to be a behavioral difference between IB and RoCE. After applying the following configuration, performance returned to normal.
mlxreg -d mlx5_0 --reg_name ROCE_ACCL --set "adaptive_routing_forced_en=0x1" -y
Does that mean you did encounter congestion while running dispatch and combine, and then resolved it by enabling AR?
This seems very similar to #183. We haven’t encountered this problem, but my guess is that it might be caused by misaligned kernel launch times. You can try generating a timeline using the torch profiler to take a closer look.
I profiled this and found the following:
- The nsys report (and the Chrome JSON trace) shows that the dispatch and combine kernel times are badly mismatched between the two hosts.
- Synchronizing inside a node is easy, so if all GPUs on one host launch dispatch earlier than the other host, they send all of their tokens and then sit waiting.
- The host that launches dispatch later does not need to wait during the receive stage, so both its send and its receive finish faster than on the earlier host.
- In the next combine kernel the roles flip: the host whose dispatch launched late now waits in the combine receive stage, while the host whose dispatch launched early starts combine later but runs fast because it no longer has to wait.
- After this repeats a few times, one host always dispatches fast and combines slowly, while the other host dispatches slowly and combines fast.
In the end, dispatch and combine show a huge performance gap!
I discussed this issue with @liuhe-spec on WeChat, and we strongly suspect it is likely related to RoCE network congestion control. If possible, you can ask your cluster's network administrators to try the following actions:
- Enable adaptive routing
- Adjust the DCQCN parameters
- Disable congestion control
Thanks! I will try it!
Hi Manner,
Did you make any progress? Thanks!
This seems very similar to #183. We haven’t encountered this problem, but my guess is that it might be caused by misaligned kernel launch times. You can try generating a timeline using the torch profiler to take a closer look.
I profiled this and found the following:
- The nsys report (and the Chrome JSON trace) shows that the dispatch and combine kernel times are badly mismatched between the two hosts.
- Synchronizing inside a node is easy, so if all GPUs on one host launch dispatch earlier than the other host, they send all of their tokens and then sit waiting.
- The host that launches dispatch later does not need to wait during the receive stage, so both its send and its receive finish faster than on the earlier host.
- In the next combine kernel the roles flip: the host whose dispatch launched late now waits in the combine receive stage, while the host whose dispatch launched early starts combine later but runs fast because it no longer has to wait.
- After this repeats a few times, one host always dispatches fast and combines slowly, while the other host dispatches slowly and combines fast.
In the end, dispatch and combine show a huge performance gap!
You can try adding a dist.barrier before each alltoall operation
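For example, a minimal sketch (the `run_dispatch` / `run_combine` callables are placeholders for however your code wraps the DeepEP calls; only the barrier placement is the point):

```python
import torch.distributed as dist

def one_iteration(run_dispatch, run_combine):
    # Align all ranks before dispatch so no host starts sending while another
    # host is still finishing the previous phase.
    dist.barrier()
    dispatch_out = run_dispatch()

    # Align the ranks again before combine so its measured time is not inflated
    # by waiting for a host that entered the phase late.
    dist.barrier()
    return run_combine(dispatch_out)
```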
This seems very similar to #183. We haven’t encountered this problem, but my guess is that it might be caused by misaligned kernel launch times. You can try generating a timeline using the torch profiler to take a closer look.
I profiled this and found the following:
- The nsys report (and the Chrome JSON trace) shows that the dispatch and combine kernel times are badly mismatched between the two hosts.
- Synchronizing inside a node is easy, so if all GPUs on one host launch dispatch earlier than the other host, they send all of their tokens and then sit waiting.
- The host that launches dispatch later does not need to wait during the receive stage, so both its send and its receive finish faster than on the earlier host.
- In the next combine kernel the roles flip: the host whose dispatch launched late now waits in the combine receive stage, while the host whose dispatch launched early starts combine later but runs fast because it no longer has to wait.
- After this repeats a few times, one host always dispatches fast and combines slowly, while the other host dispatches slowly and combines fast.
In the end, dispatch and combine show a huge performance gap!
You can try adding a dist.barrier before each alltoall operation
Thanks! That works in the pure DeepEP dispatch+combine test. I also see the same send/recv time mismatch in SGLang when it runs cross-node inference, because SGLang does not insert a barrier when calling dispatch or combine.
I discussed this issue with @liuhe-spec on WeChat, and we strongly suspect it is likely related to RoCE network congestion control.
If possible, you can ask your cluster's network administrators to try the following actions:
- Enable adaptive routing
- Adjust the DCQCN parameters
- Disable congestion control
@sphish Hi, I am facing the same issue (master node has low combine bw but high dispatch bw and vice versa for the other node) on an IB-based cluster. Do these actions apply to IB as well (IMO, IB does not have DCQCN and congestion control)?
I've added group.barrier(), but it doesn't help. It is still far behind.
This seems very similar to #183. We haven’t encountered this problem, but my guess is that it might be caused by misaligned kernel launch times. You can try generating a timeline using the torch profiler to take a closer look.
I profiled this and found the following:
- The nsys report (and the Chrome JSON trace) shows that the dispatch and combine kernel times are badly mismatched between the two hosts.
- Synchronizing inside a node is easy, so if all GPUs on one host launch dispatch earlier than the other host, they send all of their tokens and then sit waiting.
- The host that launches dispatch later does not need to wait during the receive stage, so both its send and its receive finish faster than on the earlier host.
- In the next combine kernel the roles flip: the host whose dispatch launched late now waits in the combine receive stage, while the host whose dispatch launched early starts combine later but runs fast because it no longer has to wait.
- After this repeats a few times, one host always dispatches fast and combines slowly, while the other host dispatches slowly and combines fast.
In the end, dispatch and combine show a huge performance gap!
You can try adding a dist.barrier before each alltoall operation
This works. I just tried it with my DeepEP setup and it stabilizes the bandwidth numbers on every rank. For anyone who is still having this problem :)
@viralbhadeshiya
```python
def bench_kineto(fn, kernel_names: Union[str, tuple], num_tests: int = 30, suppress_kineto_output: bool = False,
                 trace_path: Optional[str] = None, barrier_comm_profiling: bool = False,
                 num_kernels_per_period: int = 1):
    # Profile
    suppress = suppress_stdout_stderr if suppress_kineto_output else empty_suppress
    with suppress():
        schedule = torch.profiler.schedule(wait=1, warmup=0, active=1, repeat=1)
        with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA], schedule=schedule) as prof:
            for i in range(2):
                # NOTES: use a large kernel and a barrier to eliminate the unbalanced CPU launch overhead
                if barrier_comm_profiling:
                    lhs = torch.randn((8192, 8192), dtype=torch.float, device='cuda')
                    rhs = torch.randn((8192, 8192), dtype=torch.float, device='cuda')
                    lhs @ rhs
                    dist.all_reduce(torch.ones(1, dtype=torch.float, device='cuda'))
                for _ in range(num_tests):
                    fn()
                torch.cuda.synchronize()
                prof.step()
```
I'm encountering the same situation here, but upon reviewing the code, I see that synchronisation has already been implemented prior to executing this function. Why then is synchronisation being added again before test_func?
@kzlxd The synchronization inside bench_kineto() ensures that GPU work launched within the profiled window has completed before the profiler advances to the next step; it only aligns CUDA kernel completion with the profiler timeline. However, it does not guarantee inter-rank phase alignment. In test_low_latency.py, each rank can enter the dispatch or combine phase at slightly different times due to async collectives and kernel scheduling. This can cause one rank's dispatch window to overlap with another rank's combine phase, leading to asymmetric per-phase bandwidths even though total throughput is correct.
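A schematic way to see the difference (assuming the default process group is already initialized; this is not DeepEP code):

```python
import torch
import torch.distributed as dist

# Waits only for *this* rank's queued CUDA work to finish. Other ranks may
# still be in the middle of their dispatch or combine phase.
torch.cuda.synchronize()

# Blocks until *every* rank has reached this point, which is what actually
# aligns the dispatch/combine phases across hosts.
dist.barrier()
```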
@elevenxiang How did you get the DeepEP profile? I passed in trace_path = f'{rank}fp8_dispatch_combine.json', but received the error: RuntimeError: Trace is already saved.
@viralbhadeshiya thanks for your explanation