UE8M0 feature (PR 206) causes a severe regression: the low-latency test hangs
In recent SGLang PD disaggregation integration tests, we found the run gets stuck 100% of the time in the DeepEP dispatch-combine call. The low_latency.py unit test stack looks as follows when it hangs:
__torch_function__ (torch/utils/_device.py:104)
calc_diff (utils.py:34)
test_main (test_low_latency.py:96)
test_loop (test_low_latency.py:170)
_wrap (torch/multiprocessing/spawn.py:90)
run (multiprocessing/process.py:108)
_bootstrap (multiprocessing/process.py:314)
_main (multiprocessing/spawn.py:129)
spawn_main (multiprocessing/spawn.py:116)
<module> (<string>:1)
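For context, the calc_diff frame at the top of the stack is the precision check in tests/utils.py. A hedged sketch of the metric it computes (the real implementation operates on torch tensors and may differ in detail; this pure-Python version is for illustration only):

```python
# Sketch of calc_diff: a cosine-similarity-style difference,
# 1 - 2*sum(x*y) / sum(x*x + y*y).
# It is 0 for identical inputs and grows as the tensors diverge.
def calc_diff(x, y):
    denominator = sum(a * a + b * b for a, b in zip(x, y))
    return 1.0 - 2.0 * sum(a * b for a, b in zip(x, y)) / denominator
```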
After running regression tests against historical commits, I traced the issue to this PR: https://github.com/deepseek-ai/DeepEP/pull/206
@shifangx would you please take a look at this or resolve it?
I will take a look at this.
Any progress?
I have not been able to reproduce the issue yet. This test case requires IBGDA, which needs the cluster administrator's help to configure. I have just contacted the administrator to urge them to assist with the setup.
I am also experiencing the same hang. In my case it occurs with a setup of 4 machines (H20*8), while 2 machines work normally. The program is likewise stuck in calc_diff.
I run the low-latency tests on 6*H100 with IB network cards; the hang is 100% reproducible. This blocks me from using the latest version of DeepEP, so I have to stay on an older release.
Would you revert this commit if it cannot be resolved in the short term? @shifangx @zhyncs
I have tested on GB200 with 4 nodes, each with 4 GPUs, and the test passed.
I cannot reproduce this issue on GB200.
@jeffye-dev @polarstormx, can you help run some tests on your systems?
During testing, we go through all possible value combinations of return_recv_hook, dispatch_use_fp8, round_scale, and use_ue8m0.
(1) Could you record which combinations cause the tests to fail?
(2) Could you try skipping the tests where round_scale==True?
In test_main(), skip the round_scale cases as follows, and print return_recv_hook, dispatch_use_fp8, round_scale, and use_ue8m0 for each run.
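A hypothetical sketch of the suggested change (the real loop in tests/test_low_latency.py::test_main may be structured differently; the helper name here is invented for illustration): print every parameter combination before running it, and skip the round_scale==True cases suspected of triggering the hang.

```python
from itertools import product

def combinations_to_run(skip_round_scale: bool = True):
    """Enumerate the parameter combinations test_main() would exercise."""
    combos = []
    for return_recv_hook, dispatch_use_fp8, round_scale, use_ue8m0 in \
            product((False, True), repeat=4):
        # Print before running, so the last line seen identifies the hang.
        print(f'return_recv_hook={return_recv_hook}, '
              f'dispatch_use_fp8={dispatch_use_fp8}, '
              f'round_scale={round_scale}, use_ue8m0={use_ue8m0}', flush=True)
        if skip_round_scale and round_scale:
            continue  # skip the suspected-failing cases
        combos.append((return_recv_hook, dispatch_use_fp8, round_scale, use_ue8m0))
    return combos
```

Flushing the print is important here: if the process hangs or dies mid-test, buffered output would otherwise be lost and the failing combination would be misidentified.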
@shifangx I have also tested on 4*4 GB200, and it runs successfully.
Here are my test results on 4*8 H20.
When do_check is set to False, or the round_scale==True cases are skipped, the test runs successfully.
Otherwise it hangs just before the combination: return_recv_hook: False, dispatch_use_fp8: True, round_scale: True, use_ue8m0: False
@polarstormx Thank you very much for your feedback.
Hi @jeffye-dev, this issue may only occur when round_scale is set to True. You should still be able to use the latest version on H100 by making sure round_scale==False. If you still encounter issues after setting round_scale to False, please leave a comment here.
It seems to be the same problem as in #209, and the reporter there mentioned that it may be caused by wrong environment variables.
@polarstormx, what about 8*4 GB200?
I have tested on GB200. It runs successfully on 2*4, 4*4, and 6*4 GB200, but hangs on 8*4 and 12*4 GB200.
@shifangx Unfortunately, I only have a 4*4 GB200, so I can't test the larger configurations.
What is the root cause?
Root cause: one rank fails at an assert and then exits, while all the other ranks keep waiting for that rank at a synchronization point, which causes the hang.
I ran experiments on 12*4 GB200 and found that after commenting out this assert, the program no longer hangs.
https://github.com/deepseek-ai/DeepEP/blob/c50f3d6fcd800154dc41288fbefe194f33eb59cb/tests/test_low_latency.py#L87-L100
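The failure mode can be illustrated with a minimal sketch (not DeepEP code; thread names and the barrier stand in for ranks and the collective sync point). One "rank" fails its precision assert and exits before reaching the barrier, so the remaining ranks wait on it forever. A 1-second timeout is used here so the demo terminates instead of actually hanging.

```python
import threading

NUM_RANKS = 4
barrier = threading.Barrier(parties=NUM_RANKS)
results = {}

def rank(i: int) -> None:
    try:
        # Rank 0 plays the role of the rank whose precision check fails.
        assert i != 0, f'rank {i}: diff exceeds threshold'
        barrier.wait(timeout=1.0)  # other ranks block here waiting for rank 0
        results[i] = 'ok'
    except AssertionError:
        results[i] = 'assert-failed'  # exits without reaching the barrier
    except threading.BrokenBarrierError:
        results[i] = 'stuck (barrier timed out)'

threads = [threading.Thread(target=rank, args=(i,)) for i in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

In the real multi-node run there is no timeout, so the surviving ranks simply block forever, which is exactly the observed hang in calc_diff's caller.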
@shifangx Thanks! The assert message wasn't printed when it was triggered, which made the issue quite confusing. What is the theoretical lower bound for precision when using ue8m0 for scaling? We should probably relax the check here.
Thanks, after setting round_scale=False the issue is gone. Closing this issue.
@polarstormx See https://github.com/deepseek-ai/DeepEP/pull/292: DeepEP's destructor may cause the Python exception handling process to hang. Using explicitly_destroy allows the assert messages to be printed out.
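The mechanism can be illustrated generically (the class and method names below are hypothetical stand-ins, not DeepEP's actual API): when teardown is moved out of `__del__` into an explicit call inside `try/finally`, the pending AssertionError propagates and prints normally instead of racing interpreter shutdown.

```python
class Buffer:
    """Hypothetical stand-in for a communication buffer; not DeepEP's API."""
    def __init__(self) -> None:
        self.destroyed = False

    def destroy(self) -> None:
        # Explicit, deterministic teardown: runs while Python's exception
        # machinery is still fully intact, unlike __del__ at shutdown.
        self.destroyed = True

def run_test(buffer: Buffer) -> None:
    try:
        assert False, 'diff exceeds threshold'  # the failing precision check
    finally:
        buffer.destroy()  # teardown happens before the process exits

buffer = Buffer()
try:
    run_test(buffer)
except AssertionError as e:
    print(f'AssertionError: {e}')  # the message is reported reliably
```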