UE8M0 feature (PR 206) causes a severe regression: the low-latency test hangs
In recent SGLang PD disaggregation integration tests, we found the run gets stuck 100% of the time in the DeepEP dispatch-combine call. The low_latency.py unit test stack looks as follows when it hangs:
__torch_function__ (torch/utils/_device.py:104)
calc_diff (utils.py:34)
test_main (test_low_latency.py:96)
test_loop (test_low_latency.py:170)
_wrap (torch/multiprocessing/spawn.py:90)
run (multiprocessing/process.py:108)
_bootstrap (multiprocessing/process.py:314)
_main (multiprocessing/spawn.py:129)
spawn_main (multiprocessing/spawn.py:116)
<module> (<string>:1)
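For context, the calc_diff frame at the top of the stack is the precision check in tests/utils.py. A hedged sketch of the metric it computes (the real implementation operates on torch tensors and may differ in detail; this pure-Python version is for illustration only):

```python
# Sketch of calc_diff: a cosine-similarity-style difference,
# 1 - 2*sum(x*y) / sum(x*x + y*y).
# It is 0 for identical inputs and grows as the tensors diverge.
def calc_diff(x, y):
    denominator = sum(a * a + b * b for a, b in zip(x, y))
    return 1.0 - 2.0 * sum(a * b for a, b in zip(x, y)) / denominator
```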
After running regression tests against historical commits, I traced the issue to this PR: https://github.com/deepseek-ai/DeepEP/pull/206
@shifangx would you please take a look at this or resolve it?
I will take a look at this.
Any progress?
I have not been able to reproduce the issue yet. This test case requires IBGDA, which needs the cluster administrator's help to configure. I have just contacted the administrator to urge them to assist with the setup.
I am also experiencing the same hang. In my case it occurs with a setup of 4 machines (H20*8), while 2 machines work normally. The program is likewise stuck in calc_diff.
I run the low-latency tests on 6*H100 with IB network cards; the hang is 100% reproducible. This blocks me from using the latest version of DeepEP, so I have to stay on an older release.
Would you revert this commit if it cannot be resolved in the short term? @shifangx @zhyncs
I have tested on GB200 with 4 nodes, each with 4 GPUs, and the test passed.
I cannot reproduce this issue on GB200.
@jeffye-dev @polarstormx, can you help run some tests on your systems?
During testing, we go through all possible value combinations of return_recv_hook, dispatch_use_fp8, round_scale, and use_ue8m0.
(1) Could you record which combinations cause the tests to fail?
(2) Could you try skipping the tests where round_scale==True?
In test_main(), skip the round_scale cases as follows, and print return_recv_hook, dispatch_use_fp8, round_scale, and use_ue8m0 for each run.
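A hypothetical sketch of the suggested change (the real loop in tests/test_low_latency.py::test_main may be structured differently; the helper name here is invented for illustration): print every parameter combination before running it, and skip the round_scale==True cases suspected of triggering the hang.

```python
from itertools import product

def combinations_to_run(skip_round_scale: bool = True):
    """Enumerate the parameter combinations test_main() would exercise."""
    combos = []
    for return_recv_hook, dispatch_use_fp8, round_scale, use_ue8m0 in \
            product((False, True), repeat=4):
        # Print before running, so the last line seen identifies the hang.
        print(f'return_recv_hook={return_recv_hook}, '
              f'dispatch_use_fp8={dispatch_use_fp8}, '
              f'round_scale={round_scale}, use_ue8m0={use_ue8m0}', flush=True)
        if skip_round_scale and round_scale:
            continue  # skip the suspected-failing cases
        combos.append((return_recv_hook, dispatch_use_fp8, round_scale, use_ue8m0))
    return combos
```

Flushing the print is important here: if the process hangs or dies mid-test, buffered output would otherwise be lost and the failing combination would be misidentified.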
@shifangx I have also tested on 4*4 GB200, and it runs successfully.
Here are my test results on 4*8 H20.
When do_check is set to False, or the round_scale==True cases are skipped, the test runs successfully.
Otherwise it hangs just before the combination: return_recv_hook: False, dispatch_use_fp8: True, round_scale: True, use_ue8m0: False
@polarstormx Thank you very much for your feedback.
Hi @jeffye-dev, this issue may only occur when round_scale is set to True. You should still be able to use the latest version on H100 by making sure round_scale==False. If you still encounter issues after setting round_scale to False, please leave a comment here.
It seems to be the same problem as in #209, and the reporter there mentioned that it may be caused by wrong environment variables.
@polarstormx, what about 8*4 GB200?
I have tested on GB200. It runs successfully on 2*4, 4*4, and 6*4 GB200, but hangs on 8*4 and 12*4 GB200.
@shifangx Unfortunately, I only have a 4*4 GB200, so I can't test the larger configurations.
What is the root cause?
Root cause: one rank fails at an assert and then exits, while all the other ranks keep waiting for that rank at a synchronization point, which causes the hang.
I ran experiments on 12*4 GB200 and found that after commenting out this assert, the program no longer hangs.
https://github.com/deepseek-ai/DeepEP/blob/c50f3d6fcd800154dc41288fbefe194f33eb59cb/tests/test_low_latency.py#L87-L100
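The failure mode can be illustrated with a minimal sketch (not DeepEP code; thread names and the barrier stand in for ranks and the collective sync point). One "rank" fails its precision assert and exits before reaching the barrier, so the remaining ranks wait on it forever. A 1-second timeout is used here so the demo terminates instead of actually hanging.

```python
import threading

NUM_RANKS = 4
barrier = threading.Barrier(parties=NUM_RANKS)
results = {}

def rank(i: int) -> None:
    try:
        # Rank 0 plays the role of the rank whose precision check fails.
        assert i != 0, f'rank {i}: diff exceeds threshold'
        barrier.wait(timeout=1.0)  # other ranks block here waiting for rank 0
        results[i] = 'ok'
    except AssertionError:
        results[i] = 'assert-failed'  # exits without reaching the barrier
    except threading.BrokenBarrierError:
        results[i] = 'stuck (barrier timed out)'

threads = [threading.Thread(target=rank, args=(i,)) for i in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

In the real multi-node run there is no timeout, so the surviving ranks simply block forever, which is exactly the observed hang in calc_diff's caller.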
@shifangx Thanks! The assert message wasn't printed when it was triggered, which made the issue quite confusing. What is the theoretical lower bound for precision when using ue8m0 for scaling? We should probably relax the check here.
Thanks, after setting round_scale=False the issue is gone. Closing this issue.
@polarstormx See https://github.com/deepseek-ai/DeepEP/pull/292: DeepEP's destructor may cause the Python exception handling process to hang. Using explicitly_destroy allows the assert messages to be printed out.
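The mechanism can be illustrated generically (the class and method names below are hypothetical stand-ins, not DeepEP's actual API): when teardown is moved out of `__del__` into an explicit call inside `try/finally`, the pending AssertionError propagates and prints normally instead of racing interpreter shutdown.

```python
class Buffer:
    """Hypothetical stand-in for a communication buffer; not DeepEP's API."""
    def __init__(self) -> None:
        self.destroyed = False

    def destroy(self) -> None:
        # Explicit, deterministic teardown: runs while Python's exception
        # machinery is still fully intact, unlike __del__ at shutdown.
        self.destroyed = True

def run_test(buffer: Buffer) -> None:
    try:
        assert False, 'diff exceeds threshold'  # the failing precision check
    finally:
        buffer.destroy()  # teardown happens before the process exits

buffer = Buffer()
try:
    run_test(buffer)
except AssertionError as e:
    print(f'AssertionError: {e}')  # the message is reported reliably
```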