
[BUG] test_internode hangs and times out on repeated runs

Open polarstormx opened this issue 4 months ago • 3 comments

When test_internode is run in a tight loop for stress testing, it randomly hangs and eventually times out. It seems to be related to the TMA functionality.

Screenshot of the hang: (screenshot attached)

Environment

  • Commit Hash: The latest commit on main (e3908bf5bd0cc6265bcb225d15cd8c996d4759ef)
  • Hardware: Reproducible on both 2/4-node H20 and 2/4-node H200 setups. ConnectX-7 with dual port 200G. RoCEv2 with PFC.
  • Image: NGC 24.02
  • Driver / CUDA Version: 570.148.08 / 12.8

Steps to Reproduce

I modified the test script for stress testing.

https://github.com/deepseek-ai/DeepEP/blob/e3908bf5bd0cc6265bcb225d15cd8c996d4759ef/tests/test_internode.py#L174-L176 Add a return here to skip the performance test:

    if local_rank == 0: 
        print('', flush=True) 
    return

https://github.com/deepseek-ai/DeepEP/blob/e3908bf5bd0cc6265bcb225d15cd8c996d4759ef/tests/test_internode.py#L242-L245 Run the test continuously for multiple rounds:

    for i in range(10000):
        test_main(args, num_sms, local_rank, num_local_ranks, num_ranks, num_nodes, rank, buffer, group)
        if local_rank == 0:
            print('', flush=True)

Moving torch.manual_seed(rank) inside the loop to ensure identical inputs for each iteration does not fix the issue. I also changed the asserts to prints so that a data mismatch does not interrupt the test run, but it still hangs and then times out; calc_diff and check_data report inf or nan values. The modifications are sketched below.
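
For reference, a minimal sketch of those extra modifications, assuming the variable names from tests/test_internode.py (the tolerance below is only illustrative, not the script's actual threshold):

    import torch

    # Stress loop with the seed reset inside, so every iteration sees identical inputs.
    for i in range(10000):
        torch.manual_seed(rank)
        test_main(args, num_sms, local_rank, num_local_ranks, num_ranks,
                  num_nodes, rank, buffer, group)
        if local_rank == 0:
            print('', flush=True)

    # Inside test_main, asserts are downgraded to prints so a mismatch does not
    # abort the run, e.g. `assert calc_diff(x, ref_x) < tolerance` becomes:
    diff = calc_diff(x, ref_x)
    if diff >= tolerance:  # `tolerance` stands in for the original bound
        print(f'[ERROR] calc_diff too large: {diff}', flush=True)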

  • Last Known Good Commit: 7705f53 (the stress test passes reliably on this commit). This is the last commit before the internode TMA feature was merged.
  • First Known Bad Commit: a2fa3b7 (the stress test fails on this commit). This is the first commit that introduces the internode TMA functionality.

polarstormx avatar Aug 18 '25 07:08 polarstormx

I have tried to reproduce this issue following your method, but I was not successful. Could you add me on WeChat (Sphizzz)?

sphish avatar Aug 18 '25 08:08 sphish

Another error:

[config] num_tokens=4096, hidden=7168, num_topk_groups=4, num_topk=8
[layout] Kernel performance: 0.045 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ...[ERROR] Expert 0 token count mismatch: expected 4105, got 4113
[ERROR] Expert 3 token count mismatch: expected 4144, got 4143
[ERROR] Expert 1 token count mismatch: expected 3923, got 3924
[ERROR] Expert 3 token count mismatch: expected 4134, got 4133
[testing] expert 3 mask: tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]], device='cuda:4')
[testing] expert 0 mask: tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]], device='cuda:0')
[testing] expert 3 mask: tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]], device='cuda:7')
[testing] expert 1 mask: tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]], device='cuda:5')
[ERROR] calc_diff exception: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
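
As the error message suggests, the failing kernel could be localized by forcing synchronous launches. A minimal sketch of that setting (CUDA_LAUNCH_BLOCKING has to be set before CUDA is initialized; exporting it in the shell before launching the test works as well):

    import os

    # Make kernel launches synchronous so the error surfaces at the failing
    # kernel's own call site rather than at a later, unrelated API call.
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

    import torch  # import torch (and create the DeepEP buffer) only after setting the flag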

polarstormx avatar Aug 18 '25 09:08 polarstormx

Some other errors (with my assert-to-print modification):

(three screenshots attached)

polarstormx avatar Aug 18 '25 09:08 polarstormx