[test_internode.py] fails with multi-QP: dispatch timeout on RoCE network when testing 2*H20 nodes

Open jeffye-dev opened this issue 8 months ago • 39 comments

When I run the cross-node test with MASTER_ADDR=<ip> MASTER_PORT=30001 WORLD_SIZE=2 RANK=0 python test_internode.py on 2*H20 nodes, I get the following timeout log:

DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 0, nvl: 4, src RDMA lane: 1, dst NVL: 2, meta: 0, 0, 0, 0

terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f718176c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f71817166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7181b73a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f7181b3a92e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f7181b3ba57 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f7181b3bc5f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f718059af70 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f718174d69f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f718174637b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7181746529 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f7180861a98 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f7180861de6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x181758 (0x5570898a4758 in /usr/bin/python)
frame #13: <unknown function> + 0x1949e8 (0x5570898b79e8 in /usr/bin/python)
frame #14: <unknown function> + 0x1949fc (0x5570898b79fc in /usr/bin/python)
frame #15: <unknown function> + 0x1949fc (0x5570898b79fc in /usr/bin/python)
frame #16: <unknown function> + 0x1a08bf (0x5570898c38bf in /usr/bin/python)
frame #17: <unknown function> + 0x15f9d6 (0x5570898829d6 in /usr/bin/python)
frame #18: <unknown function> + 0x2941a7 (0x5570899b71a7 in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x5757 (0x55708989da27 in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x818 (0x557089898ae8 in /usr/bin/python)
frame #22: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x6d2 (0x5570898989a2 in /usr/bin/python)
frame #24: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x1a22 (0x557089899cf2 in /usr/bin/python)
frame #26: <unknown function> + 0x25ae56 (0x55708997de56 in /usr/bin/python)
frame #27: PyEval_EvalCode + 0x86 (0x55708997dd26 in /usr/bin/python)
frame #28: <unknown function> + 0x281ae8 (0x5570899a4ae8 in /usr/bin/python)
frame #29: <unknown function> + 0x27c2ef (0x55708999f2ef in /usr/bin/python)
frame #30: PyRun_StringFlags + 0x81 (0x557089998f61 in /usr/bin/python)
frame #31: PyRun_SimpleStringFlags + 0x41 (0x557089998e11 in /usr/bin/python)
frame #32: Py_RunMain + 0x3d0 (0x557089998140 in /usr/bin/python)
frame #33: Py_BytesMain + 0x2d (0x557089971d6d in /usr/bin/python)
frame #34: <unknown function> + 0x29d90 (0x7f7182671d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: __libc_start_main + 0x80 (0x7f7182671e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x25 (0x557089971c65 in /usr/bin/python)

This issue only happens after the multi-QP patch (https://github.com/deepseek-ai/DeepEP/commit/5ab80c28f3d6c3e4f88ce236f427ab7c81025172) was merged, so it is probably related to multi-QP.
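
The launch command above and the CUDA_LAUNCH_BLOCKING=1 hint printed in the log can be combined when re-running a rank. The following is only a minimal sketch: the helper name build_launch_env is hypothetical (it is not part of DeepEP), and the address in the usage note is a placeholder.

```python
import os


def build_launch_env(master_addr, master_port, world_size, rank, base_env=None):
    """Return an environment dict for one rank of test_internode.py.

    Sets the four rendezvous variables used in the report and adds
    CUDA_LAUNCH_BLOCKING=1, as the error message suggests, so that CUDA
    errors are reported at the call that caused them.
    `base_env` defaults to a copy of the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "MASTER_ADDR": str(master_addr),
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            "RANK": str(rank),
            "CUDA_LAUNCH_BLOCKING": "1",
        }
    )
    return env
```

On each node this could be passed to subprocess.run([sys.executable, "test_internode.py"], env=build_launch_env("<ip>", 30001, 2, rank)), with rank set to 0 or 1. Blocking launches slow the kernels down, so the timing of an intermittent hang may change.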

jeffye-dev avatar Apr 27 '25 06:04 jeffye-dev

Is adaptive routing enabled in your NIC and switch configuration?

sphish avatar Apr 27 '25 07:04 sphish

This change has been successfully tested in our own IB environment and in RoCE environments of several cloud service providers. However, I have indeed discovered that it fails in some RoCE environments where adaptive routing is enabled.

sphish avatar Apr 27 '25 07:04 sphish

AR (adaptive routing) is OFF in my environment. Is this caused by multi-QP? I tried an earlier version and it works.

jeffye-dev avatar Apr 27 '25 07:04 jeffye-dev

@jeffye-dev try setting this variable to True: https://github.com/deepseek-ai/DeepEP/commit/5ab80c28f3d6c3e4f88ce236f427ab7c81025172#diff-c77f4e0d77d8fc685ab907f9ad338f0c168b96ad4313c77b6dff9c7faf0713b9R224

alpha-baby avatar Apr 28 '25 06:04 alpha-baby

https://github.com/deepseek-ai/DeepEP/commit/007fcfcf97914e1f3d661f28dd125e7d1b9f8320#diff-c77f4e0d77d8fc685ab907f9ad338f0c168b96ad4313c77b6dff9c7faf0713b9R222

In the latest version, is setting test_ll_compatibility=False no longer allowed? @sphish

I ask because the test fails when test_ll_compatibility=False and passes when test_ll_compatibility=True.

Failure log:

/root/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 0 nranks 2 tag 0 - ENTER
/root/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 0 nranks 2 tag 1 - DONE
/root/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 0 nranks 2 tag 0 - ENTER
/root/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 0 nranks 2 tag 1 - DONE
[config] num_tokens=4096, hidden=7168, num_topk_groups=2, num_topk=8
[layout] Kernel performance: 0.050 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed

[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 4.11 GB/s (RDMA), 13.47 GB/s (NVL)
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 2, meta: 0, 0, 0, 0
...................
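
With this many timeout lines it can help to count them per message kind and channel before reading them one by one. A minimal sketch, assuming only the two message formats quoted in this thread; the function name tally_timeouts is hypothetical and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the two timeout messages seen in this thread:
#   "DeepEP dispatch NVL receiver timeout, channel: N, ..."
#   "DeepEP dispatch forwarder timeout (RDMA meta), channel: N, ..."
TIMEOUT_RE = re.compile(
    r"DeepEP dispatch (?P<kind>NVL receiver|forwarder) timeout"
    r".*?channel: (?P<channel>\d+)"
)


def tally_timeouts(log_text):
    """Count timeout lines, keyed by (kind, channel)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("kind"), int(match.group("channel")))] += 1
    return counts


SAMPLE = """\
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 7, meta: 0, 0, 0, 0
"""

if __name__ == "__main__":
    for (kind, channel), n in sorted(tally_timeouts(SAMPLE).items()):
        print(f"{kind:13s} channel {channel}: {n}")
```

The tally only groups what the log already says; it does not by itself explain why those channels stalled.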

alpha-baby avatar Apr 28 '25 07:04 alpha-baby

@alpha-baby Is this issue occurring intermittently, or does it happen every time you run the test? I haven't encountered this problem in our own testing environment. Have you recompiled the C code?

sphish avatar Apr 28 '25 07:04 sphish

> @alpha-baby Is this issue occurring intermittently, or does it happen every time you run the test? I haven't encountered this problem in our own testing environment. Have you recompiled the C code?

I am using the code at this commit: https://github.com/deepseek-ai/DeepEP/commit/007fcfcf97914e1f3d661f28dd125e7d1b9f8320#diff-c77f4e0d77d8fc685ab907f9ad338f0c168b96ad4313c77b6dff9c7faf0713b9R222

I only changed the variable to test_ll_compatibility=False in test_internode.py and did not recompile the C code. The failure reproduces every time in my environment, which uses RoCE. @sphish

alpha-baby avatar Apr 28 '25 08:04 alpha-baby

@alpha-baby What I meant is, after switching to this commit, did you recompile the C code? If you haven't done so, you should recompile the C code.

sphish avatar Apr 28 '25 08:04 sphish

> @alpha-baby What I meant is, after switching to this commit, did you recompile the C code? If you haven't done so, you should recompile the C code.

Yes, I recompiled the C code. I found that the new commit really improved performance.

alpha-baby avatar Apr 28 '25 08:04 alpha-baby

@alpha-baby It's quite strange. I can't reproduce this issue. Theoretically, setting test_ll_compatibility=False now only changes how the NVSHMEM group is initialized, and shouldn't affect correctness.

sphish avatar Apr 28 '25 09:04 sphish

I encountered this issue on 2*H800, and it occurs whether test_ll_compatibility is set to True or False. It happens only intermittently, so you may need to run the test several times to reproduce it.
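
Since the failure does not show up on every run, repeating the test and counting failed attempts gives a rough failure rate. A minimal sketch, assuming a crash or hang appears as a non-zero exit code or a timeout; the helper name count_failures is hypothetical, and it wraps a single rank's command only, so the other node's rank still has to be started for each attempt.

```python
import subprocess
import sys


def count_failures(cmd, attempts, timeout_s=None):
    """Run `cmd` `attempts` times and return how many runs failed.

    A run counts as failed if it exits with a non-zero code (including
    being killed by a signal) or exceeds `timeout_s` seconds.
    """
    failures = 0
    for _ in range(attempts):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
            failed = result.returncode != 0
        except subprocess.TimeoutExpired:
            failed = True
        if failed:
            failures += 1
    return failures
```

For example, count_failures([sys.executable, "test_internode.py"], attempts=10, timeout_s=600) would run this rank's test ten times; the attempt count and timeout here are arbitrary values.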

Qizhi697 avatar Apr 29 '25 02:04 Qizhi697

I see the same error after upgrading to the multi-QP related code.

whybeyoung avatar Apr 30 '25 15:04 whybeyoung

> The AR is OFF in my environment. Is it caused by multi-QP. I tried earlier version and find it's working.

Same here.

whybeyoung avatar Apr 30 '25 15:04 whybeyoung

Log:

### RANK0
tests/test_internode.py 
[config] num_tokens=4096, hidden=7168, num_topk_groups=2, num_topk=8
[layout] Kernel performance: 0.075 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed

[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 6.81 GB/s (RDMA), 22.29 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 9.22 GB/s (RDMA), 30.19 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 15.80 GB/s (RDMA), 51.73 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 20.66 GB/s (RDMA), 67.66 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 21.80 GB/s (RDMA), 71.38 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 22.44 GB/s (RDMA), 73.50 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 23.12 GB/s (RDMA), 75.72 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 22.81 GB/s (RDMA), 74.70 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 7.25 GB/s (RDMA), 23.75 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 8.93 GB/s (RDMA), 29.25 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 13.00 GB/s (RDMA), 42.56 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 19.47 GB/s (RDMA), 63.75 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 21.95 GB/s (RDMA), 71.87 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 22.55 GB/s (RDMA), 73.85 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 23.15 GB/s (RDMA), 75.79 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 23.35 GB/s (RDMA), 76.47 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 7.62 GB/s (RDMA), 24.96 GB/s (NVL) 
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 0, nvl: 7, src NVL: 2, head: 255, tail: 255
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 0, nvl: 1, src NVL: 2, head: 280, tail: 280
DeepEP timeout check failed: 0 (rank = 3)
DeepEP timeout check failed: 0 (rank = 4)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5f3b56c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5f3b5166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5f3b94ca18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f5f3b91392e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f5f3b914a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f5f3b914c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f5f3a1faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f5f3b54d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f5f3b54637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f5f3b546529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f5f3a4c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f5f3a4c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f5f3c229d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f5f3c229e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x557fa8861095 in /usr/local/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f1c6516c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f1c651166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f1c65588a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f1c6554f92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f1c65550a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f1c65550c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f1c63dfaf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f1c6514d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f1c6514637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f1c65146529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f1c640c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f1c640c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f1c65e29d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f1c65e29e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x5567420e7095 in /usr/local/bin/python)

DeepEP timeout check failed: 0 (rank = 5)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f97ee4b9446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f97ee4636e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f97ee5a5a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f97ee56c92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f97ee56da57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f97ee56dc5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f97ed1faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f97ee49a69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f97ee49337b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f97ee493529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f97ed4c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f97ed4c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f97ef029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f97ef029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x5591d2ad9095 in /usr/local/bin/python)

DeepEP timeout check failed: 0 (rank = 0)
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f45b976c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f45b97166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f45b9c0aa18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f45b9bd192e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f45b9bd2a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f45b9bd2c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f45b83faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f45b974d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f45b974637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f45b9746529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f45b86c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f45b86c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f45ba429d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f45ba429e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x55a9f1974095 in /usr/local/bin/python)

  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbb7176c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbb717166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbb71c12a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7fbb71bd992e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7fbb71bdaa57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7fbb71bdac5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7fbb703faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7fbb7174d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fbb7174637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fbb71746529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7fbb706c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7fbb706c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7fbb72429d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7fbb72429e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x55cb7794e095 in /usr/local/bin/python)

terminate called after throwing an instance of 'c10::Error'
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 0, nvl: 2, src NVL: 2, head: 287, tail: 287
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f029c36c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f029c3166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f029c77fa18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f029c74692e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f029c747a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f029c747c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f029b3faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f029c34d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f029c34637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f029c346529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f029b6c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f029b6c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f029d229d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f029d229e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x55b74ffbb095 in /usr/local/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc15f36c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc15f3166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc15f728a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7fc15f6ef92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7fc15f6f0a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7fc15f6f0c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7fc15dffaf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7fc15f34d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fc15f34637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fc15f346529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7fc15e2c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7fc15e2c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7fc160029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7fc160029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x56502677a095 in /usr/local/bin/python)

DeepEP timeout check failed: 0 (rank = 6)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbaa036c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbaa03166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbaa070ea18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7fbaa06d592e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7fbaa06d6a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7fbaa06d6c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7fba9effaf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7fbaa034d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fbaa034637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fbaa0346529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7fba9f2c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7fba9f2c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7fbaa1029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7fbaa1029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x5560d043d095 in /usr/local/bin/python)

W0501 13:38:45.990000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 859 via signal SIGTERM
W0501 13:38:45.990000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 860 via signal SIGTERM
W0501 13:38:45.990000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 861 via signal SIGTERM
W0501 13:38:45.990000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 862 via signal SIGTERM
W0501 13:38:45.990000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 864 via signal SIGTERM
W0501 13:38:45.991000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 865 via signal SIGTERM
W0501 13:38:45.991000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 866 via signal SIGTERM
Traceback (most recent call last):
  File "/sgl-workspace/DeepEP/tests/test_internode.py", line 247, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
  File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/sgl-workspace/DeepEP/tests/test_internode.py", line 235, in test_loop
    test_main(i, local_rank, num_local_ranks, num_ranks, num_nodes, rank, buffer, group)
  File "/sgl-workspace/DeepEP/tests/test_internode.py", line 179, in test_main
    t = bench(lambda: buffer.dispatch(**tune_args))[0]
  File "/sgl-workspace/DeepEP/tests/utils.py", line 81, in bench
    fn()
  File "/sgl-workspace/DeepEP/tests/test_internode.py", line 179, in <lambda>
    t = bench(lambda: buffer.dispatch(**tune_args))[0]
  File "/usr/local/lib/python3.10/site-packages/deep_ep-1.0.0+1590a08-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 282, in dispatch
    return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
  File "/usr/local/lib/python3.10/site-packages/deep_ep-1.0.0+1590a08-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 377, in internode_dispatch
    recv_x, recv_x_scales, _, _, _, _, _, _, _, _, _, _, _, _, event = self.runtime.internode_dispatch(
RuntimeError: Failed: CUDA error /sgl-workspace/DeepEP/csrc/kernels/internode.cu:1214 'unspecified launch failure'


### RANK1
python tests/test_internode.py
[config] num_tokens=4096, hidden=7168, num_topk_groups=2, num_topk=8
[layout] Kernel performance: 0.074 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed

[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 6.80 GB/s (RDMA), 22.42 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 9.22 GB/s (RDMA), 30.42 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 15.80 GB/s (RDMA), 52.12 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 20.63 GB/s (RDMA), 68.05 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 21.76 GB/s (RDMA), 71.79 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 22.47 GB/s (RDMA), 74.14 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 23.10 GB/s (RDMA), 76.21 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 22.82 GB/s (RDMA), 75.29 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 7.25 GB/s (RDMA), 23.93 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 8.92 GB/s (RDMA), 29.43 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 13.00 GB/s (RDMA), 42.89 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 19.46 GB/s (RDMA), 64.19 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 21.96 GB/s (RDMA), 72.44 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 22.55 GB/s (RDMA), 74.40 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 23.15 GB/s (RDMA), 76.36 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 23.34 GB/s (RDMA), 77.01 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 7.63 GB/s (RDMA), 25.18 GB/s (NVL) 
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbfc90b9446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbfc90636e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbfc91a5a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7fbfc916c92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7fbfc916da57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7fbfc916dc5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7fbfc7dfaf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7fbfc909a69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fbfc909337b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fbfc9093529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7fbfc80c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7fbfc80c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7fbfc9e29d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7fbfc9e29e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x55eebaf97095 in /usr/local/bin/python)

DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc64fd6c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc64fd166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6501dca18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7fc6501a392e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7fc6501a4a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7fc6501a4c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7fc64e9faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7fc64fd4d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fc64fd4637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fc64fd46529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7fc64ecc1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7fc64ecc1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7fc650a29d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7fc650a29e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x56290437b095 in /usr/local/bin/python)

  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f42f26b9446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f42f26636e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f42f27a5a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f42f276c92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f42f276da57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f42f276dc5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f42f13faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f42f269a69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f42f269337b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f42f2693529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f42f16c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f42f16c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f42f3429d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f42f3429e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x55f1717b4095 in /usr/local/bin/python)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3cfb4b9446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f3cfb4636e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f3cfb5a5a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f3cfb56c92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f3cfb56da57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f3cfb56dc5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f3cfa1faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f3cfb49a69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f3cfb49337b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f3cfb493529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f3cfa4c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f3cfa4c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f3cfc229d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f3cfc229e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x557458bad095 in /usr/local/bin/python)

  what():  CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f76e2f6c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f76e2f166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f76e3365a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f76e332c92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f76e332da57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f76e332dc5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f76e1bfaf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f76e2f4d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f76e2f4637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f76e2f46529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f76e1ec1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f76e1ec1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f76e3c29d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f76e3c29e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x5596561c6095 in /usr/local/bin/python)

W0501 13:38:46.315000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1430 via signal SIGTERM
W0501 13:38:46.315000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1431 via signal SIGTERM
W0501 13:38:46.316000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1432 via signal SIGTERM
W0501 13:38:46.317000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1433 via signal SIGTERM
W0501 13:38:46.317000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1435 via signal SIGTERM
W0501 13:38:46.317000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1436 via signal SIGTERM
Traceback (most recent call last):
  File "/sgl-workspace/DeepEP/tests/test_internode.py", line 247, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
  File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/sgl-workspace/DeepEP/tests/test_internode.py", line 235, in test_loop
    test_main(i, local_rank, num_local_ranks, num_ranks, num_nodes, rank, buffer, group)
  File "/sgl-workspace/DeepEP/tests/test_internode.py", line 179, in test_main
    t = bench(lambda: buffer.dispatch(**tune_args))[0]
  File "/sgl-workspace/DeepEP/tests/utils.py", line 81, in bench
    fn()
  File "/sgl-workspace/DeepEP/tests/test_internode.py", line 179, in <lambda>
    t = bench(lambda: buffer.dispatch(**tune_args))[0]
  File "/usr/local/lib/python3.10/site-packages/deep_ep-1.0.0+1590a08-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 282, in dispatch
    return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
  File "/usr/local/lib/python3.10/site-packages/deep_ep-1.0.0+1590a08-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 377, in internode_dispatch
    recv_x, recv_x_scales, _, _, _, _, _, _, _, _, _, _, _, _, event = self.runtime.internode_dispatch(
RuntimeError: Failed: CUDA error /sgl-workspace/DeepEP/csrc/kernels/internode.cu:1079 'unspecified launch failure'
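
The timeout lines in the log above share a fixed `key: value` layout, so they can be tallied per channel to see which channels stalled on which rank. The following is a minimal sketch and not part of DeepEP: the function name `tally_timeouts` and the regular expression are assumptions based only on the line format visible in this log.

```python
import re
from collections import Counter

# Matches the two timeout line kinds seen in this log (format assumed from the log itself).
TIMEOUT_RE = re.compile(
    r"DeepEP dispatch (?P<kind>NVL receiver|forwarder) timeout"
    r".*?channel: (?P<channel>\d+), RDMA: (?P<rdma>\d+), nvl: (?P<nvl>\d+)"
)

def tally_timeouts(lines):
    """Count timeout lines per (kind, RDMA rank, NVL rank, channel)."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            key = (m["kind"], int(m["rdma"]), int(m["nvl"]), int(m["channel"]))
            counts[key] += 1
    return counts

# Three lines copied from the log above.
sample = [
    "DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 7, start: 0, end: 0",
    "DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0",
    "DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0",
]
counts = tally_timeouts(sample)
for key, n in sorted(counts.items()):
    print(key, n)
```

Feeding the full log through `tally_timeouts` gives a quick per-channel summary that is easier to compare across runs than the raw lines.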

whybeyoung May 01 '25 05:05

My environment (two nodes):

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	SYS	SYS	0-47,96-143	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	PIX	NODE	SYS	SYS	0-47,96-143	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NODE	NODE	SYS	SYS	0-47,96-143	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NODE	PIX	SYS	SYS	0-47,96-143	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	PIX	NODE	48-95,144-191	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	NODE	NODE	48-95,144-191	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	NODE	PIX	48-95,144-191	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	NODE	NODE	48-95,144-191	1		N/A
NIC0	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	 X 	NODE	SYS	SYS
NIC1	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	NODE	 X 	SYS	SYS
NIC2	SYS	SYS	SYS	SYS	PIX	NODE	NODE	NODE	SYS	SYS	 X 	NODE
NIC3	SYS	SYS	SYS	SYS	NODE	NODE	PIX	NODE	SYS	SYS	NODE	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
hca_id:	mlx5_bond_0
	transport:			InfiniBand (0)
	fw_ver:				32.39.3804
	node_guid:			58a2:e103:00d5:28d4
	sys_image_guid:			58a2:e103:00d5:28d4
	vendor_id:			0x02c9
	vendor_part_id:			41692
	hw_ver:				0x1
	board_id:			MT_0000000884
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

hca_id:	mlx5_bond_1
	transport:			InfiniBand (0)
	fw_ver:				32.39.3804
	node_guid:			58a2:e103:00f7:6a90
	sys_image_guid:			58a2:e103:00f7:6a90
	vendor_id:			0x02c9
	vendor_part_id:			41692
	hw_ver:				0x1
	board_id:			MT_0000000884
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

hca_id:	mlx5_bond_2
	transport:			InfiniBand (0)
	fw_ver:				32.39.3804
	node_guid:			58a2:e103:00d8:061c
	sys_image_guid:			58a2:e103:00d8:061c
	vendor_id:			0x02c9
	vendor_part_id:			41692
	hw_ver:				0x1
	board_id:			MT_0000000884
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

hca_id:	mlx5_bond_3
	transport:			InfiniBand (0)
	fw_ver:				32.39.3804
	node_guid:			58a2:e103:00dd:eee6
	sys_image_guid:			58a2:e103:00dd:eee6
	vendor_id:			0x02c9
	vendor_part_id:			41692
	hw_ver:				0x1
	board_id:			MT_0000000884
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
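
The GPU-to-NIC columns of the topology matrix above can be reduced to a "closest NIC per GPU" view, which helps when deciding how to map PEs to NICs. This is a minimal sketch: the matrix values are copied from the output above, while the helper name `closest_nics` and the distance ranking (PIX closest, then PXB, PHB, NODE, SYS, following the legend) are assumptions made for illustration.

```python
# GPU-to-NIC link types copied from the topology matrix above
# (rows GPU0..GPU7, columns NIC0..NIC3).
GPU_NIC = [
    ["NODE", "NODE", "SYS", "SYS"],
    ["PIX", "NODE", "SYS", "SYS"],
    ["NODE", "NODE", "SYS", "SYS"],
    ["NODE", "PIX", "SYS", "SYS"],
    ["SYS", "SYS", "PIX", "NODE"],
    ["SYS", "SYS", "NODE", "NODE"],
    ["SYS", "SYS", "NODE", "PIX"],
    ["SYS", "SYS", "NODE", "NODE"],
]

# Smaller rank means a shorter path (assumed ordering, based on the legend).
RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}

def closest_nics(row):
    """Return the indices of the NIC(s) with the shortest link type in this row."""
    best = min(RANK[link] for link in row)
    return [i for i, link in enumerate(row) if RANK[link] == best]

for gpu, row in enumerate(GPU_NIC):
    print(f"GPU{gpu}: closest NIC(s) {closest_nics(row)}")
```

Under this reading, GPU1, GPU3, GPU4 and GPU6 each have one PIX-attached NIC, while the other GPUs reach two NICs at NODE distance. Whether a given PE-to-NIC mapping pairs each GPU with its closest NIC depends on how NVSHMEM orders PEs, which this sketch does not model.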

@sphish I analyzed this problem in my test environment. When I set these two environment variables (NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"), the program times out, and the timeout reproduces 100% of the time.

If I don't set these two environment variables, the program doesn't time out, but the bandwidth is only a little over 20 GB/s.
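
For reference, the `NVSHMEM_HCA_PE_MAPPING` value discussed here follows a `name:port:count` pattern repeated once per NIC. A minimal sketch that builds the same string in Python before launching the test is shown below. The helper name `hca_pe_mapping` is hypothetical, and reading the third field as "number of PEs assigned to that NIC" is my understanding of the NVSHMEM setting rather than something confirmed in this thread.

```python
import os

def hca_pe_mapping(nics, port=1, pes_per_nic=2):
    """Build a 'name:port:count' list, one entry per NIC (hypothetical helper)."""
    return ",".join(f"{nic}:{port}:{pes_per_nic}" for nic in nics)

# NIC names taken from the NIC legend earlier in this comment.
nics = [f"mlx5_bond_{i}" for i in range(4)]
mapping = hca_pe_mapping(nics)

# Both variables need to be set before NVSHMEM is initialized in this process.
os.environ["NVSHMEM_ENABLE_NIC_PE_MAPPING"] = "1"
os.environ["NVSHMEM_HCA_PE_MAPPING"] = mapping
print(mapping)
```

Generating the string this way avoids typos when the number of NICs or PEs per NIC differs between nodes.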

config: test_ll_compatibility=True
export NVSHMEM_ENABLE_NIC_PE_MAPPING=1
export NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"

test result:


[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 18.57 GB/s (RDMA), 60.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 37.50 GB/s (RDMA), 122.79 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 41.02 GB/s (RDMA), 134.32 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 41.88 GB/s (RDMA), 137.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 39.79 GB/s (RDMA), 130.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 41.49 GB/s (RDMA), 135.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 40.73 GB/s (RDMA), 133.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 41.08 GB/s (RDMA), 134.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 17.82 GB/s (RDMA), 58.35 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 36.30 GB/s (RDMA), 118.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 41.62 GB/s (RDMA), 136.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 41.98 GB/s (RDMA), 137.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 42.10 GB/s (RDMA), 137.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 41.07 GB/s (RDMA), 134.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 41.75 GB/s (RDMA), 136.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 40.87 GB/s (RDMA), 133.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 17.91 GB/s (RDMA), 58.64 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 35.96 GB/s (RDMA), 117.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 42.50 GB/s (RDMA), 139.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 42.35 GB/s (RDMA), 138.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 42.00 GB/s (RDMA), 137.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 40.97 GB/s (RDMA), 134.18 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 41.41 GB/s (RDMA), 135.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 40.30 GB/s (RDMA), 131.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 17.86 GB/s (RDMA), 58.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 35.70 GB/s (RDMA), 116.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 42.54 GB/s (RDMA), 139.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 42.18 GB/s (RDMA), 138.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 41.95 GB/s (RDMA), 137.38 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 41.59 GB/s (RDMA), 136.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 41.31 GB/s (RDMA), 135.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 40.84 GB/s (RDMA), 133.74 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 18.29 GB/s (RDMA), 59.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 36.32 GB/s (RDMA), 118.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 42.53 GB/s (RDMA), 139.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 42.30 GB/s (RDMA), 138.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 41.94 GB/s (RDMA), 137.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 41.54 GB/s (RDMA), 136.02 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 41.34 GB/s (RDMA), 135.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 40.89 GB/s (RDMA), 133.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 17.86 GB/s (RDMA), 58.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 35.49 GB/s (RDMA), 116.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 41.48 GB/s (RDMA), 135.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 41.61 GB/s (RDMA), 136.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 41.89 GB/s (RDMA), 137.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 41.58 GB/s (RDMA), 136.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 41.01 GB/s (RDMA), 134.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 40.21 GB/s (RDMA), 131.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 17.75 GB/s (RDMA), 58.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 36.20 GB/s (RDMA), 118.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 42.47 GB/s (RDMA), 139.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 41.26 GB/s (RDMA), 135.10 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 41.98 GB/s (RDMA), 137.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 41.58 GB/s (RDMA), 136.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 41.26 GB/s (RDMA), 135.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 40.85 GB/s (RDMA), 133.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 18.23 GB/s (RDMA), 59.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 36.25 GB/s (RDMA), 118.70 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 42.53 GB/s (RDMA), 139.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 42.24 GB/s (RDMA), 138.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 41.95 GB/s (RDMA), 137.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 41.53 GB/s (RDMA), 136.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 41.32 GB/s (RDMA), 135.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 40.81 GB/s (RDMA), 133.64 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 16, RDMA chunk 12: 42.54 GB/s (RDMA), 139.30 GB/s (NVL)

[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 35.60 GB/s (RDMA), 116.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 44.63 GB/s (RDMA), 146.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 44.27 GB/s (RDMA), 144.97 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 44.40 GB/s (RDMA), 145.40 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 44.09 GB/s (RDMA), 144.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 44.00 GB/s (RDMA), 144.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 43.79 GB/s (RDMA), 143.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 43.44 GB/s (RDMA), 142.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 34.80 GB/s (RDMA), 113.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 44.26 GB/s (RDMA), 144.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 44.06 GB/s (RDMA), 144.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 44.04 GB/s (RDMA), 144.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 44.03 GB/s (RDMA), 144.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 43.87 GB/s (RDMA), 143.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 43.67 GB/s (RDMA), 143.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 43.34 GB/s (RDMA), 141.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 34.43 GB/s (RDMA), 112.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 44.46 GB/s (RDMA), 145.59 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 44.23 GB/s (RDMA), 144.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 44.16 GB/s (RDMA), 144.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 43.92 GB/s (RDMA), 143.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 43.68 GB/s (RDMA), 143.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 43.41 GB/s (RDMA), 142.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 43.21 GB/s (RDMA), 141.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 35.71 GB/s (RDMA), 116.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 44.70 GB/s (RDMA), 146.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 44.03 GB/s (RDMA), 144.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 44.13 GB/s (RDMA), 144.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 43.81 GB/s (RDMA), 143.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 43.54 GB/s (RDMA), 142.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 43.29 GB/s (RDMA), 141.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 43.02 GB/s (RDMA), 140.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 35.72 GB/s (RDMA), 116.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 44.68 GB/s (RDMA), 146.32 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 43.73 GB/s (RDMA), 143.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 44.18 GB/s (RDMA), 144.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 43.82 GB/s (RDMA), 143.50 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 43.50 GB/s (RDMA), 142.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 43.24 GB/s (RDMA), 141.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 43.05 GB/s (RDMA), 140.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 33.22 GB/s (RDMA), 108.80 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 44.63 GB/s (RDMA), 146.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 44.19 GB/s (RDMA), 144.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 44.13 GB/s (RDMA), 144.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 43.79 GB/s (RDMA), 143.41 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 43.51 GB/s (RDMA), 142.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 43.23 GB/s (RDMA), 141.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 43.23 GB/s (RDMA), 141.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 34.83 GB/s (RDMA), 114.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 44.66 GB/s (RDMA), 146.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 44.20 GB/s (RDMA), 144.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 44.16 GB/s (RDMA), 144.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 43.83 GB/s (RDMA), 143.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 43.55 GB/s (RDMA), 142.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 43.25 GB/s (RDMA), 141.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 43.27 GB/s (RDMA), 141.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 34.91 GB/s (RDMA), 114.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 44.66 GB/s (RDMA), 146.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 44.26 GB/s (RDMA), 144.93 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 44.16 GB/s (RDMA), 144.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 43.83 GB/s (RDMA), 143.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 43.50 GB/s (RDMA), 142.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 43.23 GB/s (RDMA), 141.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 43.17 GB/s (RDMA), 141.38 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 16, RDMA chunk 8: 44.70 GB/s (RDMA), 146.37 GB/s (NVL)

[tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 42.17 GB/s (RDMA), 138.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 43.48 GB/s (RDMA), 142.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 43.14 GB/s (RDMA), 141.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 42.82 GB/s (RDMA), 140.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 42.39 GB/s (RDMA), 138.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 42.03 GB/s (RDMA), 137.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 32: 41.81 GB/s (RDMA), 136.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 8: 43.48 GB/s (RDMA), 142.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 12: 43.45 GB/s (RDMA), 142.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 16: 43.25 GB/s (RDMA), 141.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 20: 42.93 GB/s (RDMA), 140.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 24: 42.56 GB/s (RDMA), 139.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 28: 42.23 GB/s (RDMA), 138.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 32: 42.02 GB/s (RDMA), 137.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 43.52 GB/s (RDMA), 142.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 16: 43.22 GB/s (RDMA), 141.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 20: 42.87 GB/s (RDMA), 140.38 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 24: 42.42 GB/s (RDMA), 138.90 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 28: 42.17 GB/s (RDMA), 138.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 32: 42.05 GB/s (RDMA), 137.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 43.08 GB/s (RDMA), 141.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 43.42 GB/s (RDMA), 142.18 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 43.08 GB/s (RDMA), 141.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 42.79 GB/s (RDMA), 140.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 42.47 GB/s (RDMA), 139.06 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 42.08 GB/s (RDMA), 137.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 42.00 GB/s (RDMA), 137.53 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)


[rank 0] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.74 us, min_t=168.26 us, max_t=179.33 us
[rank 5] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.72 us, min_t=167.87 us, max_t=176.86 us
[rank 3] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.94 us, min_t=167.78 us, max_t=177.95 us
[rank 1] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.97 us, min_t=166.18 us, max_t=178.37 us
[rank 6] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.74 us, min_t=167.10 us, max_t=177.66 us
[rank 7] Dispatch + combine bandwidth: 12.14 GB/s, avg_t=172.73 us, min_t=165.98 us, max_t=179.81 us
[rank 4] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.83 us, min_t=166.40 us, max_t=176.74 us
[rank 2] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.85 us, min_t=167.78 us, max_t=176.80 us
[rank 6] Dispatch bandwidth: 9.99 GB/s, avg_t=71.06 us | Combine bandwidth: 14.03 GB/s, avg_t=97.83 us
[rank 4] Dispatch bandwidth: 9.32 GB/s, avg_t=76.18 us | Combine bandwidth: 13.88 GB/s, avg_t=98.85 us
[rank 0] Dispatch bandwidth: 10.51 GB/s, avg_t=67.51 us | Combine bandwidth: 13.71 GB/s, avg_t=100.06 us
[rank 7] Dispatch bandwidth: 10.29 GB/s, avg_t=69.47 us | Combine bandwidth: 14.18 GB/s, avg_t=97.50 us
[rank 3] Dispatch bandwidth: 9.00 GB/s, avg_t=78.86 us | Combine bandwidth: 13.90 GB/s, avg_t=98.72 us
[rank 1] Dispatch bandwidth: 8.79 GB/s, avg_t=80.75 us | Combine bandwidth: 13.66 GB/s, avg_t=100.48 us
[rank 5] Dispatch bandwidth: 10.21 GB/s, avg_t=69.53 us | Combine bandwidth: 13.83 GB/s, avg_t=99.24 us
[rank 2] Dispatch bandwidth: 10.52 GB/s, avg_t=67.44 us | Combine bandwidth: 13.55 GB/s, avg_t=101.29 us
[rank 0] Dispatch send/recv time: 17.49 us | Combine send/recv time: 19.97 us
[rank 7] Dispatch send/recv time: 18.52 us | Combine send/recv time: 19.97 us
[rank 2] Dispatch send/recv time: 18.38 us | Combine send/recv time: 20.40 us
[rank 4] Dispatch send/recv time: 18.54 us | Combine send/recv time: 20.16 us
[rank 1] Dispatch send/recv time: 18.32 us | Combine send/recv time: 19.88 us
[rank 5] Dispatch send/recv time: 18.56 us | Combine send/recv time: 20.14 us
[rank 3] Dispatch send/recv time: 18.68 us | Combine send/recv time: 19.77 us
[rank 6] Dispatch send/recv time: 18.79 us | Combine send/recv time: 19.95 us
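
For comparing runs, the per-rank lines above can be reduced to a single average per direction. This is a minimal sketch that assumes the exact log format shown in this thread; the function name `summarize_rank_bandwidth` is illustrative and not part of DeepEP.

```python
import re
from statistics import mean

# Matches lines such as:
# [rank 6] Dispatch bandwidth: 9.99 GB/s, avg_t=71.06 us | Combine bandwidth: 14.03 GB/s, avg_t=97.83 us
# "Dispatch + combine bandwidth" lines do not match this pattern.
RANK_BW_RE = re.compile(
    r"\[rank (\d+)\] Dispatch bandwidth: ([\d.]+) GB/s, avg_t=([\d.]+) us \| "
    r"Combine bandwidth: ([\d.]+) GB/s, avg_t=([\d.]+) us"
)


def summarize_rank_bandwidth(lines):
    """Average the per-rank dispatch and combine bandwidth (GB/s)."""
    dispatch, combine = [], []
    for line in lines:
        m = RANK_BW_RE.match(line.strip())
        if m:
            dispatch.append(float(m.group(2)))
            combine.append(float(m.group(4)))
    if not dispatch:
        raise ValueError("no per-rank bandwidth lines found")
    return {
        "ranks": len(dispatch),
        "dispatch_gbps": mean(dispatch),
        "combine_gbps": mean(combine),
    }


sample_rank_lines = [
    "[rank 6] Dispatch bandwidth: 9.99 GB/s, avg_t=71.06 us | Combine bandwidth: 14.03 GB/s, avg_t=97.83 us",
    "[rank 4] Dispatch bandwidth: 9.32 GB/s, avg_t=76.18 us | Combine bandwidth: 13.88 GB/s, avg_t=98.85 us",
    "[rank 0] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.74 us, min_t=168.26 us, max_t=179.33 us",
]
rank_summary = summarize_rank_bandwidth(sample_rank_lines)
print(rank_summary)
```

Running it on the full output of each node gives one number per direction, which is easier to compare between configurations than eight separate lines.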

config:
test_ll_compatibility=True
export NVSHMEM_HCA_LIST="mlx5_bond_0:1,mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1"

test result:


[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 11.47 GB/s (RDMA), 37.83 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 16.59 GB/s (RDMA), 54.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 19.27 GB/s (RDMA), 63.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 20.84 GB/s (RDMA), 68.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 22.06 GB/s (RDMA), 72.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 23.05 GB/s (RDMA), 76.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 23.40 GB/s (RDMA), 77.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 23.92 GB/s (RDMA), 78.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 11.48 GB/s (RDMA), 37.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 16.56 GB/s (RDMA), 54.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 19.19 GB/s (RDMA), 63.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 20.73 GB/s (RDMA), 68.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 21.96 GB/s (RDMA), 72.45 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 23.11 GB/s (RDMA), 76.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 23.56 GB/s (RDMA), 77.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 23.86 GB/s (RDMA), 78.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 16.53 GB/s (RDMA), 54.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 19.16 GB/s (RDMA), 63.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 20.56 GB/s (RDMA), 67.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 21.98 GB/s (RDMA), 72.50 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 23.15 GB/s (RDMA), 76.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 23.53 GB/s (RDMA), 77.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 23.87 GB/s (RDMA), 78.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 16.59 GB/s (RDMA), 54.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 19.19 GB/s (RDMA), 63.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 20.57 GB/s (RDMA), 67.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 22.00 GB/s (RDMA), 72.59 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 23.07 GB/s (RDMA), 76.11 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 23.51 GB/s (RDMA), 77.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 23.85 GB/s (RDMA), 78.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 11.48 GB/s (RDMA), 37.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 16.53 GB/s (RDMA), 54.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 19.18 GB/s (RDMA), 63.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 20.54 GB/s (RDMA), 67.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 22.02 GB/s (RDMA), 72.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 23.06 GB/s (RDMA), 76.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 23.53 GB/s (RDMA), 77.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 11.45 GB/s (RDMA), 37.77 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 16.54 GB/s (RDMA), 54.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 19.18 GB/s (RDMA), 63.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 20.53 GB/s (RDMA), 67.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 21.79 GB/s (RDMA), 71.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 23.06 GB/s (RDMA), 76.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 23.51 GB/s (RDMA), 77.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 23.78 GB/s (RDMA), 78.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 11.46 GB/s (RDMA), 37.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 16.54 GB/s (RDMA), 54.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 19.18 GB/s (RDMA), 63.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 20.57 GB/s (RDMA), 67.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 22.07 GB/s (RDMA), 72.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 23.00 GB/s (RDMA), 75.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 23.59 GB/s (RDMA), 77.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 23.88 GB/s (RDMA), 78.78 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 16.57 GB/s (RDMA), 54.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 19.20 GB/s (RDMA), 63.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 20.54 GB/s (RDMA), 67.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 22.03 GB/s (RDMA), 72.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 23.04 GB/s (RDMA), 75.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 23.51 GB/s (RDMA), 77.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 23.84 GB/s (RDMA), 78.64 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL)

[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 16.40 GB/s (RDMA), 54.11 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 20.66 GB/s (RDMA), 68.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 22.89 GB/s (RDMA), 75.51 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 24.60 GB/s (RDMA), 81.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 24.64 GB/s (RDMA), 81.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 24.58 GB/s (RDMA), 81.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 24.53 GB/s (RDMA), 80.93 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 24.50 GB/s (RDMA), 80.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 16.39 GB/s (RDMA), 54.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 20.72 GB/s (RDMA), 68.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 22.86 GB/s (RDMA), 75.40 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 24.58 GB/s (RDMA), 81.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 24.64 GB/s (RDMA), 81.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 24.61 GB/s (RDMA), 81.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 24.52 GB/s (RDMA), 80.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 24.45 GB/s (RDMA), 80.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 16.37 GB/s (RDMA), 54.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 20.71 GB/s (RDMA), 68.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 22.86 GB/s (RDMA), 75.43 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 24.60 GB/s (RDMA), 81.15 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 24.58 GB/s (RDMA), 81.10 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 24.48 GB/s (RDMA), 80.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 24.51 GB/s (RDMA), 80.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 24.44 GB/s (RDMA), 80.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 16.38 GB/s (RDMA), 54.03 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 20.71 GB/s (RDMA), 68.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 22.90 GB/s (RDMA), 75.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 24.62 GB/s (RDMA), 81.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 24.55 GB/s (RDMA), 81.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 24.48 GB/s (RDMA), 80.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 24.48 GB/s (RDMA), 80.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 24.45 GB/s (RDMA), 80.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 16.38 GB/s (RDMA), 54.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 20.70 GB/s (RDMA), 68.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 22.86 GB/s (RDMA), 75.42 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 24.58 GB/s (RDMA), 81.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 24.59 GB/s (RDMA), 81.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 24.44 GB/s (RDMA), 80.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 24.46 GB/s (RDMA), 80.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 24.41 GB/s (RDMA), 80.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 16.38 GB/s (RDMA), 54.03 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 20.72 GB/s (RDMA), 68.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 22.87 GB/s (RDMA), 75.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 24.63 GB/s (RDMA), 81.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 24.52 GB/s (RDMA), 80.90 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 24.50 GB/s (RDMA), 80.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 24.46 GB/s (RDMA), 80.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 24.42 GB/s (RDMA), 80.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 16.36 GB/s (RDMA), 53.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 20.69 GB/s (RDMA), 68.26 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 22.88 GB/s (RDMA), 75.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 24.63 GB/s (RDMA), 81.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 24.52 GB/s (RDMA), 80.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 24.39 GB/s (RDMA), 80.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 24.50 GB/s (RDMA), 80.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 24.42 GB/s (RDMA), 80.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 16.43 GB/s (RDMA), 54.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 20.72 GB/s (RDMA), 68.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 22.88 GB/s (RDMA), 75.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 24.63 GB/s (RDMA), 81.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 24.62 GB/s (RDMA), 81.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 24.47 GB/s (RDMA), 80.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 24.42 GB/s (RDMA), 80.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 24.35 GB/s (RDMA), 80.34 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 4, RDMA chunk 20: 24.64 GB/s (RDMA), 81.30 GB/s (NVL)

[tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 20.47 GB/s (RDMA), 67.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 23.02 GB/s (RDMA), 75.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 24.48 GB/s (RDMA), 80.77 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 24.41 GB/s (RDMA), 80.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 24.30 GB/s (RDMA), 80.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 24.20 GB/s (RDMA), 79.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 32: 24.16 GB/s (RDMA), 79.69 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 8: 20.81 GB/s (RDMA), 68.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 12: 23.00 GB/s (RDMA), 75.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 16: 24.48 GB/s (RDMA), 80.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 20: 24.44 GB/s (RDMA), 80.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 24: 24.36 GB/s (RDMA), 80.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 28: 24.27 GB/s (RDMA), 80.05 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 32: 24.24 GB/s (RDMA), 79.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 20.81 GB/s (RDMA), 68.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 22.98 GB/s (RDMA), 75.79 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 16: 24.43 GB/s (RDMA), 80.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 20: 24.43 GB/s (RDMA), 80.59 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 24: 24.35 GB/s (RDMA), 80.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 28: 24.24 GB/s (RDMA), 79.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 32: 24.22 GB/s (RDMA), 79.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 20.85 GB/s (RDMA), 68.78 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 22.96 GB/s (RDMA), 75.74 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 24.42 GB/s (RDMA), 80.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 24.44 GB/s (RDMA), 80.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 24.32 GB/s (RDMA), 80.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 24.22 GB/s (RDMA), 79.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 24.20 GB/s (RDMA), 79.84 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 16: 24.48 GB/s (RDMA), 80.77 GB/s (NVL)
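
The sweeps above are long, so a small parser can help when comparing them across configurations. The following is a sketch that assumes the `[tuning]` line format printed in this thread; `parse_tuning` and `best_by_rdma` are hypothetical helper names, not DeepEP functions. It should be fed one sweep at a time (FP8 dispatch, BF16 dispatch, or combine), because the sweep lines themselves do not say which phase they belong to.

```python
import re

# Matches sweep lines such as:
# [tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 16.59 GB/s (RDMA), 54.72 GB/s (NVL)
# The "[tuning] Best ..." summary lines deliberately do not match.
TUNING_RE = re.compile(
    r"\[tuning\] SMs (\d+), NVL chunk (\d+), RDMA chunk (\d+): "
    r"([\d.]+) GB/s \(RDMA\), ([\d.]+) GB/s \(NVL\)"
)


def parse_tuning(lines):
    """Return one dict per sweep line, in log order."""
    rows = []
    for line in lines:
        m = TUNING_RE.match(line.strip())
        if m:
            sms, nvl_chunk, rdma_chunk = (int(g) for g in m.group(1, 2, 3))
            rows.append({
                "sms": sms,
                "nvl_chunk": nvl_chunk,
                "rdma_chunk": rdma_chunk,
                "rdma_gbps": float(m.group(4)),
                "nvl_gbps": float(m.group(5)),
            })
    return rows


def best_by_rdma(rows):
    """Pick the row with the highest RDMA bandwidth (first one wins on ties)."""
    return max(rows, key=lambda r: r["rdma_gbps"])


sample_tuning_lines = [
    "[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 11.47 GB/s (RDMA), 37.83 GB/s (NVL)",
    "[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 23.92 GB/s (RDMA), 78.89 GB/s (NVL)",
    "[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL)",
    "[tuning] Best dispatch (FP8): SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL)",
]
tuning_rows = parse_tuning(sample_tuning_lines)
best_row = best_by_rdma(tuning_rows)
print(best_row)
```

On the three sample sweep lines, the selected row is the same configuration that the log reports as "Best dispatch (FP8)".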

@whybeyoung Have you also configured these two environment variables? NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"

Can you show me your machine environment? Please post the output of these two commands: nvidia-smi topo -m and ibv_devinfo

alpha-baby May 06 '25 06:05

Hi there, could you please try this branch https://github.com/deepseek-ai/DeepEP/tree/try_fix_roce_mqp and see if it resolves the issue?

sphish May 06 '25 07:05

> Hi there, could you please try this branch https://github.com/deepseek-ai/DeepEP/tree/try_fix_roce_mqp and see if it resolves the issue?

Thank you for your help.

That branch still does not solve the problem; the same error is reported. I tried many times and it reproduces 100% of the time.

alpha-baby May 06 '25 13:05

My environment (two nodes):

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	SYS	SYS	0-47,96-143	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	PIX	NODE	SYS	SYS	0-47,96-143	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NODE	NODE	SYS	SYS	0-47,96-143	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NODE	PIX	SYS	SYS	0-47,96-143	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	PIX	NODE	48-95,144-191	1		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	NODE	NODE	48-95,144-191	1		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	NODE	PIX	48-95,144-191	1		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	NODE	NODE	48-95,144-191	1		N/A
NIC0	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	 X 	NODE	SYS	SYS
NIC1	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	NODE	 X 	SYS	SYS
NIC2	SYS	SYS	SYS	SYS	PIX	NODE	NODE	NODE	SYS	SYS	 X 	NODE
NIC3	SYS	SYS	SYS	SYS	NODE	NODE	PIX	NODE	SYS	SYS	NODE	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
hca_id:	mlx5_bond_0
	transport:			InfiniBand (0)
	fw_ver:				32.39.3804
	node_guid:			58a2:e103:00d5:28d4
	sys_image_guid:			58a2:e103:00d5:28d4
	vendor_id:			0x02c9
	vendor_part_id:			41692
	hw_ver:				0x1
	board_id:			MT_0000000884
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

hca_id:	mlx5_bond_1
	transport:			InfiniBand (0)
	fw_ver:				32.39.3804
	node_guid:			58a2:e103:00f7:6a90
	sys_image_guid:			58a2:e103:00f7:6a90
	vendor_id:			0x02c9
	vendor_part_id:			41692
	hw_ver:				0x1
	board_id:			MT_0000000884
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

hca_id:	mlx5_bond_2
	transport:			InfiniBand (0)
	fw_ver:				32.39.3804
	node_guid:			58a2:e103:00d8:061c
	sys_image_guid:			58a2:e103:00d8:061c
	vendor_id:			0x02c9
	vendor_part_id:			41692
	hw_ver:				0x1
	board_id:			MT_0000000884
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

hca_id:	mlx5_bond_3
	transport:			InfiniBand (0)
	fw_ver:				32.39.3804
	node_guid:			58a2:e103:00dd:eee6
	sys_image_guid:			58a2:e103:00dd:eee6
	vendor_id:			0x02c9
	vendor_part_id:			41692
	hw_ver:				0x1
	board_id:			MT_0000000884
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
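
The topology matrix above can be checked against a two-PEs-per-HCA mapping such as the one used in this thread. The sketch below transcribes the GPU-to-NIC columns by hand and ranks link types in the order of the nvidia-smi legend (PIX closest, SYS farthest). It assumes, without confirmation from the source, that PE i on a node runs on GPU i and that PEs are assigned to the listed HCAs contiguously. All helper names are illustrative.

```python
# GPU -> link type to NIC0..NIC3, transcribed from the nvidia-smi topo -m output above.
GPU_TO_NIC = {
    "GPU0": ["NODE", "NODE", "SYS", "SYS"],
    "GPU1": ["PIX", "NODE", "SYS", "SYS"],
    "GPU2": ["NODE", "NODE", "SYS", "SYS"],
    "GPU3": ["NODE", "PIX", "SYS", "SYS"],
    "GPU4": ["SYS", "SYS", "PIX", "NODE"],
    "GPU5": ["SYS", "SYS", "NODE", "NODE"],
    "GPU6": ["SYS", "SYS", "NODE", "PIX"],
    "GPU7": ["SYS", "SYS", "NODE", "NODE"],
}
NICS = ["mlx5_bond_0", "mlx5_bond_1", "mlx5_bond_2", "mlx5_bond_3"]

# Lower value = closer, following the order of the legend.
LINK_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}


def closest_nics(links):
    """Return every NIC that shares the best (lowest-rank) link type."""
    best = min(LINK_RANK[link] for link in links)
    return [NICS[i] for i, link in enumerate(links) if LINK_RANK[link] == best]


def contiguous_mapping(nics, pes_per_nic):
    """Assumed assignment: PEs are handed to NICs in list order, pes_per_nic each."""
    return [nic for nic in nics for _ in range(pes_per_nic)]


assigned = contiguous_mapping(NICS, 2)
mapping_ok = {}
for gpu_index, (gpu, links) in enumerate(GPU_TO_NIC.items()):
    near = closest_nics(links)
    mapping_ok[gpu] = assigned[gpu_index] in near
    print(gpu, "closest:", near, "assigned:", assigned[gpu_index])
```

Under those assumptions, every GPU is assigned a NIC from its closest set, so the mapping keeps traffic within the local NUMA node. This only checks locality; it says nothing about the cause of the timeout.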

@sphish I analyzed this problem in my test environment. When I set these two environment variables (NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"), the program times out, and this reproduces 100% of the time.

If I don't set these two environment variables, the program does not time out, but the bandwidth is only slightly above 20 GB/s.
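
To switch between the two setups reproducibly, the variables can be generated rather than typed by hand. This is a sketch only: it assumes the value format `name:port:count` seen in this thread, and my reading that the third field is the number of PEs served by each HCA is an assumption. The helper names are hypothetical.

```python
import os


def build_hca_pe_mapping(hcas, port=1, pes_per_hca=2):
    """Build a value in the 'name:port:count' form used in this thread."""
    return ",".join(f"{hca}:{port}:{pes_per_hca}" for hca in hcas)


def nvshmem_env(hcas, enable_mapping, pes_per_hca=2, base_env=None):
    """Return a copy of the environment with or without the NIC/PE mapping."""
    env = dict(os.environ if base_env is None else base_env)
    # Start from a clean slate so an earlier export does not leak into the run.
    env.pop("NVSHMEM_ENABLE_NIC_PE_MAPPING", None)
    env.pop("NVSHMEM_HCA_PE_MAPPING", None)
    if enable_mapping:
        env["NVSHMEM_ENABLE_NIC_PE_MAPPING"] = "1"
        env["NVSHMEM_HCA_PE_MAPPING"] = build_hca_pe_mapping(
            hcas, pes_per_hca=pes_per_hca
        )
    return env


bond_hcas = [f"mlx5_bond_{i}" for i in range(4)]
with_mapping = nvshmem_env(bond_hcas, enable_mapping=True, base_env={})
without_mapping = nvshmem_env(bond_hcas, enable_mapping=False, base_env={})
print(with_mapping["NVSHMEM_HCA_PE_MAPPING"])

# Possible use (not executed here):
#   import subprocess
#   subprocess.run(["python", "test_internode.py"],
#                  env=nvshmem_env(bond_hcas, enable_mapping=True))
```

Passing the environment explicitly to each launch makes it clear which of the two configurations a given log came from.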

config:
test_ll_compatibility=True
export NVSHMEM_ENABLE_NIC_PE_MAPPING=1
export NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"

test result:


[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 18.57 GB/s (RDMA), 60.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 37.50 GB/s (RDMA), 122.79 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 41.02 GB/s (RDMA), 134.32 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 41.88 GB/s (RDMA), 137.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 39.79 GB/s (RDMA), 130.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 41.49 GB/s (RDMA), 135.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 40.73 GB/s (RDMA), 133.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 41.08 GB/s (RDMA), 134.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 17.82 GB/s (RDMA), 58.35 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 36.30 GB/s (RDMA), 118.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 41.62 GB/s (RDMA), 136.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 41.98 GB/s (RDMA), 137.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 42.10 GB/s (RDMA), 137.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 41.07 GB/s (RDMA), 134.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 41.75 GB/s (RDMA), 136.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 40.87 GB/s (RDMA), 133.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 17.91 GB/s (RDMA), 58.64 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 35.96 GB/s (RDMA), 117.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 42.50 GB/s (RDMA), 139.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 42.35 GB/s (RDMA), 138.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 42.00 GB/s (RDMA), 137.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 40.97 GB/s (RDMA), 134.18 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 41.41 GB/s (RDMA), 135.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 40.30 GB/s (RDMA), 131.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 17.86 GB/s (RDMA), 58.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 35.70 GB/s (RDMA), 116.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 42.54 GB/s (RDMA), 139.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 42.18 GB/s (RDMA), 138.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 41.95 GB/s (RDMA), 137.38 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 41.59 GB/s (RDMA), 136.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 41.31 GB/s (RDMA), 135.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 40.84 GB/s (RDMA), 133.74 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 18.29 GB/s (RDMA), 59.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 36.32 GB/s (RDMA), 118.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 42.53 GB/s (RDMA), 139.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 42.30 GB/s (RDMA), 138.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 41.94 GB/s (RDMA), 137.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 41.54 GB/s (RDMA), 136.02 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 41.34 GB/s (RDMA), 135.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 40.89 GB/s (RDMA), 133.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 17.86 GB/s (RDMA), 58.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 35.49 GB/s (RDMA), 116.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 41.48 GB/s (RDMA), 135.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 41.61 GB/s (RDMA), 136.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 41.89 GB/s (RDMA), 137.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 41.58 GB/s (RDMA), 136.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 41.01 GB/s (RDMA), 134.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 40.21 GB/s (RDMA), 131.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 17.75 GB/s (RDMA), 58.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 36.20 GB/s (RDMA), 118.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 42.47 GB/s (RDMA), 139.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 41.26 GB/s (RDMA), 135.10 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 41.98 GB/s (RDMA), 137.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 41.58 GB/s (RDMA), 136.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 41.26 GB/s (RDMA), 135.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 40.85 GB/s (RDMA), 133.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 18.23 GB/s (RDMA), 59.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 36.25 GB/s (RDMA), 118.70 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 42.53 GB/s (RDMA), 139.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 42.24 GB/s (RDMA), 138.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 41.95 GB/s (RDMA), 137.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 41.53 GB/s (RDMA), 136.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 41.32 GB/s (RDMA), 135.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 40.81 GB/s (RDMA), 133.64 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 16, RDMA chunk 12: 42.54 GB/s (RDMA), 139.30 GB/s (NVL)

[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 35.60 GB/s (RDMA), 116.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 44.63 GB/s (RDMA), 146.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 44.27 GB/s (RDMA), 144.97 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 44.40 GB/s (RDMA), 145.40 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 44.09 GB/s (RDMA), 144.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 44.00 GB/s (RDMA), 144.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 43.79 GB/s (RDMA), 143.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 43.44 GB/s (RDMA), 142.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 34.80 GB/s (RDMA), 113.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 44.26 GB/s (RDMA), 144.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 44.06 GB/s (RDMA), 144.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 44.04 GB/s (RDMA), 144.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 44.03 GB/s (RDMA), 144.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 43.87 GB/s (RDMA), 143.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 43.67 GB/s (RDMA), 143.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 43.34 GB/s (RDMA), 141.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 34.43 GB/s (RDMA), 112.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 44.46 GB/s (RDMA), 145.59 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 44.23 GB/s (RDMA), 144.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 44.16 GB/s (RDMA), 144.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 43.92 GB/s (RDMA), 143.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 43.68 GB/s (RDMA), 143.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 43.41 GB/s (RDMA), 142.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 43.21 GB/s (RDMA), 141.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 35.71 GB/s (RDMA), 116.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 44.70 GB/s (RDMA), 146.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 44.03 GB/s (RDMA), 144.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 44.13 GB/s (RDMA), 144.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 43.81 GB/s (RDMA), 143.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 43.54 GB/s (RDMA), 142.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 43.29 GB/s (RDMA), 141.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 43.02 GB/s (RDMA), 140.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 35.72 GB/s (RDMA), 116.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 44.68 GB/s (RDMA), 146.32 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 43.73 GB/s (RDMA), 143.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 44.18 GB/s (RDMA), 144.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 43.82 GB/s (RDMA), 143.50 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 43.50 GB/s (RDMA), 142.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 43.24 GB/s (RDMA), 141.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 43.05 GB/s (RDMA), 140.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 33.22 GB/s (RDMA), 108.80 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 44.63 GB/s (RDMA), 146.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 44.19 GB/s (RDMA), 144.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 44.13 GB/s (RDMA), 144.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 43.79 GB/s (RDMA), 143.41 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 43.51 GB/s (RDMA), 142.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 43.23 GB/s (RDMA), 141.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 43.23 GB/s (RDMA), 141.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 34.83 GB/s (RDMA), 114.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 44.66 GB/s (RDMA), 146.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 44.20 GB/s (RDMA), 144.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 44.16 GB/s (RDMA), 144.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 43.83 GB/s (RDMA), 143.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 43.55 GB/s (RDMA), 142.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 43.25 GB/s (RDMA), 141.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 43.27 GB/s (RDMA), 141.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 34.91 GB/s (RDMA), 114.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 44.66 GB/s (RDMA), 146.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 44.26 GB/s (RDMA), 144.93 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 44.16 GB/s (RDMA), 144.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 43.83 GB/s (RDMA), 143.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 43.50 GB/s (RDMA), 142.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 43.23 GB/s (RDMA), 141.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 43.17 GB/s (RDMA), 141.38 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 16, RDMA chunk 8: 44.70 GB/s (RDMA), 146.37 GB/s (NVL)

[tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 42.17 GB/s (RDMA), 138.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 43.48 GB/s (RDMA), 142.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 43.14 GB/s (RDMA), 141.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 42.82 GB/s (RDMA), 140.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 42.39 GB/s (RDMA), 138.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 42.03 GB/s (RDMA), 137.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 32: 41.81 GB/s (RDMA), 136.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 8: 43.48 GB/s (RDMA), 142.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 12: 43.45 GB/s (RDMA), 142.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 16: 43.25 GB/s (RDMA), 141.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 20: 42.93 GB/s (RDMA), 140.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 24: 42.56 GB/s (RDMA), 139.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 28: 42.23 GB/s (RDMA), 138.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 32: 42.02 GB/s (RDMA), 137.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 43.52 GB/s (RDMA), 142.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 16: 43.22 GB/s (RDMA), 141.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 20: 42.87 GB/s (RDMA), 140.38 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 24: 42.42 GB/s (RDMA), 138.90 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 28: 42.17 GB/s (RDMA), 138.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 32: 42.05 GB/s (RDMA), 137.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 43.08 GB/s (RDMA), 141.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 43.42 GB/s (RDMA), 142.18 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 43.08 GB/s (RDMA), 141.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 42.79 GB/s (RDMA), 140.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 42.47 GB/s (RDMA), 139.06 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 42.08 GB/s (RDMA), 137.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 42.00 GB/s (RDMA), 137.53 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)
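
For anyone comparing runs across configurations, the tuning lines can be reduced to their best entry with a few lines of Python. This is only a sketch: the regular expression is written against the log format shown above, `best_config` is a hypothetical helper name, and it assumes the best entry is the one with the highest RDMA bandwidth (which agrees with the "Best combine" line here, but is not taken from the test script itself).

```python
import re

# Matches lines such as:
# [tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)
# Summary lines ("[tuning] Best combine: ...") do not match and are skipped.
TUNING_RE = re.compile(
    r"\[tuning\] SMs (\d+), NVL chunk (\d+), RDMA chunk (\d+): "
    r"([\d.]+) GB/s \(RDMA\), ([\d.]+) GB/s \(NVL\)"
)


def best_config(log_text):
    """Return (sms, nvl_chunk, rdma_chunk, rdma_gbps, nvl_gbps) with the
    highest RDMA bandwidth, or None if no tuning line is found."""
    rows = []
    for line in log_text.splitlines():
        m = TUNING_RE.match(line.strip())
        if m:
            sms, nvl, rdma = (int(g) for g in m.groups()[:3])
            rows.append((sms, nvl, rdma, float(m.group(4)), float(m.group(5))))
    if not rows:
        return None
    # Assumption: rank configurations by RDMA bandwidth (index 3).
    return max(rows, key=lambda r: r[3])


sample = """
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 43.52 GB/s (RDMA), 142.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 43.08 GB/s (RDMA), 141.07 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)
"""

print(best_config(sample))
```

Feeding it the full log of each run makes it easier to compare the roughly 43 GB/s and 24 GB/s results side by side.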


[rank 0] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.74 us, min_t=168.26 us, max_t=179.33 us
[rank 5] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.72 us, min_t=167.87 us, max_t=176.86 us
[rank 3] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.94 us, min_t=167.78 us, max_t=177.95 us
[rank 1] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.97 us, min_t=166.18 us, max_t=178.37 us
[rank 6] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.74 us, min_t=167.10 us, max_t=177.66 us
[rank 7] Dispatch + combine bandwidth: 12.14 GB/s, avg_t=172.73 us, min_t=165.98 us, max_t=179.81 us
[rank 4] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.83 us, min_t=166.40 us, max_t=176.74 us
[rank 2] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.85 us, min_t=167.78 us, max_t=176.80 us
[rank 6] Dispatch bandwidth: 9.99 GB/s, avg_t=71.06 us | Combine bandwidth: 14.03 GB/s, avg_t=97.83 us
[rank 4] Dispatch bandwidth: 9.32 GB/s, avg_t=76.18 us | Combine bandwidth: 13.88 GB/s, avg_t=98.85 us
[rank 0] Dispatch bandwidth: 10.51 GB/s, avg_t=67.51 us | Combine bandwidth: 13.71 GB/s, avg_t=100.06 us
[rank 7] Dispatch bandwidth: 10.29 GB/s, avg_t=69.47 us | Combine bandwidth: 14.18 GB/s, avg_t=97.50 us
[rank 3] Dispatch bandwidth: 9.00 GB/s, avg_t=78.86 us | Combine bandwidth: 13.90 GB/s, avg_t=98.72 us
[rank 1] Dispatch bandwidth: 8.79 GB/s, avg_t=80.75 us | Combine bandwidth: 13.66 GB/s, avg_t=100.48 us
[rank 5] Dispatch bandwidth: 10.21 GB/s, avg_t=69.53 us | Combine bandwidth: 13.83 GB/s, avg_t=99.24 us
[rank 2] Dispatch bandwidth: 10.52 GB/s, avg_t=67.44 us | Combine bandwidth: 13.55 GB/s, avg_t=101.29 us
[rank 0] Dispatch send/recv time: 17.49 us | Combine send/recv time: 19.97 us
[rank 7] Dispatch send/recv time: 18.52 us | Combine send/recv time: 19.97 us
[rank 2] Dispatch send/recv time: 18.38 us | Combine send/recv time: 20.40 us
[rank 4] Dispatch send/recv time: 18.54 us | Combine send/recv time: 20.16 us
[rank 1] Dispatch send/recv time: 18.32 us | Combine send/recv time: 19.88 us
[rank 5] Dispatch send/recv time: 18.56 us | Combine send/recv time: 20.14 us
[rank 3] Dispatch send/recv time: 18.68 us | Combine send/recv time: 19.77 us
[rank 6] Dispatch send/recv time: 18.79 us | Combine send/recv time: 19.95 us

config:

test_ll_compatibility=True
export NVSHMEM_HCA_LIST="mlx5_bond_0:1,mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1"

test result:


[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 11.47 GB/s (RDMA), 37.83 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 16.59 GB/s (RDMA), 54.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 19.27 GB/s (RDMA), 63.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 20.84 GB/s (RDMA), 68.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 22.06 GB/s (RDMA), 72.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 23.05 GB/s (RDMA), 76.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 23.40 GB/s (RDMA), 77.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 23.92 GB/s (RDMA), 78.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 11.48 GB/s (RDMA), 37.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 16.56 GB/s (RDMA), 54.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 19.19 GB/s (RDMA), 63.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 20.73 GB/s (RDMA), 68.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 21.96 GB/s (RDMA), 72.45 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 23.11 GB/s (RDMA), 76.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 23.56 GB/s (RDMA), 77.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 23.86 GB/s (RDMA), 78.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 16.53 GB/s (RDMA), 54.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 19.16 GB/s (RDMA), 63.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 20.56 GB/s (RDMA), 67.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 21.98 GB/s (RDMA), 72.50 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 23.15 GB/s (RDMA), 76.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 23.53 GB/s (RDMA), 77.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 23.87 GB/s (RDMA), 78.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 16.59 GB/s (RDMA), 54.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 19.19 GB/s (RDMA), 63.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 20.57 GB/s (RDMA), 67.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 22.00 GB/s (RDMA), 72.59 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 23.07 GB/s (RDMA), 76.11 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 23.51 GB/s (RDMA), 77.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 23.85 GB/s (RDMA), 78.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 11.48 GB/s (RDMA), 37.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 16.53 GB/s (RDMA), 54.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 19.18 GB/s (RDMA), 63.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 20.54 GB/s (RDMA), 67.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 22.02 GB/s (RDMA), 72.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 23.06 GB/s (RDMA), 76.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 23.53 GB/s (RDMA), 77.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 11.45 GB/s (RDMA), 37.77 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 16.54 GB/s (RDMA), 54.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 19.18 GB/s (RDMA), 63.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 20.53 GB/s (RDMA), 67.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 21.79 GB/s (RDMA), 71.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 23.06 GB/s (RDMA), 76.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 23.51 GB/s (RDMA), 77.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 23.78 GB/s (RDMA), 78.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 11.46 GB/s (RDMA), 37.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 16.54 GB/s (RDMA), 54.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 19.18 GB/s (RDMA), 63.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 20.57 GB/s (RDMA), 67.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 22.07 GB/s (RDMA), 72.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 23.00 GB/s (RDMA), 75.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 23.59 GB/s (RDMA), 77.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 23.88 GB/s (RDMA), 78.78 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 16.57 GB/s (RDMA), 54.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 19.20 GB/s (RDMA), 63.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 20.54 GB/s (RDMA), 67.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 22.03 GB/s (RDMA), 72.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 23.04 GB/s (RDMA), 75.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 23.51 GB/s (RDMA), 77.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 23.84 GB/s (RDMA), 78.64 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL)

[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 16.40 GB/s (RDMA), 54.11 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 20.66 GB/s (RDMA), 68.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 22.89 GB/s (RDMA), 75.51 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 24.60 GB/s (RDMA), 81.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 24.64 GB/s (RDMA), 81.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 24.58 GB/s (RDMA), 81.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 24.53 GB/s (RDMA), 80.93 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 24.50 GB/s (RDMA), 80.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 16.39 GB/s (RDMA), 54.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 20.72 GB/s (RDMA), 68.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 22.86 GB/s (RDMA), 75.40 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 24.58 GB/s (RDMA), 81.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 24.64 GB/s (RDMA), 81.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 24.61 GB/s (RDMA), 81.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 24.52 GB/s (RDMA), 80.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 24.45 GB/s (RDMA), 80.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 16.37 GB/s (RDMA), 54.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 20.71 GB/s (RDMA), 68.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 22.86 GB/s (RDMA), 75.43 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 24.60 GB/s (RDMA), 81.15 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 24.58 GB/s (RDMA), 81.10 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 24.48 GB/s (RDMA), 80.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 24.51 GB/s (RDMA), 80.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 24.44 GB/s (RDMA), 80.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 16.38 GB/s (RDMA), 54.03 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 20.71 GB/s (RDMA), 68.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 22.90 GB/s (RDMA), 75.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 24.62 GB/s (RDMA), 81.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 24.55 GB/s (RDMA), 81.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 24.48 GB/s (RDMA), 80.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 24.48 GB/s (RDMA), 80.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 24.45 GB/s (RDMA), 80.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 16.38 GB/s (RDMA), 54.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 20.70 GB/s (RDMA), 68.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 22.86 GB/s (RDMA), 75.42 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 24.58 GB/s (RDMA), 81.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 24.59 GB/s (RDMA), 81.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 24.44 GB/s (RDMA), 80.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 24.46 GB/s (RDMA), 80.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 24.41 GB/s (RDMA), 80.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 16.38 GB/s (RDMA), 54.03 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 20.72 GB/s (RDMA), 68.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 22.87 GB/s (RDMA), 75.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 24.63 GB/s (RDMA), 81.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 24.52 GB/s (RDMA), 80.90 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 24.50 GB/s (RDMA), 80.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 24.46 GB/s (RDMA), 80.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 24.42 GB/s (RDMA), 80.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 16.36 GB/s (RDMA), 53.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 20.69 GB/s (RDMA), 68.26 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 22.88 GB/s (RDMA), 75.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 24.63 GB/s (RDMA), 81.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 24.52 GB/s (RDMA), 80.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 24.39 GB/s (RDMA), 80.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 24.50 GB/s (RDMA), 80.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 24.42 GB/s (RDMA), 80.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 16.43 GB/s (RDMA), 54.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 20.72 GB/s (RDMA), 68.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 22.88 GB/s (RDMA), 75.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 24.63 GB/s (RDMA), 81.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 24.62 GB/s (RDMA), 81.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 24.47 GB/s (RDMA), 80.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 24.42 GB/s (RDMA), 80.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 24.35 GB/s (RDMA), 80.34 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 4, RDMA chunk 20: 24.64 GB/s (RDMA), 81.30 GB/s (NVL)

[tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 20.47 GB/s (RDMA), 67.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 23.02 GB/s (RDMA), 75.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 24.48 GB/s (RDMA), 80.77 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 24.41 GB/s (RDMA), 80.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 24.30 GB/s (RDMA), 80.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 24.20 GB/s (RDMA), 79.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 32: 24.16 GB/s (RDMA), 79.69 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 8: 20.81 GB/s (RDMA), 68.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 12: 23.00 GB/s (RDMA), 75.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 16: 24.48 GB/s (RDMA), 80.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 20: 24.44 GB/s (RDMA), 80.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 24: 24.36 GB/s (RDMA), 80.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 28: 24.27 GB/s (RDMA), 80.05 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 32: 24.24 GB/s (RDMA), 79.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 20.81 GB/s (RDMA), 68.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 22.98 GB/s (RDMA), 75.79 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 16: 24.43 GB/s (RDMA), 80.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 20: 24.43 GB/s (RDMA), 80.59 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 24: 24.35 GB/s (RDMA), 80.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 28: 24.24 GB/s (RDMA), 79.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 32: 24.22 GB/s (RDMA), 79.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 20.85 GB/s (RDMA), 68.78 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 22.96 GB/s (RDMA), 75.74 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 24.42 GB/s (RDMA), 80.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 24.44 GB/s (RDMA), 80.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 24.32 GB/s (RDMA), 80.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 24.22 GB/s (RDMA), 79.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 24.20 GB/s (RDMA), 79.84 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 16: 24.48 GB/s (RDMA), 80.77 GB/s (NVL)

@whybeyoung Have you also configured these two environment variables?

NVSHMEM_ENABLE_NIC_PE_MAPPING=1
NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
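
To avoid typos when adapting the mapping to a different NIC count, the string can be generated. The sketch below assumes each entry has the form `<hca>:<port>:<number of PEs>`, so `:1:2` would mean port 1 serving 2 PEs; `hca_pe_mapping` is a hypothetical helper, not part of NVSHMEM or DeepEP.

```python
def hca_pe_mapping(hcas, port, pes_per_hca):
    """Build an NVSHMEM_HCA_PE_MAPPING value.

    Assumption: each entry is "<hca>:<port>:<number of PEs>".
    """
    return ",".join(f"{hca}:{port}:{pes_per_hca}" for hca in hcas)


# 8 GPUs sharing 4 bonded NICs gives 2 PEs per NIC.
hcas = [f"mlx5_bond_{i}" for i in range(4)]
mapping = hca_pe_mapping(hcas, port=1, pes_per_hca=8 // len(hcas))

print("export NVSHMEM_ENABLE_NIC_PE_MAPPING=1")
print(f'export NVSHMEM_HCA_PE_MAPPING="{mapping}"')
```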

Can you show me your machine environment? Commands:

nvidia-smi topo -m
ibv_devinfo

[root@maas-h20-007 ~]# nvidia-smi topo -m 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS     SYS     0-47,96-143     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    SYS     SYS     0-47,96-143     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS     SYS     0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    PIX     SYS     SYS     0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     PIX     NODE    48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     NODE    NODE    48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     NODE    PIX     48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     NODE    NODE    48-95,144-191   1               N/A
NIC0    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    SYS     SYS
NIC1    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS      X      NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3

[root@maas-h20-007 ~]# ibv_devinfo 
hca_id: mlx5_bond_0
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3920
        node_guid:                      e09d:7303:0074:6f94
        sys_image_guid:                 e09d:7303:0074:6f94
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_1
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3920
        node_guid:                      e09d:7303:0074:038c
        sys_image_guid:                 e09d:7303:0074:038c
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_2
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3920
        node_guid:                      e09d:7303:0095:135e
        sys_image_guid:                 e09d:7303:0095:135e
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_3
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3920
        node_guid:                      e09d:7303:0074:03b8
        sys_image_guid:                 e09d:7303:0074:03b8
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
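
The GPU-to-NIC part of the topology matrix above can be summarized programmatically. The sketch below hard-codes the eight GPU rows copied from the `nvidia-smi topo -m` output and ranks NICs by the link types in the legend; the function names are hypothetical, and treating every non-`SYS` link as "same NUMA node" is an assumption based on the legend text.

```python
# Link types from the nvidia-smi legend, ordered from closest to farthest.
LINK_ORDER = ["PIX", "PXB", "PHB", "NODE", "SYS"]

NICS = ["mlx5_bond_0", "mlx5_bond_1", "mlx5_bond_2", "mlx5_bond_3"]

# NIC0..NIC3 columns for each GPU row, copied from the matrix above.
GPU_TO_NIC = {
    "GPU0": ["NODE", "NODE", "SYS", "SYS"],
    "GPU1": ["PIX", "NODE", "SYS", "SYS"],
    "GPU2": ["NODE", "NODE", "SYS", "SYS"],
    "GPU3": ["NODE", "PIX", "SYS", "SYS"],
    "GPU4": ["SYS", "SYS", "PIX", "NODE"],
    "GPU5": ["SYS", "SYS", "NODE", "NODE"],
    "GPU6": ["SYS", "SYS", "NODE", "PIX"],
    "GPU7": ["SYS", "SYS", "NODE", "NODE"],
}


def nics_by_distance(gpu):
    """Return [(nic_name, link_type), ...] sorted from closest to farthest."""
    links = GPU_TO_NIC[gpu]
    # sorted() is stable, so NICs with the same link type keep index order.
    order = sorted(range(len(NICS)), key=lambda i: LINK_ORDER.index(links[i]))
    return [(NICS[i], links[i]) for i in order]


def same_numa_nics(gpu):
    """NICs reachable without crossing the inter-NUMA (SYS) interconnect."""
    return [nic for nic, link in nics_by_distance(gpu) if link != "SYS"]


for gpu in GPU_TO_NIC:
    print(gpu, "closest:", nics_by_distance(gpu)[0], "same NUMA:", same_numa_nics(gpu))
```

On this machine only GPU1, GPU3, GPU4, and GPU6 have a `PIX` path to a NIC; the other four GPUs reach their nearest NIC over a `NODE` link.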

whybeyoung avatar May 06 '25 17:05 whybeyoung

I found a way to fix the problem.

config env:

NVSHMEM_ENABLE_NIC_PE_MAPPING=1
NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"

Hack in NVSHMEM:

int nvshmemi_setup_connections(nvshmemi_state_t *state) {
    int status = 0;
    nvshmem_transport_t *transports = (nvshmem_transport_t *)state->transports;
    nvshmem_transport_t tcurr;
    int savedDev = 0;
    cudaError_t ret = cudaSuccess;

    for (int i = 0; i < state->num_initialized_transports; i++) {
        if (!((state->transport_bitmap) & (1 << i))) continue;
        tcurr = transports[i];

        if (!(tcurr->attr & NVSHMEM_TRANSPORT_ATTR_CONNECTED)) {
            continue;
        }

        int devices_temp = tcurr->n_devices / state->npes_node;
        if (devices_temp == 0) devices_temp = 1;
        const int max_devices_per_pe = devices_temp;
        int selected_devices[max_devices_per_pe];
        int found_devices = 0;

        for (int j = 0; j < max_devices_per_pe; j++) {
            selected_devices[j] = -1;
        }

        // assumes symmetry of transport list at all PEs
        if (tcurr->n_devices <= 1) {
            /* return the index of the first available device.
             * -1 if no devices found.
             */
            selected_devices[0] = tcurr->n_devices - 1;
            found_devices++;
        } else if (nvshmemi_options.ENABLE_NIC_PE_MAPPING) {
            selected_devices[0] =
                nvshmemi_state->mype_node % (tcurr->n_devices > 0 ? tcurr->n_devices : 1);
            ret = cudaGetDevice(&savedDev);
            if (ret != cudaSuccess) {
                status = -3;
                goto out;
            }

            // Fix: pick the NIC from the current CUDA device index; this overrides the mype_node-based choice above.
            selected_devices[0] = savedDev % (tcurr->n_devices > 0 ? tcurr->n_devices : 1);
            INFO(NVSHMEM_INIT,
                    "pid:[%d] NVSHMEM_ENABLE_NIC_PE_MAPPING = 1, savedDev: %d, setting dev_id = %d", getpid(), savedDev, selected_devices[0]);
            INFO(NVSHMEM_INIT, "NVSHMEM_ENABLE_NIC_PE_MAPPING = 1, setting dev_id = %d",
                 selected_devices[0]);
            found_devices++;
        } else {
            nvshmemi_get_devices_by_distance(selected_devices, max_devices_per_pe, tcurr);
            for (int i = 0; i < max_devices_per_pe; i++) {
                if (selected_devices[i] == -1) {
                    break;
                }
                found_devices++;
                INFO(NVSHMEM_INIT,
                     "NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device %d setting dev_id = %d", i,
                     selected_devices[i]);
            }
        }

        /* setting n_devices to 0 is the transports way of
         * letting us know it's managing devices internally.
         */
        if (tcurr->n_devices > 0 && selected_devices[0] == -1) {
            NVSHMEMI_ERROR_JMP(status, NVSHMEMX_ERROR_INTERNAL, out, "No devices selected.\n");
        }

        status = tcurr->host_ops.connect_endpoints(tcurr, selected_devices, found_devices);
        NVSHMEMI_NZ_ERROR_JMP(status, NVSHMEMX_ERROR_INTERNAL, out, "connect EPS failed \n");
        status = nvshmemi_boot_handle.barrier(&nvshmemi_boot_handle);
        NVSHMEMI_NZ_ERROR_JMP(status, NVSHMEMX_ERROR_INTERNAL, out, "barrier failed \n");

        status = nvshmemi_update_device_state();
    }

out:
    return status;
}
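
The only functional change in the patch above is the index fed into the modulo: the current CUDA device instead of `mype_node`. The arithmetic can be checked in isolation with the sketch below. `nic_index` is a hypothetical helper that mirrors the `x % (n_devices > 0 ? n_devices : 1)` expression, and the reversed rank-to-device layout is an invented example of a case where the two choices would differ, not something reported in this thread.

```python
def nic_index(index, n_devices):
    """Mirror of `index % (n_devices > 0 ? n_devices : 1)` from the patch."""
    return index % (n_devices if n_devices > 0 else 1)


n_devices = 4  # four bonded NICs, as in the ibv_devinfo output above

# When local rank i uses CUDA device i, both variants select the same NIC.
for cuda_dev in range(8):
    print(f"cuda device {cuda_dev} -> NIC index {nic_index(cuda_dev, n_devices)}")

# Hypothetical layout: local rank r runs on CUDA device 7 - r.
rank_to_device = {rank: 7 - rank for rank in range(8)}
differs = [
    rank
    for rank, dev in rank_to_device.items()
    if nic_index(rank, n_devices) != nic_index(dev, n_devices)
]
print("ranks whose NIC choice changes with the patch:", differs)
```

If every process already uses the CUDA device equal to its local rank, the patch should select the same NICs as the unpatched code, which may be one reason the results reported by different users diverge.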

alpha-baby avatar May 08 '25 09:05 alpha-baby

@alpha-baby Hi, I'm not very familiar with bonded NICs. Would it be possible to add me on WeChat: Sphizzz? I have a few questions I'd like to ask.

sphish avatar May 09 '25 07:05 sphish

@alpha-baby Your solution (the two environment variables plus the nvshmemi_setup_connections hack above) does not work for me.

whybeyoung avatar May 11 '25 10:05 whybeyoung

Hi there, could you please try this branch https://github.com/deepseek-ai/DeepEP/tree/try_fix_roce_mqp and see if it resolves the issue?

Yes, this resolves my test timeout problem.

whybeyoung avatar May 11 '25 10:05 whybeyoung

https://github.com/deepseek-ai/DeepEP/tree/try_fix_roce_mqp

@sphish May I ask if this change will be incorporated into the main branch?

polarstormx avatar May 13 '25 09:05 polarstormx

@polarstormx Have you tested this modification and confirmed its effectiveness? Since I haven't been able to reproduce this issue internally, I need to collect some feedback.

sphish avatar May 14 '25 01:05 sphish

I found a way to fix the problem.

config env:

NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2" hack in nvshmem :

int nvshmemi_setup_connections(nvshmemi_state_t *state) { int status = 0; nvshmem_transport_t *transports = (nvshmem_transport_t *)state->transports; nvshmem_transport_t tcurr; int savedDev = 0; cudaError_t ret = cudaSuccess;

for (int i = 0; i < state->num_initialized_transports; i++) {
    if (!((state->transport_bitmap) & (1 << i))) continue;
    tcurr = transports[i];

    if (!(tcurr->attr & NVSHMEM_TRANSPORT_ATTR_CONNECTED)) {
        continue;
    }

    int devices_temp = tcurr->n_devices / state->npes_node;
    if (devices_temp == 0) devices_temp = 1;
    const int max_devices_per_pe = devices_temp;
    int selected_devices[max_devices_per_pe];
    int found_devices = 0;

    for (int j = 0; j < max_devices_per_pe; j++) {
        selected_devices[j] = -1;
    }

    // assumes symmetry of transport list at all PEs
    if (tcurr->n_devices <= 1) {
        /* return the index of the first available device.
         * -1 if no devices found.
         */
        selected_devices[0] = tcurr->n_devices - 1;
        found_devices++;
    } else if (nvshmemi_options.ENABLE_NIC_PE_MAPPING) {
        selected_devices[0] =
            nvshmemi_state->mype_node % (tcurr->n_devices > 0 ? tcurr->n_devices : 1);
        ret = cudaGetDevice(&savedDev);
        if (ret != cudaSuccess) {
            status = -3;
            goto out;
        }

        selected_devices[0]  = savedDev % (tcurr->n_devices > 0 ? tcurr->n_devices : 1); // fix in here
        INFO(NVSHMEM_INIT,
                "pid:[%d] NVSHMEM_ENABLE_NIC_PE_MAPPING = 1, savedDev: %d, setting dev_id = %d", getpid(), savedDev, selected_devices[0]);
        INFO(NVSHMEM_INIT, "NVSHMEM_ENABLE_NIC_PE_MAPPING = 1, setting dev_id = %d",
             selected_devices[0]);
        found_devices++;
    } else {
        nvshmemi_get_devices_by_distance(selected_devices, max_devices_per_pe, tcurr);
        for (int i = 0; i < max_devices_per_pe; i++) {
            if (selected_devices[i] == -1) {
                break;
            }
            found_devices++;
            INFO(NVSHMEM_INIT,
                 "NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device %d setting dev_id = %d", i,
                 selected_devices[i]);
        }
    }

    /* setting n_devices to 0 is the transports way of
     * letting us know it's managing devices internally.
     */
    if (tcurr->n_devices > 0 && selected_devices[0] == -1) {
        NVSHMEMI_ERROR_JMP(status, NVSHMEMX_ERROR_INTERNAL, out, "No devices selected.\n");
    }

    status = tcurr->host_ops.connect_endpoints(tcurr, selected_devices, found_devices);
    NVSHMEMI_NZ_ERROR_JMP(status, NVSHMEMX_ERROR_INTERNAL, out, "connect EPS failed \n");
    status = nvshmemi_boot_handle.barrier(&nvshmemi_boot_handle);
    NVSHMEMI_NZ_ERROR_JMP(status, NVSHMEMX_ERROR_INTERNAL, out, "barrier failed \n");

    status = nvshmemi_update_device_state();
}

out: return status; }

This patch does not work on H20. Tested with the latest DeepEP commit: bb393e7760f94eb93878f4d62d967a58bd2d777d

cscyuge, May 14 '25 03:05
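
To make the effect of the patched branch easier to see, here is a minimal Python model of the three selection branches in the snippet above. It is only a sketch: `select_devices` and the `by_distance` callback are hypothetical names introduced for illustration, and it assumes `cudaGetDevice` returns the node-local GPU index (0 to 7 on an 8-GPU node).

```python
def select_devices(n_devices, cuda_device, nic_pe_mapping,
                   by_distance=None, max_devices_per_pe=1):
    """Illustrative model of the NIC selection branches (not NVSHMEM code)."""
    selected = [-1] * max_devices_per_pe
    if n_devices <= 1:
        # First available device, or -1 when the transport reports none.
        selected[0] = n_devices - 1
    elif nic_pe_mapping:
        # Patched branch: index NICs by the current CUDA device.
        selected[0] = cuda_device % n_devices
    else:
        # Stand-in for nvshmemi_get_devices_by_distance (hypothetical callback).
        picks = by_distance(cuda_device) if by_distance else []
        for i, dev in enumerate(picks[:max_devices_per_pe]):
            selected[i] = dev
    return selected


# NIC index chosen for GPUs 0..7 when NVSHMEM_ENABLE_NIC_PE_MAPPING=1.
four_nics = [select_devices(4, gpu, True)[0] for gpu in range(8)]
eight_nics = [select_devices(8, gpu, True)[0] for gpu in range(8)]
print(four_nics)
print(eight_nics)
```

Under these assumptions, a node with four NICs shares each NIC between two GPUs, while a node with eight NICs gets a one-to-one GPU-to-NIC assignment.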

@cscyuge Can you share your machine environment? Please post the output of these two commands: nvidia-smi topo -m and ibv_devinfo

alpha-baby, May 14 '25 07:05

nvidia-smi topo -m:

       GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    PIX     PHB     NODE    SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    PHB     PIX     NODE    SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    96-191,288-383  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    96-191,288-383  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     NODE    NODE    PIX     PHB     96-191,288-383  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     NODE    NODE    PHB     PIX     96-191,288-383  1               N/A
NIC0    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC1    NODE    PIX     PHB     NODE    SYS     SYS     SYS     SYS     NODE     X      PHB     NODE    SYS     SYS     SYS     SYS
NIC2    NODE    PHB     PIX     NODE    SYS     SYS     SYS     SYS     NODE    PHB      X      NODE    SYS     SYS     SYS     SYS
NIC3    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE
NIC5    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    PIX     PHB     SYS     SYS     SYS     SYS     NODE    NODE     X      PHB
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    PHB     PIX     SYS     SYS     SYS     SYS     NODE    NODE    PHB      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
  NIC4: mlx5_bond_4
  NIC5: mlx5_bond_5
  NIC6: mlx5_bond_6
  NIC7: mlx5_bond_7

ibv_devinfo:

hca_id: mlx5_bond_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.39.1002
        node_guid:                      5c25:7303:0094:212a
        sys_image_guid:                 5c25:7303:0094:212a
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000834
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_1
        transport:                      InfiniBand (0)
        fw_ver:                         28.39.1002
        node_guid:                      5c25:7303:0094:3566
        sys_image_guid:                 5c25:7303:0094:3566
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000834
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_2
        transport:                      InfiniBand (0)
        fw_ver:                         28.39.1002
        node_guid:                      5c25:7303:0094:4256
        sys_image_guid:                 5c25:7303:0094:4256
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000834
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_3
        transport:                      InfiniBand (0)
        fw_ver:                         28.39.1002
        node_guid:                      5c25:7303:0094:214a
        sys_image_guid:                 5c25:7303:0094:214a
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000834
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_4
        transport:                      InfiniBand (0)
        fw_ver:                         28.39.1002
        node_guid:                      5c25:7303:0094:284a
        sys_image_guid:                 5c25:7303:0094:284a
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000834
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_5
        transport:                      InfiniBand (0)
        fw_ver:                         28.39.1002
        node_guid:                      5c25:7303:0094:213a
        sys_image_guid:                 5c25:7303:0094:213a
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000834
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_6
        transport:                      InfiniBand (0)
        fw_ver:                         28.39.1002
        node_guid:                      5c25:7303:0094:23ba
        sys_image_guid:                 5c25:7303:0094:23ba
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000834
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_7
        transport:                      InfiniBand (0)
        fw_ver:                         28.39.1002
        node_guid:                      5c25:7303:0094:288a
        sys_image_guid:                 5c25:7303:0094:288a
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000834
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

cscyuge, May 14 '25 08:05

Your machine topology is different from mine: my machine has only four network cards. My patch should not apply to you, so you don't need to enable NVSHMEM_ENABLE_NIC_PE_MAPPING.

You should set NVSHMEM_ENABLE_NIC_PE_MAPPING=0. @cscyuge

alpha-baby, May 14 '25 09:05
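
For reference, the GPU-to-NIC affinity in the posted topology can be read off mechanically. The sketch below copies the GPU-to-NIC block of the nvidia-smi topo -m output and picks, for each GPU, the NIC(s) with the closest link type. The ranking PIX < PXB < PHB < NODE < SYS follows the legend; the helper `closest_nics` is a hypothetical name, and the real `nvshmemi_get_devices_by_distance` may apply different rules.

```python
# GPU-to-NIC block copied from the nvidia-smi topo -m output in this thread.
GPU_TO_NIC = """
GPU0 PIX NODE NODE NODE SYS SYS SYS SYS
GPU1 NODE PIX PHB NODE SYS SYS SYS SYS
GPU2 NODE PHB PIX NODE SYS SYS SYS SYS
GPU3 NODE NODE NODE PIX SYS SYS SYS SYS
GPU4 SYS SYS SYS SYS PIX NODE NODE NODE
GPU5 SYS SYS SYS SYS NODE PIX NODE NODE
GPU6 SYS SYS SYS SYS NODE NODE PIX PHB
GPU7 SYS SYS SYS SYS NODE NODE PHB PIX
"""

# Closest first, following the legend's description of each link type.
RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}


def closest_nics(table):
    """Return {gpu: [nic names]} keeping only the closest link type per GPU."""
    result = {}
    for line in table.strip().splitlines():
        gpu, *links = line.split()
        best = min(RANK[link] for link in links)
        # Column i corresponds to NICi, which the NIC legend maps to mlx5_bond_i.
        result[gpu] = [f"mlx5_bond_{i}" for i, link in enumerate(links)
                       if RANK[link] == best]
    return result


closest = closest_nics(GPU_TO_NIC)
for gpu, nics in closest.items():
    print(gpu, nics)
```

On this machine every GPU has exactly one PIX-attached NIC, so the closest-link choice is one NIC per GPU.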

I applied the change to nvshmemi_setup_connections and tried four 8*H20 nodes with:

# node 0
NVSHMEM_ENABLE_NIC_PE_MAPPING=0 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2,mlx5_bond_4:1:2,mlx5_bond_5:1:2,mlx5_bond_6:1:2,mlx5_bond_7:1:2" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=0 python test_internode.py
# node 1
NVSHMEM_ENABLE_NIC_PE_MAPPING=0 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2,mlx5_bond_4:1:2,mlx5_bond_5:1:2,mlx5_bond_6:1:2,mlx5_bond_7:1:2" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=1 python test_internode.py
# node 2
NVSHMEM_ENABLE_NIC_PE_MAPPING=0 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2,mlx5_bond_4:1:2,mlx5_bond_5:1:2,mlx5_bond_6:1:2,mlx5_bond_7:1:2" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=2 python test_internode.py
# node 3
NVSHMEM_ENABLE_NIC_PE_MAPPING=0 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2,mlx5_bond_4:1:2,mlx5_bond_5:1:2,mlx5_bond_6:1:2,mlx5_bond_7:1:2" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=3 python test_internode.py
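
Since the four commands differ only in RANK, they can be generated rather than typed by hand. The following is a small sketch with hypothetical helper names (`hca_pe_mapping`, `launch_command`). It reads each mapping entry as HCA name, port, and PE count, which is an assumption about the NVSHMEM_HCA_PE_MAPPING format that should be checked against the NVSHMEM documentation.

```python
def hca_pe_mapping(hca_names, port=1, pe_count=2):
    """Join entries of the form name:port:count (field meaning is assumed)."""
    return ",".join(f"{name}:{port}:{pe_count}" for name in hca_names)


def launch_command(rank, world_size=4, master_addr="<ip>"):
    """Build the per-node command line used in this thread for a given rank."""
    mapping = hca_pe_mapping([f"mlx5_bond_{i}" for i in range(8)])
    env = [
        ("NVSHMEM_ENABLE_NIC_PE_MAPPING", "0"),
        ("NVSHMEM_HCA_PE_MAPPING", f'"{mapping}"'),
        ("NCCL_SOCKET_IFNAME", "eth0"),
        ("NCCL_IB_GID_INDEX", "3"),
        ("MASTER_ADDR", master_addr),  # placeholder, as in the original post
        ("WORLD_SIZE", str(world_size)),
        ("RANK", str(rank)),
    ]
    return " ".join(f"{k}={v}" for k, v in env) + " python test_internode.py"


for rank in range(4):
    print(f"# node {rank}")
    print(launch_command(rank))
```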

I got the same timeout error with the four-node commands above.

I also tried two 8*H20 nodes and got a different error message:

...
[rank 3] Dispatch send/recv time: 1001.35 us | Combine send/recv time: 1182.14 us
[rank 0] Dispatch send/recv time: 153.66 us | Combine send/recv time: 195.56 us
[rank 5] Dispatch send/recv time: 96.00 us | Combine send/recv time: 104.76 us
[rank 1] Dispatch send/recv time: 18.15 us | Combine send/recv time: 20.66 us
[rank 7] Dispatch send/recv time: 268.17 us | Combine send/recv time: 308.93 us
[rank 6] Dispatch send/recv time: 19.01 us | Combine send/recv time: 20.84 us
[rank 2] Dispatch send/recv time: 269.79 us | Combine send/recv time: 337.55 us
[rank 4] Dispatch send/recv time: 18.34 us | Combine send/recv time: 20.84 us
[VM-12-10-centos:87868:0:87868] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x561aef362)
[VM-12-10-centos:87866:0:87866] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55ae9bb36)
[VM-12-10-centos:87867:0:87867] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x559eddbd4)
[VM-12-10-centos:87863:0:87863] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55ff4cc52)
[VM-12-10-centos:87869:0:87869] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x5628dadb0)
[VM-12-10-centos:87865:0:87865] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x56549fda5)
[VM-12-10-centos:87864:0:87864] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55613f4f7)
[VM-12-10-centos:87862:0:87862] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x5577c60e0)
==== backtrace (tid:  87863) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000019d57 ibv_dealloc_pd()  ???:0
 2 0x000000000000ce6d nvshmemt_ibrc_finalize()  :0
 3 0x0000000000220ab2 nvshmemi_transport_finalize()  ???:0
 4 0x00000000000b49f9 nvshmemid_hostlib_finalize()  ???:0
 5 0x00000000001b301f nvshmemi_finalize()  ???:0
 6 0x0000000000055252 deep_ep::Buffer::~Buffer()  /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
 7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()()  /usr/include/c++/11/bits/unique_ptr.h:85
 8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr()  /usr/include/c++/11/bits/unique_ptr.h:361
 9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc()  /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance()  :0
11 0x00000000005174d1 pybind11_object_dealloc()  :0
12 0x0000000000169b93 _Py_CheckFunctionResult()  ???:0
13 0x00000000001a2407 PyObject_DelItem()  ???:0
14 0x0000000000181370 PyMapping_Check()  ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall()  ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
19 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
23 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault()  ???:0
25 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault()  ???:0
27 0x0000000000259f56 PyEval_EvalCode()  ???:0
28 0x0000000000259e26 PyEval_EvalCode()  ???:0
29 0x0000000000280808 PyUnicode_Tailmatch()  ???:0
30 0x000000000027b00f PyInit__collections()  ???:0
31 0x0000000000274d91 PyRun_StringFlags()  ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags()  ???:0
33 0x0000000000273f70 Py_RunMain()  ???:0
34 0x000000000024de6d Py_BytesMain()  ???:0
35 0x0000000000029d90 __libc_init_first()  ???:0
36 0x0000000000029e40 __libc_start_main()  ???:0
37 0x000000000024dd65 _start()  ???:0
=================================
==== backtrace (tid:  87866) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000019d57 ibv_dealloc_pd()  ???:0
 2 0x000000000000ce6d nvshmemt_ibrc_finalize()  :0
 3 0x0000000000220ab2 nvshmemi_transport_finalize()  ???:0
 4 0x00000000000b49f9 nvshmemid_hostlib_finalize()  ???:0
 5 0x00000000001b301f nvshmemi_finalize()  ???:0
 6 0x0000000000055252 deep_ep::Buffer::~Buffer()  /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
 7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()()  /usr/include/c++/11/bits/unique_ptr.h:85
 8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr()  /usr/include/c++/11/bits/unique_ptr.h:361
 9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc()  /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance()  :0
11 0x00000000005174d1 pybind11_object_dealloc()  :0
12 0x0000000000169b93 _Py_CheckFunctionResult()  ???:0
13 0x00000000001a2407 PyObject_DelItem()  ???:0
14 0x0000000000181370 PyMapping_Check()  ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall()  ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
19 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
23 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault()  ???:0
25 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault()  ???:0
27 0x0000000000259f56 PyEval_EvalCode()  ???:0
28 0x0000000000259e26 PyEval_EvalCode()  ???:0
29 0x0000000000280808 PyUnicode_Tailmatch()  ???:0
30 0x000000000027b00f PyInit__collections()  ???:0
31 0x0000000000274d91 PyRun_StringFlags()  ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags()  ???:0
33 0x0000000000273f70 Py_RunMain()  ???:0
34 0x000000000024de6d Py_BytesMain()  ???:0
35 0x0000000000029d90 __libc_init_first()  ???:0
36 0x0000000000029e40 __libc_start_main()  ???:0
37 0x000000000024dd65 _start()  ???:0
=================================
==== backtrace (tid:  87868) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000019d57 ibv_dealloc_pd()  ???:0
 2 0x000000000000ce6d nvshmemt_ibrc_finalize()  :0
 3 0x0000000000220ab2 nvshmemi_transport_finalize()  ???:0
 4 0x00000000000b49f9 nvshmemid_hostlib_finalize()  ???:0
 5 0x00000000001b301f nvshmemi_finalize()  ???:0
 6 0x0000000000055252 deep_ep::Buffer::~Buffer()  /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
 7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()()  /usr/include/c++/11/bits/unique_ptr.h:85
 8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr()  /usr/include/c++/11/bits/unique_ptr.h:361
 9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc()  /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance()  :0
11 0x00000000005174d1 pybind11_object_dealloc()  :0
12 0x0000000000169b93 _Py_CheckFunctionResult()  ???:0
13 0x00000000001a2407 PyObject_DelItem()  ???:0
14 0x0000000000181370 PyMapping_Check()  ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall()  ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
19 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
23 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault()  ???:0
25 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault()  ???:0
27 0x0000000000259f56 PyEval_EvalCode()  ???:0
28 0x0000000000259e26 PyEval_EvalCode()  ???:0
29 0x0000000000280808 PyUnicode_Tailmatch()  ???:0
30 0x000000000027b00f PyInit__collections()  ???:0
31 0x0000000000274d91 PyRun_StringFlags()  ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags()  ???:0
33 0x0000000000273f70 Py_RunMain()  ???:0
34 0x000000000024de6d Py_BytesMain()  ???:0
35 0x0000000000029d90 __libc_init_first()  ???:0
36 0x0000000000029e40 __libc_start_main()  ???:0
37 0x000000000024dd65 _start()  ???:0
=================================
==== backtrace (tid:  87865) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000019d57 ibv_dealloc_pd()  ???:0
 2 0x000000000000ce6d nvshmemt_ibrc_finalize()  :0
 3 0x0000000000220ab2 nvshmemi_transport_finalize()  ???:0
 4 0x00000000000b49f9 nvshmemid_hostlib_finalize()  ???:0
 5 0x00000000001b301f nvshmemi_finalize()  ???:0
 6 0x0000000000055252 deep_ep::Buffer::~Buffer()  /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
 7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()()  /usr/include/c++/11/bits/unique_ptr.h:85
 8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr()  /usr/include/c++/11/bits/unique_ptr.h:361
 9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc()  /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance()  :0
11 0x00000000005174d1 pybind11_object_dealloc()  :0
12 0x0000000000169b93 _Py_CheckFunctionResult()  ???:0
13 0x00000000001a2407 PyObject_DelItem()  ???:0
14 0x0000000000181370 PyMapping_Check()  ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall()  ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
19 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
23 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault()  ???:0
25 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault()  ???:0
27 0x0000000000259f56 PyEval_EvalCode()  ???:0
28 0x0000000000259e26 PyEval_EvalCode()  ???:0
29 0x0000000000280808 PyUnicode_Tailmatch()  ???:0
30 0x000000000027b00f PyInit__collections()  ???:0
31 0x0000000000274d91 PyRun_StringFlags()  ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags()  ???:0
33 0x0000000000273f70 Py_RunMain()  ???:0
34 0x000000000024de6d Py_BytesMain()  ???:0
35 0x0000000000029d90 __libc_init_first()  ???:0
36 0x0000000000029e40 __libc_start_main()  ???:0
37 0x000000000024dd65 _start()  ???:0
=================================
==== backtrace (tid:  87862) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000001a0d6 ibv_dereg_mr()  ???:0
 2 0x000000000000cddc nvshmemt_ibrc_finalize()  :0
 3 0x0000000000220ab2 nvshmemi_transport_finalize()  ???:0
 4 0x00000000000b49f9 nvshmemid_hostlib_finalize()  ???:0
 5 0x00000000001b301f nvshmemi_finalize()  ???:0
 6 0x0000000000055252 deep_ep::Buffer::~Buffer()  /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
 7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()()  /usr/include/c++/11/bits/unique_ptr.h:85
 8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr()  /usr/include/c++/11/bits/unique_ptr.h:361
 9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc()  /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance()  :0
11 0x00000000005174d1 pybind11_object_dealloc()  :0
12 0x0000000000169b93 _Py_CheckFunctionResult()  ???:0
13 0x00000000001a2407 PyObject_DelItem()  ???:0
14 0x0000000000181370 PyMapping_Check()  ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall()  ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
19 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
23 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault()  ???:0
25 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault()  ???:0
27 0x0000000000259f56 PyEval_EvalCode()  ???:0
28 0x0000000000259e26 PyEval_EvalCode()  ???:0
29 0x0000000000280808 PyUnicode_Tailmatch()  ???:0
30 0x000000000027b00f PyInit__collections()  ???:0
31 0x0000000000274d91 PyRun_StringFlags()  ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags()  ???:0
33 0x0000000000273f70 Py_RunMain()  ???:0
34 0x000000000024de6d Py_BytesMain()  ???:0
35 0x0000000000029d90 __libc_init_first()  ???:0
36 0x0000000000029e40 __libc_start_main()  ???:0
37 0x000000000024dd65 _start()  ???:0
=================================
==== backtrace (tid:  87869) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000019d57 ibv_dealloc_pd()  ???:0
 2 0x000000000000ce6d nvshmemt_ibrc_finalize()  :0
 3 0x0000000000220ab2 nvshmemi_transport_finalize()  ???:0
 4 0x00000000000b49f9 nvshmemid_hostlib_finalize()  ???:0
 5 0x00000000001b301f nvshmemi_finalize()  ???:0
 6 0x0000000000055252 deep_ep::Buffer::~Buffer()  /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
 7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()()  /usr/include/c++/11/bits/unique_ptr.h:85
 8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr()  /usr/include/c++/11/bits/unique_ptr.h:361
 9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc()  /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance()  :0
11 0x00000000005174d1 pybind11_object_dealloc()  :0
12 0x0000000000169b93 _Py_CheckFunctionResult()  ???:0
13 0x00000000001a2407 PyObject_DelItem()  ???:0
14 0x0000000000181370 PyMapping_Check()  ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall()  ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
19 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
23 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault()  ???:0
25 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault()  ???:0
27 0x0000000000259f56 PyEval_EvalCode()  ???:0
28 0x0000000000259e26 PyEval_EvalCode()  ???:0
29 0x0000000000280808 PyUnicode_Tailmatch()  ???:0
30 0x000000000027b00f PyInit__collections()  ???:0
31 0x0000000000274d91 PyRun_StringFlags()  ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags()  ???:0
33 0x0000000000273f70 Py_RunMain()  ???:0
34 0x000000000024de6d Py_BytesMain()  ???:0
35 0x0000000000029d90 __libc_init_first()  ???:0
36 0x0000000000029e40 __libc_start_main()  ???:0
37 0x000000000024dd65 _start()  ???:0
=================================
==== backtrace (tid:  87864) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000019d57 ibv_dealloc_pd()  ???:0
 2 0x000000000000ce6d nvshmemt_ibrc_finalize()  :0
 3 0x0000000000220ab2 nvshmemi_transport_finalize()  ???:0
 4 0x00000000000b49f9 nvshmemid_hostlib_finalize()  ???:0
 5 0x00000000001b301f nvshmemi_finalize()  ???:0
 6 0x0000000000055252 deep_ep::Buffer::~Buffer()  /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
 7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()()  /usr/include/c++/11/bits/unique_ptr.h:85
 8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr()  /usr/include/c++/11/bits/unique_ptr.h:361
 9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc()  /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance()  :0
11 0x00000000005174d1 pybind11_object_dealloc()  :0
12 0x0000000000169b93 _Py_CheckFunctionResult()  ???:0
13 0x00000000001a2407 PyObject_DelItem()  ???:0
14 0x0000000000181370 PyMapping_Check()  ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall()  ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
19 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
23 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault()  ???:0
25 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault()  ???:0
27 0x0000000000259f56 PyEval_EvalCode()  ???:0
28 0x0000000000259e26 PyEval_EvalCode()  ???:0
29 0x0000000000280808 PyUnicode_Tailmatch()  ???:0
30 0x000000000027b00f PyInit__collections()  ???:0
31 0x0000000000274d91 PyRun_StringFlags()  ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags()  ???:0
33 0x0000000000273f70 Py_RunMain()  ???:0
34 0x000000000024de6d Py_BytesMain()  ???:0
35 0x0000000000029d90 __libc_init_first()  ???:0
36 0x0000000000029e40 __libc_start_main()  ???:0
37 0x000000000024dd65 _start()  ???:0
=================================
==== backtrace (tid:  87867) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000019d57 ibv_dealloc_pd()  ???:0
 2 0x000000000000ce6d nvshmemt_ibrc_finalize()  :0
 3 0x0000000000220ab2 nvshmemi_transport_finalize()  ???:0
 4 0x00000000000b49f9 nvshmemid_hostlib_finalize()  ???:0
 5 0x00000000001b301f nvshmemi_finalize()  ???:0
 6 0x0000000000055252 deep_ep::Buffer::~Buffer()  /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
 7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()()  /usr/include/c++/11/bits/unique_ptr.h:85
 8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr()  /usr/include/c++/11/bits/unique_ptr.h:361
 9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc()  /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance()  :0
11 0x00000000005174d1 pybind11_object_dealloc()  :0
12 0x0000000000169b93 _Py_CheckFunctionResult()  ???:0
13 0x00000000001a2407 PyObject_DelItem()  ???:0
14 0x0000000000181370 PyMapping_Check()  ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall()  ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault()  ???:0
19 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault()  ???:0
23 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault()  ???:0
25 0x000000000018b66c _PyFunction_Vectorcall()  ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault()  ???:0
27 0x0000000000259f56 PyEval_EvalCode()  ???:0
28 0x0000000000259e26 PyEval_EvalCode()  ???:0
29 0x0000000000280808 PyUnicode_Tailmatch()  ???:0
30 0x000000000027b00f PyInit__collections()  ???:0
31 0x0000000000274d91 PyRun_StringFlags()  ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags()  ???:0
33 0x0000000000273f70 Py_RunMain()  ???:0
34 0x000000000024de6d Py_BytesMain()  ???:0
35 0x0000000000029d90 __libc_init_first()  ???:0
36 0x0000000000029e40 __libc_start_main()  ???:0
37 0x000000000024dd65 _start()  ???:0
=================================
W0514 11:36:50.744000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87862 via signal SIGTERM
W0514 11:36:50.744000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87863 via signal SIGTERM
W0514 11:36:50.744000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87864 via signal SIGTERM
W0514 11:36:50.745000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87865 via signal SIGTERM
W0514 11:36:50.745000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87866 via signal SIGTERM
W0514 11:36:50.745000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87867 via signal SIGTERM
W0514 11:36:50.745000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87869 via signal SIGTERM
Traceback (most recent call last):
  File "/mnt/yscfs/linjunxian/DeepEP/tests/test_internode.py", line 247, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 196, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with signal SIGSEGV

If I remove NVSHMEM_ENABLE_NIC_PE_MAPPING and NVSHMEM_HCA_PE_MAPPING, two 8*H20 nodes pass test_internode.py (they also pass without the change to nvshmemi_setup_connections), but 4 nodes still time out.

cscyuge avatar May 14 '25 11:05 cscyuge

You should configure each node like this:

# node 0
NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:1,mlx5_bond_1:1:1,mlx5_bond_2:1:1,mlx5_bond_3:1:1,mlx5_bond_4:1:1,mlx5_bond_5:1:1,mlx5_bond_6:1:1,mlx5_bond_7:1:1" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=0 python test_internode.py
# node 1
NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:1,mlx5_bond_1:1:1,mlx5_bond_2:1:1,mlx5_bond_3:1:1,mlx5_bond_4:1:1,mlx5_bond_5:1:1,mlx5_bond_6:1:1,mlx5_bond_7:1:1" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=1 python test_internode.py
# node 2
NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:1,mlx5_bond_1:1:1,mlx5_bond_2:1:1,mlx5_bond_3:1:1,mlx5_bond_4:1:1,mlx5_bond_5:1:1,mlx5_bond_6:1:1,mlx5_bond_7:1:1" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=2 python test_internode.py
# node 3
NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:1,mlx5_bond_1:1:1,mlx5_bond_2:1:1,mlx5_bond_3:1:1,mlx5_bond_4:1:1,mlx5_bond_5:1:1,mlx5_bond_6:1:1,mlx5_bond_7:1:1" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=3 python test_internode.py
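The four command lines above differ only in RANK, so they can be generated rather than typed by hand. Below is a minimal sketch that does this. The helper names `build_hca_pe_mapping` and `node_command` are hypothetical and not part of DeepEP or NVSHMEM. It assumes 8 NICs per node named mlx5_bond_0 through mlx5_bond_7, and it reads each mapping entry as device:port:PE-count, which should be checked against the NVSHMEM documentation for your version. `<ip>` is left as a placeholder, as in the comment above.

```python
# Sketch only: prints the per-node launch lines shown in the comment above.
# Assumptions (not from DeepEP/NVSHMEM sources): NIC names, port 1, one PE per NIC.

def build_hca_pe_mapping(num_nics=8, prefix="mlx5_bond_", port=1, pes_per_nic=1):
    """Return a comma-separated list of <device>:<port>:<count> entries."""
    return ",".join(f"{prefix}{i}:{port}:{pes_per_nic}" for i in range(num_nics))


def node_command(rank, world_size, master_addr="<ip>"):
    """Return the shell line for one node; only RANK differs between nodes."""
    env = [
        ("NVSHMEM_ENABLE_NIC_PE_MAPPING", "1"),
        ("NVSHMEM_HCA_PE_MAPPING", f'"{build_hca_pe_mapping()}"'),
        ("NCCL_SOCKET_IFNAME", "eth0"),
        ("NCCL_IB_GID_INDEX", "3"),
        ("MASTER_ADDR", master_addr),
        ("WORLD_SIZE", str(world_size)),
        ("RANK", str(rank)),
    ]
    # Keep every variable on the same line as the python command so that
    # all of them are passed to the test process.
    return " ".join(f"{k}={v}" for k, v in env) + " python test_internode.py"


if __name__ == "__main__":
    for rank in range(4):
        print(f"# node {rank}")
        print(node_command(rank, world_size=4))
```

Generating the lines this way keeps the mapping string identical on every node, which avoids a mismatch caused by editing one of the four commands by hand.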

alpha-baby avatar May 15 '25 02:05 alpha-baby