[test_internode.py] fails with multi-QP: dispatch timeout on a RoCE network when testing on 2*H20 nodes
When I run the cross-node test with MASTER_ADDR=<ip> MASTER_PORT=30001 WORLD_SIZE=2 RANK=0 python test_internode.py on 2*H20 nodes, I get the following timeout log:
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 0, nvl: 4, src RDMA lane: 1, dst NVL: 2, meta: 0, 0, 0, 0
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f718176c446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f71817166e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7181b73a18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f7181b3a92e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f7181b3ba57 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f7181b3bc5f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f718059af70 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f718174d69f in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f718174637b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7181746529 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f7180861a98 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f7180861de6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x181758 (0x5570898a4758 in /usr/bin/python)
frame #13: <unknown function> + 0x1949e8 (0x5570898b79e8 in /usr/bin/python)
frame #14: <unknown function> + 0x1949fc (0x5570898b79fc in /usr/bin/python)
frame #15: <unknown function> + 0x1949fc (0x5570898b79fc in /usr/bin/python)
frame #16: <unknown function> + 0x1a08bf (0x5570898c38bf in /usr/bin/python)
frame #17: <unknown function> + 0x15f9d6 (0x5570898829d6 in /usr/bin/python)
frame #18: <unknown function> + 0x2941a7 (0x5570899b71a7 in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x5757 (0x55708989da27 in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x818 (0x557089898ae8 in /usr/bin/python)
frame #22: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x6d2 (0x5570898989a2 in /usr/bin/python)
frame #24: _PyFunction_Vectorcall + 0x7c (0x5570898aeaec in /usr/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x1a22 (0x557089899cf2 in /usr/bin/python)
frame #26: <unknown function> + 0x25ae56 (0x55708997de56 in /usr/bin/python)
frame #27: PyEval_EvalCode + 0x86 (0x55708997dd26 in /usr/bin/python)
frame #28: <unknown function> + 0x281ae8 (0x5570899a4ae8 in /usr/bin/python)
frame #29: <unknown function> + 0x27c2ef (0x55708999f2ef in /usr/bin/python)
frame #30: PyRun_StringFlags + 0x81 (0x557089998f61 in /usr/bin/python)
frame #31: PyRun_SimpleStringFlags + 0x41 (0x557089998e11 in /usr/bin/python)
frame #32: Py_RunMain + 0x3d0 (0x557089998140 in /usr/bin/python)
frame #33: Py_BytesMain + 0x2d (0x557089971d6d in /usr/bin/python)
frame #34: <unknown function> + 0x29d90 (0x7f7182671d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: __libc_start_main + 0x80 (0x7f7182671e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x25 (0x557089971c65 in /usr/bin/python)
This issue only occurs after the Multi-QP patch (https://github.com/deepseek-ai/DeepEP/commit/5ab80c28f3d6c3e4f88ce236f427ab7c81025172) was merged, so it is probably related to multi-QP.
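The report shows only the RANK=0 command, and the second node needs the matching RANK=1 invocation. Below is a minimal sketch that builds both command lines. It assumes WORLD_SIZE counts nodes and RANK is the node index, as the 2-node command in the report suggests. The helper name launch_commands and the address 10.0.0.1 are hypothetical placeholders.

```python
import shlex


def launch_commands(master_addr, master_port=30001, num_nodes=2,
                    script="test_internode.py"):
    """Build one shell command line per node, mirroring the command in the report."""
    commands = []
    for node_rank in range(num_nodes):
        env = {
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(num_nodes),  # assumption: counts nodes, not GPUs
            "RANK": str(node_rank),        # assumption: index of the node
        }
        prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
        commands.append(f"{prefix} python {shlex.quote(script)}")
    return commands


# Hypothetical usage: print the command to run on each node.
for command in launch_commands("10.0.0.1"):
    print(command)
```

Every node should use the same MASTER_ADDR and MASTER_PORT; only RANK differs.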
Is adaptive routing enabled in your NIC and switch configuration?
This change has been successfully tested in our own IB environment and in RoCE environments of several cloud service providers. However, I have indeed discovered that it fails in some RoCE environments where adaptive routing is enabled.
AR is off in my environment. Is this caused by multi-QP? I tried an earlier version and it works.
@jeffye-dev Try setting this variable to True:
https://github.com/deepseek-ai/DeepEP/commit/5ab80c28f3d6c3e4f88ce236f427ab7c81025172#diff-c77f4e0d77d8fc685ab907f9ad338f0c168b96ad4313c77b6dff9c7faf0713b9R224
https://github.com/deepseek-ai/DeepEP/commit/007fcfcf97914e1f3d661f28dd125e7d1b9f8320#diff-c77f4e0d77d8fc685ab907f9ad338f0c168b96ad4313c77b6dff9c7faf0713b9R222
In the latest version, is setting test_ll_compatibility=False not allowed? @sphish
I ask because the test fails when test_ll_compatibility=False and passes when test_ll_compatibility=True.
Failure log:
/root/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 0 nranks 2 tag 0 - ENTER
/root/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 0 nranks 2 tag 1 - DONE
/root/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 0 nranks 2 tag 0 - ENTER
/root/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 0 nranks 2 tag 1 - DONE
[config] num_tokens=4096, hidden=7168, num_topk_groups=2, num_topk=8
[layout] Kernel performance: 0.050 ms
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 4.11 GB/s (RDMA), 13.47 GB/s (NVL)
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 0, nvl: 2, src RDMA: 1, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 0, nvl: 2, src RDMA lane: 1, dst NVL: 2, meta: 0, 0, 0, 0
...................
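With this many timeout lines, it can help to count them per channel before reading them one by one. The sketch below parses the two line shapes that appear in the log above ("NVL receiver timeout" and "forwarder timeout (RDMA meta)"). The regular expressions are inferred only from those lines, so other DeepEP messages may not match. The function names are hypothetical.

```python
import re
from collections import Counter

# Inferred from the log lines in this thread, e.g.
# "DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 2, ..."
LINE_RE = re.compile(
    r"DeepEP dispatch (?P<kind>.+?) timeout(?: \((?P<detail>[^)]*)\))?, (?P<fields>.*)"
)
# A field is "key: int" or "key: int, int, ..." (as in "meta: 0, 0, 0, 0").
FIELD_RE = re.compile(r"([A-Za-z][A-Za-z ]*): (-?\d+(?:, -?\d+)*)")


def parse_timeout_line(line):
    """Return a dict for one dispatch timeout line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    record = {"kind": match.group("kind"), "detail": match.group("detail")}
    for key, value in FIELD_RE.findall(match.group("fields")):
        numbers = [int(v) for v in value.split(", ")]
        record[key] = numbers[0] if len(numbers) == 1 else numbers
    return record


def summarize(log_text):
    """Count timeouts per (kind, channel) so the stuck channels stand out."""
    counts = Counter()
    for line in log_text.splitlines():
        record = parse_timeout_line(line)
        if record is not None:
            counts[(record["kind"], record.get("channel"))] += 1
    return counts
```

Calling summarize on the captured stderr of one rank returns a Counter keyed by (kind, channel). Sorting it with most_common() shows which channels report the most timeouts.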
@alpha-baby Is this issue occurring intermittently, or does it happen every time you run the test? I haven't encountered this problem in our own testing environment. Have you recompiled the C code?
I am using the code at this commit: https://github.com/deepseek-ai/DeepEP/commit/007fcfcf97914e1f3d661f28dd125e7d1b9f8320#diff-c77f4e0d77d8fc685ab907f9ad338f0c168b96ad4313c77b6dff9c7faf0713b9R222
I only set test_ll_compatibility=False in test_internode.py and did not recompile the C code. The failure reproduces every time in my environment, which uses RoCE.
@sphish
@alpha-baby What I meant is, after switching to this commit, did you recompile the C code? If you haven't done so, you should recompile the C code.
Yes, I recompiled the C code. I found that the new commit really improves performance.
@alpha-baby That is quite strange; I can't reproduce this issue. In theory, setting test_ll_compatibility=False now only changes how the NVSHMEM group is initialized, and it shouldn't affect correctness.
I encountered this issue on 2*H800, and it occurs whether test_ll_compatibility is set to True or False. It happens only intermittently, so you may need several runs to reproduce it.
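For an intermittent failure like this, a small loop that reruns the test until it fails saves manual retries. This is a minimal sketch, not part of DeepEP. The helper name run_until_failure is hypothetical, and it assumes the test exits with a non-zero status when it hits the timeout. The same loop would need to run on both nodes so that each attempt starts together.

```python
import subprocess
import sys


def run_until_failure(cmd, max_attempts=20, timeout_s=None):
    """Run cmd repeatedly; return the 1-based attempt that failed, or None."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
            failed = result.returncode != 0
        except subprocess.TimeoutExpired:
            # A hung run is treated as a failure as well.
            failed = True
        if failed:
            return attempt
    return None


# Hypothetical usage, with MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK
# already exported in the environment of each node:
#   failed_at = run_until_failure([sys.executable, "tests/test_internode.py"])
#   print("first failing attempt:", failed_at)
```

Reporting the attempt number along with the failure log gives a rough sense of how often the timeout occurs.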
I get the same error after upgrading to the multi-QP related code.
The AR is OFF in my environment. Is it caused by multi-QP? I tried an earlier version and found that it works.
Same here.
Log:
### RANK0
tests/test_internode.py
[config] num_tokens=4096, hidden=7168, num_topk_groups=2, num_topk=8
[layout] Kernel performance: 0.075 ms
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 6.81 GB/s (RDMA), 22.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 9.22 GB/s (RDMA), 30.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 15.80 GB/s (RDMA), 51.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 20.66 GB/s (RDMA), 67.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 21.80 GB/s (RDMA), 71.38 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 22.44 GB/s (RDMA), 73.50 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 23.12 GB/s (RDMA), 75.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 22.81 GB/s (RDMA), 74.70 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 7.25 GB/s (RDMA), 23.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 8.93 GB/s (RDMA), 29.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 13.00 GB/s (RDMA), 42.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 19.47 GB/s (RDMA), 63.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 21.95 GB/s (RDMA), 71.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 22.55 GB/s (RDMA), 73.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 23.15 GB/s (RDMA), 75.79 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 23.35 GB/s (RDMA), 76.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 7.62 GB/s (RDMA), 24.96 GB/s (NVL)
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 0, nvl: 7, src NVL: 2, head: 255, tail: 255
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 0, nvl: 1, src NVL: 2, head: 280, tail: 280
DeepEP timeout check failed: 0 (rank = 3)
DeepEP timeout check failed: 0 (rank = 4)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5f3b56c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5f3b5166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5f3b94ca18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f5f3b91392e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f5f3b914a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f5f3b914c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f5f3a1faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f5f3b54d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f5f3b54637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f5f3b546529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f5f3a4c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f5f3a4c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f5f3c229d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f5f3c229e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x557fa8861095 in /usr/local/bin/python)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f1c6516c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f1c651166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f1c65588a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f1c6554f92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f1c65550a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f1c65550c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f1c63dfaf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f1c6514d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f1c6514637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f1c65146529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f1c640c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f1c640c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f1c65e29d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f1c65e29e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x5567420e7095 in /usr/local/bin/python)
DeepEP timeout check failed: 0 (rank = 5)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f97ee4b9446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f97ee4636e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f97ee5a5a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f97ee56c92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f97ee56da57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f97ee56dc5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f97ed1faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f97ee49a69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f97ee49337b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f97ee493529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f97ed4c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f97ed4c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f97ef029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f97ef029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x5591d2ad9095 in /usr/local/bin/python)
DeepEP timeout check failed: 0 (rank = 0)
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f45b976c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f45b97166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f45b9c0aa18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f45b9bd192e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f45b9bd2a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f45b9bd2c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f45b83faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f45b974d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f45b974637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f45b9746529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f45b86c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f45b86c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f45ba429d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f45ba429e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x55a9f1974095 in /usr/local/bin/python)
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbb7176c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbb717166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbb71c12a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7fbb71bd992e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7fbb71bdaa57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7fbb71bdac5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7fbb703faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7fbb7174d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fbb7174637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fbb71746529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7fbb706c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7fbb706c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7fbb72429d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7fbb72429e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x55cb7794e095 in /usr/local/bin/python)
terminate called after throwing an instance of 'c10::Error'
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 0, nvl: 2, src NVL: 2, head: 287, tail: 287
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f029c36c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f029c3166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f029c77fa18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f029c74692e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f029c747a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f029c747c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f029b3faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f029c34d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f029c34637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f029c346529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f029b6c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f029b6c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f029d229d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f029d229e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x55b74ffbb095 in /usr/local/bin/python)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc15f36c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc15f3166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc15f728a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7fc15f6ef92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7fc15f6f0a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7fc15f6f0c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7fc15dffaf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7fc15f34d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fc15f34637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fc15f346529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7fc15e2c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7fc15e2c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7fc160029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7fc160029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x56502677a095 in /usr/local/bin/python)
DeepEP timeout check failed: 0 (rank = 6)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbaa036c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbaa03166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbaa070ea18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7fbaa06d592e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7fbaa06d6a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7fbaa06d6c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7fba9effaf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7fbaa034d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fbaa034637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fbaa0346529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7fba9f2c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7fba9f2c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7fbaa1029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7fbaa1029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x5560d043d095 in /usr/local/bin/python)
W0501 13:38:45.990000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 859 via signal SIGTERM
W0501 13:38:45.990000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 860 via signal SIGTERM
W0501 13:38:45.990000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 861 via signal SIGTERM
W0501 13:38:45.990000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 862 via signal SIGTERM
W0501 13:38:45.990000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 864 via signal SIGTERM
W0501 13:38:45.991000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 865 via signal SIGTERM
W0501 13:38:45.991000 794 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 866 via signal SIGTERM
Traceback (most recent call last):
File "/sgl-workspace/DeepEP/tests/test_internode.py", line 247, in <module>
torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 4 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/sgl-workspace/DeepEP/tests/test_internode.py", line 235, in test_loop
test_main(i, local_rank, num_local_ranks, num_ranks, num_nodes, rank, buffer, group)
File "/sgl-workspace/DeepEP/tests/test_internode.py", line 179, in test_main
t = bench(lambda: buffer.dispatch(**tune_args))[0]
File "/sgl-workspace/DeepEP/tests/utils.py", line 81, in bench
fn()
File "/sgl-workspace/DeepEP/tests/test_internode.py", line 179, in <lambda>
t = bench(lambda: buffer.dispatch(**tune_args))[0]
File "/usr/local/lib/python3.10/site-packages/deep_ep-1.0.0+1590a08-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 282, in dispatch
return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
File "/usr/local/lib/python3.10/site-packages/deep_ep-1.0.0+1590a08-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 377, in internode_dispatch
recv_x, recv_x_scales, _, _, _, _, _, _, _, _, _, _, _, _, event = self.runtime.internode_dispatch(
RuntimeError: Failed: CUDA error /sgl-workspace/DeepEP/csrc/kernels/internode.cu:1214 'unspecified launch failure'
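Because the timeout lines are so numerous, tallying them per rank can make it easier to see which side is stalling. The following is a minimal sketch, not part of DeepEP: it assumes only the two timeout line formats that appear in the logs of this issue, and the names `TIMEOUT_RE` and `summarize_timeouts` are hypothetical helpers introduced here for illustration.

```python
import re
from collections import Counter

# Matches the two timeout line shapes seen in this issue's logs
# (assumption: other DeepEP versions may print different formats).
TIMEOUT_RE = re.compile(
    r"DeepEP dispatch (?P<kind>NVL receiver|forwarder) timeout"
    r"(?: \(RDMA meta\))?, channel: (?P<channel>\d+), "
    r"RDMA: (?P<rdma>\d+), nvl: (?P<nvl>\d+)"
)


def summarize_timeouts(lines):
    """Count timeout lines per (kind, RDMA rank, NVL rank) of the reporting GPU."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[(m["kind"], int(m["rdma"]), int(m["nvl"]))] += 1
    return counts


# Sample lines copied from the log in this issue; non-matching lines are ignored.
sample = [
    "DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 0, nvl: 4, "
    "src RDMA: 0, src nvl: 7, start: 0, end: 0",
    "DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 0, nvl: 4, "
    "src RDMA lane: 1, dst NVL: 2, meta: 0, 0, 0, 0",
    "terminate called after throwing an instance of 'c10::Error'",
]

for key, n in sorted(summarize_timeouts(sample).items()):
    print(key, n)
```

To run it on a captured log, pass an open file object instead of `sample`, for example `summarize_timeouts(open("rank1.log"))`, where `rank1.log` is a hypothetical file holding the output saved from one node.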
### RANK1
python tests/test_internode.py
[config] num_tokens=4096, hidden=7168, num_topk_groups=2, num_topk=8
[layout] Kernel performance: 0.074 ms
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 6.80 GB/s (RDMA), 22.42 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 9.22 GB/s (RDMA), 30.42 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 15.80 GB/s (RDMA), 52.12 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 20.63 GB/s (RDMA), 68.05 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 21.76 GB/s (RDMA), 71.79 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 22.47 GB/s (RDMA), 74.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 23.10 GB/s (RDMA), 76.21 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 22.82 GB/s (RDMA), 75.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 7.25 GB/s (RDMA), 23.93 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 8.92 GB/s (RDMA), 29.43 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 13.00 GB/s (RDMA), 42.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 19.46 GB/s (RDMA), 64.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 21.96 GB/s (RDMA), 72.44 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 22.55 GB/s (RDMA), 74.40 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 23.15 GB/s (RDMA), 76.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 23.34 GB/s (RDMA), 77.01 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 7.63 GB/s (RDMA), 25.18 GB/s (NVL)
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 0, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 0, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbfc90b9446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbfc90636e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbfc91a5a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7fbfc916c92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7fbfc916da57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7fbfc916dc5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7fbfc7dfaf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7fbfc909a69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fbfc909337b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fbfc9093529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7fbfc80c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7fbfc80c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7fbfc9e29d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7fbfc9e29e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x55eebaf97095 in /usr/local/bin/python)
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 6, RDMA: 1, nvl: 4, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 5, RDMA: 1, nvl: 6, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 9, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 5, RDMA: 1, nvl: 6, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 6, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 4, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 3, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 3, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 7, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 4, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 1, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 10, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 10, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 1, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 11, RDMA: 1, nvl: 3, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 11, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 3, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 3, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 2, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 7, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 1, src nvl: 1, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 2, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 0, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 2, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 9, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 5, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 7, RDMA: 1, nvl: 5, src RDMA: 0, src nvl: 4, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 5, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 4, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 3, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 2, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 8, RDMA: 1, nvl: 5, src RDMA lane: 0, dst NVL: 0, meta: 0, 0, 0, 0
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc64fd6c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc64fd166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6501dca18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7fc6501a392e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7fc6501a4a57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7fc6501a4c5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7fc64e9faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7fc64fd4d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fc64fd4637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fc64fd46529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7fc64ecc1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7fc64ecc1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7fc650a29d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7fc650a29e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x56290437b095 in /usr/local/bin/python)
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f42f26b9446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f42f26636e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f42f27a5a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f42f276c92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f42f276da57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f42f276dc5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f42f13faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f42f269a69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f42f269337b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f42f2693529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f42f16c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f42f16c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f42f3429d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f42f3429e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x55f1717b4095 in /usr/local/bin/python)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3cfb4b9446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f3cfb4636e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f3cfb5a5a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f3cfb56c92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f3cfb56da57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f3cfb56dc5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f3cfa1faf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f3cfb49a69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f3cfb49337b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f3cfb493529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f3cfa4c1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f3cfa4c1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f3cfc229d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f3cfc229e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x557458bad095 in /usr/local/bin/python)
what(): CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f76e2f6c446 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f76e2f166e4 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f76e3365a18 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1f92e (0x7f76e332c92e in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20a57 (0x7f76e332da57 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x20c5f (0x7f76e332dc5f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5faf70 (0x7f76e1bfaf70 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6f69f (0x7f76e2f4d69f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f76e2f4637b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f76e2f46529 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x8c1a98 (0x7f76e1ec1a98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f76e1ec1de6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #32: <unknown function> + 0x29d90 (0x7f76e3c29d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: __libc_start_main + 0x80 (0x7f76e3c29e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x25 (0x5596561c6095 in /usr/local/bin/python)
W0501 13:38:46.315000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1430 via signal SIGTERM
W0501 13:38:46.315000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1431 via signal SIGTERM
W0501 13:38:46.316000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1432 via signal SIGTERM
W0501 13:38:46.317000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1433 via signal SIGTERM
W0501 13:38:46.317000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1435 via signal SIGTERM
W0501 13:38:46.317000 1364 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1436 via signal SIGTERM
Traceback (most recent call last):
File "/sgl-workspace/DeepEP/tests/test_internode.py", line 247, in <module>
torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 5 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/sgl-workspace/DeepEP/tests/test_internode.py", line 235, in test_loop
test_main(i, local_rank, num_local_ranks, num_ranks, num_nodes, rank, buffer, group)
File "/sgl-workspace/DeepEP/tests/test_internode.py", line 179, in test_main
t = bench(lambda: buffer.dispatch(**tune_args))[0]
File "/sgl-workspace/DeepEP/tests/utils.py", line 81, in bench
fn()
File "/sgl-workspace/DeepEP/tests/test_internode.py", line 179, in <lambda>
t = bench(lambda: buffer.dispatch(**tune_args))[0]
File "/usr/local/lib/python3.10/site-packages/deep_ep-1.0.0+1590a08-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 282, in dispatch
return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
File "/usr/local/lib/python3.10/site-packages/deep_ep-1.0.0+1590a08-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 377, in internode_dispatch
recv_x, recv_x_scales, _, _, _, _, _, _, _, _, _, _, _, _, event = self.runtime.internode_dispatch(
RuntimeError: Failed: CUDA error /sgl-workspace/DeepEP/csrc/kernels/internode.cu:1079 'unspecified launch failure'
My environment (two nodes):
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0    X       NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS     SYS     0-47,96-143     0               N/A
GPU1    NV18    X       NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    SYS     SYS     0-47,96-143     0               N/A
GPU2    NV18    NV18    X       NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS     SYS     0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18    X       NV18    NV18    NV18    NV18    NODE    PIX     SYS     SYS     0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18    X       NV18    NV18    NV18    SYS     SYS     PIX     NODE    48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18    X       NV18    NV18    SYS     SYS     NODE    NODE    48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18    X       NV18    SYS     SYS     NODE    PIX     48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18    X       SYS     SYS     NODE    NODE    48-95,144-191   1               N/A
NIC0    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     X       NODE    SYS     SYS
NIC1    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    X       SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     X       NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     NODE    X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 32.39.3804
node_guid: 58a2:e103:00d5:28d4
sys_image_guid: 58a2:e103:00d5:28d4
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000000884
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_1
transport: InfiniBand (0)
fw_ver: 32.39.3804
node_guid: 58a2:e103:00f7:6a90
sys_image_guid: 58a2:e103:00f7:6a90
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000000884
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_2
transport: InfiniBand (0)
fw_ver: 32.39.3804
node_guid: 58a2:e103:00d8:061c
sys_image_guid: 58a2:e103:00d8:061c
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000000884
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_3
transport: InfiniBand (0)
fw_ver: 32.39.3804
node_guid: 58a2:e103:00dd:eee6
sys_image_guid: 58a2:e103:00dd:eee6
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000000884
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
@sphish I analyzed this problem in my test environment. When I set these two environment variables (NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"), the program times out, and the timeout reproduces 100% of the time.
If I don't set these two environment variables, the program doesn't time out, but the bandwidth is only slightly over 20 GB/s.
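For reference, each entry in the NVSHMEM_HCA_PE_MAPPING value has the form `<hca>:<port>:<count>`, where the count is the number of PEs assigned to that port. Below is a minimal Python sketch of building the same string, assuming 8 local ranks spread evenly over the four bonded NICs. The helper name `build_hca_pe_mapping` is hypothetical and not part of DeepEP or NVSHMEM.

```python
import os


def build_hca_pe_mapping(hcas, port=1, pes_per_hca=2):
    """Return a comma-separated '<hca>:<port>:<count>' string (hypothetical helper)."""
    return ",".join(f"{hca}:{port}:{pes_per_hca}" for hca in hcas)


# NIC names as reported in the environment listing of this issue.
hcas = [f"mlx5_bond_{i}" for i in range(4)]
mapping = build_hca_pe_mapping(hcas)

# The variables have to be in the environment before NVSHMEM initializes
# (assumed here to happen when the DeepEP buffer is created).
os.environ["NVSHMEM_ENABLE_NIC_PE_MAPPING"] = "1"
os.environ["NVSHMEM_HCA_PE_MAPPING"] = mapping
print(mapping)
```

Setting the variables from Python before calling `torch.multiprocessing.spawn` should be equivalent to the `export` lines, since spawned workers inherit the parent's environment.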
config: test_ll_compatibility=True
export NVSHMEM_ENABLE_NIC_PE_MAPPING=1
export NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
test result:
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 18.57 GB/s (RDMA), 60.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 37.50 GB/s (RDMA), 122.79 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 41.02 GB/s (RDMA), 134.32 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 41.88 GB/s (RDMA), 137.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 39.79 GB/s (RDMA), 130.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 41.49 GB/s (RDMA), 135.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 40.73 GB/s (RDMA), 133.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 41.08 GB/s (RDMA), 134.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 17.82 GB/s (RDMA), 58.35 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 36.30 GB/s (RDMA), 118.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 41.62 GB/s (RDMA), 136.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 41.98 GB/s (RDMA), 137.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 42.10 GB/s (RDMA), 137.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 41.07 GB/s (RDMA), 134.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 41.75 GB/s (RDMA), 136.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 40.87 GB/s (RDMA), 133.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 17.91 GB/s (RDMA), 58.64 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 35.96 GB/s (RDMA), 117.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 42.50 GB/s (RDMA), 139.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 42.35 GB/s (RDMA), 138.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 42.00 GB/s (RDMA), 137.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 40.97 GB/s (RDMA), 134.18 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 41.41 GB/s (RDMA), 135.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 40.30 GB/s (RDMA), 131.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 17.86 GB/s (RDMA), 58.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 35.70 GB/s (RDMA), 116.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 42.54 GB/s (RDMA), 139.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 42.18 GB/s (RDMA), 138.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 41.95 GB/s (RDMA), 137.38 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 41.59 GB/s (RDMA), 136.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 41.31 GB/s (RDMA), 135.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 40.84 GB/s (RDMA), 133.74 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 18.29 GB/s (RDMA), 59.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 36.32 GB/s (RDMA), 118.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 42.53 GB/s (RDMA), 139.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 42.30 GB/s (RDMA), 138.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 41.94 GB/s (RDMA), 137.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 41.54 GB/s (RDMA), 136.02 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 41.34 GB/s (RDMA), 135.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 40.89 GB/s (RDMA), 133.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 17.86 GB/s (RDMA), 58.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 35.49 GB/s (RDMA), 116.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 41.48 GB/s (RDMA), 135.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 41.61 GB/s (RDMA), 136.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 41.89 GB/s (RDMA), 137.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 41.58 GB/s (RDMA), 136.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 41.01 GB/s (RDMA), 134.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 40.21 GB/s (RDMA), 131.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 17.75 GB/s (RDMA), 58.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 36.20 GB/s (RDMA), 118.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 42.47 GB/s (RDMA), 139.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 41.26 GB/s (RDMA), 135.10 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 41.98 GB/s (RDMA), 137.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 41.58 GB/s (RDMA), 136.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 41.26 GB/s (RDMA), 135.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 40.85 GB/s (RDMA), 133.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 18.23 GB/s (RDMA), 59.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 36.25 GB/s (RDMA), 118.70 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 42.53 GB/s (RDMA), 139.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 42.24 GB/s (RDMA), 138.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 41.95 GB/s (RDMA), 137.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 41.53 GB/s (RDMA), 136.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 41.32 GB/s (RDMA), 135.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 40.81 GB/s (RDMA), 133.64 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 16, RDMA chunk 12: 42.54 GB/s (RDMA), 139.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 35.60 GB/s (RDMA), 116.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 44.63 GB/s (RDMA), 146.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 44.27 GB/s (RDMA), 144.97 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 44.40 GB/s (RDMA), 145.40 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 44.09 GB/s (RDMA), 144.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 44.00 GB/s (RDMA), 144.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 43.79 GB/s (RDMA), 143.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 43.44 GB/s (RDMA), 142.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 34.80 GB/s (RDMA), 113.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 44.26 GB/s (RDMA), 144.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 44.06 GB/s (RDMA), 144.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 44.04 GB/s (RDMA), 144.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 44.03 GB/s (RDMA), 144.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 43.87 GB/s (RDMA), 143.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 43.67 GB/s (RDMA), 143.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 43.34 GB/s (RDMA), 141.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 34.43 GB/s (RDMA), 112.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 44.46 GB/s (RDMA), 145.59 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 44.23 GB/s (RDMA), 144.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 44.16 GB/s (RDMA), 144.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 43.92 GB/s (RDMA), 143.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 43.68 GB/s (RDMA), 143.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 43.41 GB/s (RDMA), 142.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 43.21 GB/s (RDMA), 141.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 35.71 GB/s (RDMA), 116.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 44.70 GB/s (RDMA), 146.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 44.03 GB/s (RDMA), 144.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 44.13 GB/s (RDMA), 144.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 43.81 GB/s (RDMA), 143.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 43.54 GB/s (RDMA), 142.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 43.29 GB/s (RDMA), 141.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 43.02 GB/s (RDMA), 140.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 35.72 GB/s (RDMA), 116.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 44.68 GB/s (RDMA), 146.32 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 43.73 GB/s (RDMA), 143.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 44.18 GB/s (RDMA), 144.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 43.82 GB/s (RDMA), 143.50 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 43.50 GB/s (RDMA), 142.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 43.24 GB/s (RDMA), 141.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 43.05 GB/s (RDMA), 140.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 33.22 GB/s (RDMA), 108.80 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 44.63 GB/s (RDMA), 146.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 44.19 GB/s (RDMA), 144.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 44.13 GB/s (RDMA), 144.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 43.79 GB/s (RDMA), 143.41 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 43.51 GB/s (RDMA), 142.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 43.23 GB/s (RDMA), 141.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 43.23 GB/s (RDMA), 141.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 34.83 GB/s (RDMA), 114.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 44.66 GB/s (RDMA), 146.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 44.20 GB/s (RDMA), 144.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 44.16 GB/s (RDMA), 144.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 43.83 GB/s (RDMA), 143.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 43.55 GB/s (RDMA), 142.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 43.25 GB/s (RDMA), 141.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 43.27 GB/s (RDMA), 141.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 34.91 GB/s (RDMA), 114.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 44.66 GB/s (RDMA), 146.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 44.26 GB/s (RDMA), 144.93 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 44.16 GB/s (RDMA), 144.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 43.83 GB/s (RDMA), 143.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 43.50 GB/s (RDMA), 142.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 43.23 GB/s (RDMA), 141.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 43.17 GB/s (RDMA), 141.38 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 16, RDMA chunk 8: 44.70 GB/s (RDMA), 146.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 42.17 GB/s (RDMA), 138.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 43.48 GB/s (RDMA), 142.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 43.14 GB/s (RDMA), 141.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 42.82 GB/s (RDMA), 140.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 42.39 GB/s (RDMA), 138.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 42.03 GB/s (RDMA), 137.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 32: 41.81 GB/s (RDMA), 136.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 8: 43.48 GB/s (RDMA), 142.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 12: 43.45 GB/s (RDMA), 142.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 16: 43.25 GB/s (RDMA), 141.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 20: 42.93 GB/s (RDMA), 140.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 24: 42.56 GB/s (RDMA), 139.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 28: 42.23 GB/s (RDMA), 138.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 32: 42.02 GB/s (RDMA), 137.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 43.52 GB/s (RDMA), 142.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 16: 43.22 GB/s (RDMA), 141.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 20: 42.87 GB/s (RDMA), 140.38 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 24: 42.42 GB/s (RDMA), 138.90 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 28: 42.17 GB/s (RDMA), 138.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 32: 42.05 GB/s (RDMA), 137.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 43.08 GB/s (RDMA), 141.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 43.42 GB/s (RDMA), 142.18 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 43.08 GB/s (RDMA), 141.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 42.79 GB/s (RDMA), 140.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 42.47 GB/s (RDMA), 139.06 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 42.08 GB/s (RDMA), 137.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 42.00 GB/s (RDMA), 137.53 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)
[rank 0] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.74 us, min_t=168.26 us, max_t=179.33 us
[rank 5] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.72 us, min_t=167.87 us, max_t=176.86 us
[rank 3] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.94 us, min_t=167.78 us, max_t=177.95 us
[rank 1] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.97 us, min_t=166.18 us, max_t=178.37 us
[rank 6] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.74 us, min_t=167.10 us, max_t=177.66 us
[rank 7] Dispatch + combine bandwidth: 12.14 GB/s, avg_t=172.73 us, min_t=165.98 us, max_t=179.81 us
[rank 4] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.83 us, min_t=166.40 us, max_t=176.74 us
[rank 2] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.85 us, min_t=167.78 us, max_t=176.80 us
[rank 6] Dispatch bandwidth: 9.99 GB/s, avg_t=71.06 us | Combine bandwidth: 14.03 GB/s, avg_t=97.83 us
[rank 4] Dispatch bandwidth: 9.32 GB/s, avg_t=76.18 us | Combine bandwidth: 13.88 GB/s, avg_t=98.85 us
[rank 0] Dispatch bandwidth: 10.51 GB/s, avg_t=67.51 us | Combine bandwidth: 13.71 GB/s, avg_t=100.06 us
[rank 7] Dispatch bandwidth: 10.29 GB/s, avg_t=69.47 us | Combine bandwidth: 14.18 GB/s, avg_t=97.50 us
[rank 3] Dispatch bandwidth: 9.00 GB/s, avg_t=78.86 us | Combine bandwidth: 13.90 GB/s, avg_t=98.72 us
[rank 1] Dispatch bandwidth: 8.79 GB/s, avg_t=80.75 us | Combine bandwidth: 13.66 GB/s, avg_t=100.48 us
[rank 5] Dispatch bandwidth: 10.21 GB/s, avg_t=69.53 us | Combine bandwidth: 13.83 GB/s, avg_t=99.24 us
[rank 2] Dispatch bandwidth: 10.52 GB/s, avg_t=67.44 us | Combine bandwidth: 13.55 GB/s, avg_t=101.29 us
[rank 0] Dispatch send/recv time: 17.49 us | Combine send/recv time: 19.97 us
[rank 7] Dispatch send/recv time: 18.52 us | Combine send/recv time: 19.97 us
[rank 2] Dispatch send/recv time: 18.38 us | Combine send/recv time: 20.40 us
[rank 4] Dispatch send/recv time: 18.54 us | Combine send/recv time: 20.16 us
[rank 1] Dispatch send/recv time: 18.32 us | Combine send/recv time: 19.88 us
[rank 5] Dispatch send/recv time: 18.56 us | Combine send/recv time: 20.14 us
[rank 3] Dispatch send/recv time: 18.68 us | Combine send/recv time: 19.77 us
[rank 6] Dispatch send/recv time: 18.79 us | Combine send/recv time: 19.95 us
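To compare runs like these without reading every line, the `[tuning]` output can be reduced to its fastest RDMA setting. The following is a rough sketch only: the regular expression is written against the line format shown in this issue, and `best_rdma` is a hypothetical helper, not something provided by the DeepEP tests. It works on whatever slice of lines it is given, so the dispatch (FP8), dispatch (BF16), and combine sections should be passed separately.

```python
import re

# Matches per-configuration lines; the "[tuning] Best ..." summary lines do not match.
TUNING_RE = re.compile(
    r"\[tuning\] SMs (\d+), NVL chunk (\d+), RDMA chunk (\d+): "
    r"([\d.]+) GB/s \(RDMA\), ([\d.]+) GB/s \(NVL\)"
)


def best_rdma(lines):
    """Return (rdma_gbps, nvl_chunk, rdma_chunk) of the fastest line, or None."""
    best = None
    for line in lines:
        m = TUNING_RE.search(line)
        if m is None:
            continue  # not a per-configuration tuning line
        candidate = (float(m.group(4)), int(m.group(2)), int(m.group(3)))
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best


# Three lines copied from the FP8 dispatch section of the log.
sample = [
    "[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 18.57 GB/s (RDMA), 60.82 GB/s (NVL)",
    "[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 42.54 GB/s (RDMA), 139.30 GB/s (NVL)",
    "[tuning] Best dispatch (FP8): SMs 24, NVL chunk 16, RDMA chunk 12: 42.54 GB/s (RDMA), 139.30 GB/s (NVL)",
]
print(best_rdma(sample))
```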
config: test_ll_compatibility=True
export NVSHMEM_HCA_LIST="mlx5_bond_0:1,mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1"
test result:
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 11.47 GB/s (RDMA), 37.83 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 16.59 GB/s (RDMA), 54.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 19.27 GB/s (RDMA), 63.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 20.84 GB/s (RDMA), 68.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 22.06 GB/s (RDMA), 72.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 23.05 GB/s (RDMA), 76.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 23.40 GB/s (RDMA), 77.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 23.92 GB/s (RDMA), 78.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 11.48 GB/s (RDMA), 37.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 16.56 GB/s (RDMA), 54.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 19.19 GB/s (RDMA), 63.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 20.73 GB/s (RDMA), 68.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 21.96 GB/s (RDMA), 72.45 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 23.11 GB/s (RDMA), 76.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 23.56 GB/s (RDMA), 77.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 23.86 GB/s (RDMA), 78.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 16.53 GB/s (RDMA), 54.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 19.16 GB/s (RDMA), 63.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 20.56 GB/s (RDMA), 67.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 21.98 GB/s (RDMA), 72.50 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 23.15 GB/s (RDMA), 76.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 23.53 GB/s (RDMA), 77.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 23.87 GB/s (RDMA), 78.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 16.59 GB/s (RDMA), 54.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 19.19 GB/s (RDMA), 63.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 20.57 GB/s (RDMA), 67.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 22.00 GB/s (RDMA), 72.59 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 23.07 GB/s (RDMA), 76.11 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 23.51 GB/s (RDMA), 77.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 23.85 GB/s (RDMA), 78.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 11.48 GB/s (RDMA), 37.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 16.53 GB/s (RDMA), 54.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 19.18 GB/s (RDMA), 63.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 20.54 GB/s (RDMA), 67.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 22.02 GB/s (RDMA), 72.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 23.06 GB/s (RDMA), 76.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 23.53 GB/s (RDMA), 77.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 11.45 GB/s (RDMA), 37.77 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 16.54 GB/s (RDMA), 54.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 19.18 GB/s (RDMA), 63.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 20.53 GB/s (RDMA), 67.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 21.79 GB/s (RDMA), 71.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 23.06 GB/s (RDMA), 76.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 23.51 GB/s (RDMA), 77.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 23.78 GB/s (RDMA), 78.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 11.46 GB/s (RDMA), 37.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 16.54 GB/s (RDMA), 54.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 19.18 GB/s (RDMA), 63.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 20.57 GB/s (RDMA), 67.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 22.07 GB/s (RDMA), 72.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 23.00 GB/s (RDMA), 75.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 23.59 GB/s (RDMA), 77.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 23.88 GB/s (RDMA), 78.78 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 16.57 GB/s (RDMA), 54.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 19.20 GB/s (RDMA), 63.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 20.54 GB/s (RDMA), 67.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 22.03 GB/s (RDMA), 72.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 23.04 GB/s (RDMA), 75.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 23.51 GB/s (RDMA), 77.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 23.84 GB/s (RDMA), 78.64 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 16.40 GB/s (RDMA), 54.11 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 20.66 GB/s (RDMA), 68.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 22.89 GB/s (RDMA), 75.51 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 24.60 GB/s (RDMA), 81.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 24.64 GB/s (RDMA), 81.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 24.58 GB/s (RDMA), 81.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 24.53 GB/s (RDMA), 80.93 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 24.50 GB/s (RDMA), 80.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 16.39 GB/s (RDMA), 54.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 20.72 GB/s (RDMA), 68.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 22.86 GB/s (RDMA), 75.40 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 24.58 GB/s (RDMA), 81.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 24.64 GB/s (RDMA), 81.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 24.61 GB/s (RDMA), 81.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 24.52 GB/s (RDMA), 80.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 24.45 GB/s (RDMA), 80.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 16.37 GB/s (RDMA), 54.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 20.71 GB/s (RDMA), 68.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 22.86 GB/s (RDMA), 75.43 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 24.60 GB/s (RDMA), 81.15 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 24.58 GB/s (RDMA), 81.10 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 24.48 GB/s (RDMA), 80.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 24.51 GB/s (RDMA), 80.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 24.44 GB/s (RDMA), 80.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 16.38 GB/s (RDMA), 54.03 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 20.71 GB/s (RDMA), 68.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 22.90 GB/s (RDMA), 75.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 24.62 GB/s (RDMA), 81.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 24.55 GB/s (RDMA), 81.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 24.48 GB/s (RDMA), 80.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 24.48 GB/s (RDMA), 80.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 24.45 GB/s (RDMA), 80.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 16.38 GB/s (RDMA), 54.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 20.70 GB/s (RDMA), 68.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 22.86 GB/s (RDMA), 75.42 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 24.58 GB/s (RDMA), 81.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 24.59 GB/s (RDMA), 81.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 24.44 GB/s (RDMA), 80.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 24.46 GB/s (RDMA), 80.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 24.41 GB/s (RDMA), 80.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 16.38 GB/s (RDMA), 54.03 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 20.72 GB/s (RDMA), 68.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 22.87 GB/s (RDMA), 75.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 24.63 GB/s (RDMA), 81.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 24.52 GB/s (RDMA), 80.90 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 24.50 GB/s (RDMA), 80.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 24.46 GB/s (RDMA), 80.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 24.42 GB/s (RDMA), 80.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 16.36 GB/s (RDMA), 53.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 20.69 GB/s (RDMA), 68.26 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 22.88 GB/s (RDMA), 75.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 24.63 GB/s (RDMA), 81.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 24.52 GB/s (RDMA), 80.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 24.39 GB/s (RDMA), 80.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 24.50 GB/s (RDMA), 80.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 24.42 GB/s (RDMA), 80.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 16.43 GB/s (RDMA), 54.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 20.72 GB/s (RDMA), 68.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 22.88 GB/s (RDMA), 75.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 24.63 GB/s (RDMA), 81.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 24.62 GB/s (RDMA), 81.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 24.47 GB/s (RDMA), 80.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 24.42 GB/s (RDMA), 80.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 24.35 GB/s (RDMA), 80.34 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 4, RDMA chunk 20: 24.64 GB/s (RDMA), 81.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 20.47 GB/s (RDMA), 67.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 23.02 GB/s (RDMA), 75.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 24.48 GB/s (RDMA), 80.77 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 24.41 GB/s (RDMA), 80.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 24.30 GB/s (RDMA), 80.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 24.20 GB/s (RDMA), 79.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 32: 24.16 GB/s (RDMA), 79.69 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 8: 20.81 GB/s (RDMA), 68.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 12: 23.00 GB/s (RDMA), 75.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 16: 24.48 GB/s (RDMA), 80.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 20: 24.44 GB/s (RDMA), 80.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 24: 24.36 GB/s (RDMA), 80.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 28: 24.27 GB/s (RDMA), 80.05 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 32: 24.24 GB/s (RDMA), 79.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 20.81 GB/s (RDMA), 68.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 22.98 GB/s (RDMA), 75.79 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 16: 24.43 GB/s (RDMA), 80.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 20: 24.43 GB/s (RDMA), 80.59 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 24: 24.35 GB/s (RDMA), 80.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 28: 24.24 GB/s (RDMA), 79.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 32: 24.22 GB/s (RDMA), 79.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 20.85 GB/s (RDMA), 68.78 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 22.96 GB/s (RDMA), 75.74 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 24.42 GB/s (RDMA), 80.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 24.44 GB/s (RDMA), 80.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 24.32 GB/s (RDMA), 80.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 24.22 GB/s (RDMA), 79.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 24.20 GB/s (RDMA), 79.84 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 16: 24.48 GB/s (RDMA), 80.77 GB/s (NVL)
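The tuning output above is easier to compare between runs once it is parsed. The following is a minimal sketch, assuming only the `[tuning]` line format that appears in this thread; `parse_tuning` and `best_by_rdma` are hypothetical helper names for illustration and are not part of DeepEP.

```python
import re

# Per-configuration entries look like:
#   [tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 23.92 GB/s (RDMA), 78.89 GB/s (NVL)
# "[tuning] Best ..." summary entries do not match, because "SMs" must
# directly follow "[tuning] ".
TUNING_RE = re.compile(
    r"\[tuning\] SMs (\d+), NVL chunk (\d+), RDMA chunk (\d+): "
    r"([\d.]+) GB/s \(RDMA\), ([\d.]+) GB/s \(NVL\)"
)


def parse_tuning(log_text):
    """Return one dict per per-configuration [tuning] entry.

    finditer is used instead of line-by-line matching so that logs pasted
    with several entries on a single line are handled as well.
    """
    rows = []
    for m in TUNING_RE.finditer(log_text):
        sms, nvl_chunk, rdma_chunk, rdma, nvl = m.groups()
        rows.append({
            "sms": int(sms),
            "nvl_chunk": int(nvl_chunk),
            "rdma_chunk": int(rdma_chunk),
            "rdma_gbps": float(rdma),
            "nvl_gbps": float(nvl),
        })
    return rows


def best_by_rdma(rows):
    """Pick the entry with the highest reported RDMA bandwidth."""
    return max(rows, key=lambda r: r["rdma_gbps"])


# Three entries copied from the log in this thread.
SAMPLE = """\
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 11.47 GB/s (RDMA), 37.83 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 23.92 GB/s (RDMA), 78.89 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL)
"""

if __name__ == "__main__":
    print(best_by_rdma(parse_tuning(SAMPLE)))
```

Note that one log contains three sweeps (FP8 dispatch, BF16 dispatch, combine), so in practice the text would need to be split at each "Best ..." entry before picking a maximum per sweep.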
@whybeyoung Have you also configured these two environment variables? NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
Can you show me your machine environment?
commands: `nvidia-smi topo -m` and `ibv_devinfo`
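For reference, the `NVSHMEM_HCA_PE_MAPPING` value quoted above can be assembled from a device list rather than typed by hand. This is a sketch under the assumption that each entry has the form `<hca>:<port>:<count>`, with `count` being the number of PEs assigned to that HCA (check the NVSHMEM environment variable documentation for your version); `build_hca_pe_mapping` is a hypothetical helper, not an NVSHMEM or DeepEP API.

```python
import os


def build_hca_pe_mapping(hcas, port=1, pes_per_hca=2):
    """Join HCA names into a comma-separated "<hca>:<port>:<count>" string.

    Assumption: the third field is the number of PEs mapped to each HCA,
    so 4 HCAs with pes_per_hca=2 cover the 8 GPUs of one node.
    """
    return ",".join(f"{hca}:{port}:{pes_per_hca}" for hca in hcas)


if __name__ == "__main__":
    # Device names taken from the ibv_devinfo output in this thread.
    hcas = [f"mlx5_bond_{i}" for i in range(4)]
    os.environ["NVSHMEM_ENABLE_NIC_PE_MAPPING"] = "1"
    os.environ["NVSHMEM_HCA_PE_MAPPING"] = build_hca_pe_mapping(hcas)
    print(os.environ["NVSHMEM_HCA_PE_MAPPING"])
```

The variables have to be set before NVSHMEM is initialized, so setting them in the shell before launching `test_internode.py` is equivalent.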
Hi there, could you please try this branch https://github.com/deepseek-ai/DeepEP/tree/try_fix_roce_mqp and see if it resolves the issue?
Thank you for your help.
That commit still does not solve the problem; the same error is reported. I tried many times and it reproduces 100% of the time.
My environment (two nodes):
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0    X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  SYS   SYS   0-47,96-143    0              N/A
GPU1    NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  SYS   SYS   0-47,96-143    0              N/A
GPU2    NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  NODE  SYS   SYS   0-47,96-143    0              N/A
GPU3    NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  PIX   SYS   SYS   0-47,96-143    0              N/A
GPU4    NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   PIX   NODE  48-95,144-191  1              N/A
GPU5    NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   NODE  NODE  48-95,144-191  1              N/A
GPU6    NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   NODE  PIX   48-95,144-191  1              N/A
GPU7    NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   NODE  NODE  48-95,144-191  1              N/A
NIC0    NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  SYS   SYS
NIC1    NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  X     SYS   SYS
NIC2    SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   X     NODE
NIC3    SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   NODE  X

Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3

hca_id: mlx5_bond_0
    transport: InfiniBand (0)
    fw_ver: 32.39.3804
    node_guid: 58a2:e103:00d5:28d4
    sys_image_guid: 58a2:e103:00d5:28d4
    vendor_id: 0x02c9
    vendor_part_id: 41692
    hw_ver: 0x1
    board_id: MT_0000000884
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: Ethernet
hca_id: mlx5_bond_1
    transport: InfiniBand (0)
    fw_ver: 32.39.3804
    node_guid: 58a2:e103:00f7:6a90
    sys_image_guid: 58a2:e103:00f7:6a90
    vendor_id: 0x02c9
    vendor_part_id: 41692
    hw_ver: 0x1
    board_id: MT_0000000884
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: Ethernet
hca_id: mlx5_bond_2
    transport: InfiniBand (0)
    fw_ver: 32.39.3804
    node_guid: 58a2:e103:00d8:061c
    sys_image_guid: 58a2:e103:00d8:061c
    vendor_id: 0x02c9
    vendor_part_id: 41692
    hw_ver: 0x1
    board_id: MT_0000000884
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: Ethernet
hca_id: mlx5_bond_3
    transport: InfiniBand (0)
    fw_ver: 32.39.3804
    node_guid: 58a2:e103:00dd:eee6
    sys_image_guid: 58a2:e103:00dd:eee6
    vendor_id: 0x02c9
    vendor_part_id: 41692
    hw_ver: 0x1
    board_id: MT_0000000884
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: Ethernet

@sphish In my test environment, I analyzed this problem. When I configure these two environment variables (NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"), the program times out, and this reproduces 100% of the time. If I don't configure these two environment variables, the program doesn't time out, but the bandwidth is only slightly above 20 GB/s.
config: test_ll_compatibility=True export NVSHMEM_ENABLE_NIC_PE_MAPPING=1 export NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
test result:
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 18.57 GB/s (RDMA), 60.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 37.50 GB/s (RDMA), 122.79 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 41.02 GB/s (RDMA), 134.32 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 41.88 GB/s (RDMA), 137.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 39.79 GB/s (RDMA), 130.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 41.49 GB/s (RDMA), 135.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 40.73 GB/s (RDMA), 133.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 41.08 GB/s (RDMA), 134.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 17.82 GB/s (RDMA), 58.35 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 36.30 GB/s (RDMA), 118.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 41.62 GB/s (RDMA), 136.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 41.98 GB/s (RDMA), 137.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 42.10 GB/s (RDMA), 137.86 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 41.07 GB/s (RDMA), 134.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 41.75 GB/s (RDMA), 136.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 40.87 GB/s (RDMA), 133.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 17.91 GB/s (RDMA), 58.64 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 35.96 GB/s (RDMA), 117.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 42.50 GB/s (RDMA), 139.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 42.35 GB/s (RDMA), 138.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 42.00 GB/s (RDMA), 137.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 40.97 GB/s (RDMA), 134.18 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 41.41 GB/s (RDMA), 135.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 40.30 GB/s (RDMA), 131.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 17.86 GB/s (RDMA), 58.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 35.70 GB/s (RDMA), 116.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 42.54 GB/s (RDMA), 139.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 42.18 GB/s (RDMA), 138.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 41.95 GB/s (RDMA), 137.38 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 41.59 GB/s (RDMA), 136.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 41.31 GB/s (RDMA), 135.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 40.84 GB/s (RDMA), 133.74 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 18.29 GB/s (RDMA), 59.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 36.32 GB/s (RDMA), 118.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 42.53 GB/s (RDMA), 139.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 42.30 GB/s (RDMA), 138.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 41.94 GB/s (RDMA), 137.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 41.54 GB/s (RDMA), 136.02 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 41.34 GB/s (RDMA), 135.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 40.89 GB/s (RDMA), 133.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 17.86 GB/s (RDMA), 58.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 35.49 GB/s (RDMA), 116.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 41.48 GB/s (RDMA), 135.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 41.61 GB/s (RDMA), 136.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 41.89 GB/s (RDMA), 137.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 41.58 GB/s (RDMA), 136.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 41.01 GB/s (RDMA), 134.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 40.21 GB/s (RDMA), 131.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 17.75 GB/s (RDMA), 58.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 36.20 GB/s (RDMA), 118.55 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 42.47 GB/s (RDMA), 139.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 41.26 GB/s (RDMA), 135.10 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 41.98 GB/s (RDMA), 137.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 41.58 GB/s (RDMA), 136.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 41.26 GB/s (RDMA), 135.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 40.85 GB/s (RDMA), 133.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 18.23 GB/s (RDMA), 59.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 36.25 GB/s (RDMA), 118.70 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 42.53 GB/s (RDMA), 139.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 42.24 GB/s (RDMA), 138.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 41.95 GB/s (RDMA), 137.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 41.53 GB/s (RDMA), 136.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 41.32 GB/s (RDMA), 135.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 40.81 GB/s (RDMA), 133.64 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 16, RDMA chunk 12: 42.54 GB/s (RDMA), 139.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 35.60 GB/s (RDMA), 116.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 44.63 GB/s (RDMA), 146.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 44.27 GB/s (RDMA), 144.97 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 44.40 GB/s (RDMA), 145.40 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 44.09 GB/s (RDMA), 144.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 44.00 GB/s (RDMA), 144.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 43.79 GB/s (RDMA), 143.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 43.44 GB/s (RDMA), 142.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 34.80 GB/s (RDMA), 113.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 44.26 GB/s (RDMA), 144.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 44.06 GB/s (RDMA), 144.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 44.04 GB/s (RDMA), 144.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 44.03 GB/s (RDMA), 144.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 43.87 GB/s (RDMA), 143.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 43.67 GB/s (RDMA), 143.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 43.34 GB/s (RDMA), 141.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 34.43 GB/s (RDMA), 112.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 44.46 GB/s (RDMA), 145.59 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 44.23 GB/s (RDMA), 144.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 44.16 GB/s (RDMA), 144.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 43.92 GB/s (RDMA), 143.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 43.68 GB/s (RDMA), 143.04 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 43.41 GB/s (RDMA), 142.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 43.21 GB/s (RDMA), 141.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 35.71 GB/s (RDMA), 116.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 44.70 GB/s (RDMA), 146.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 44.03 GB/s (RDMA), 144.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 44.13 GB/s (RDMA), 144.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 43.81 GB/s (RDMA), 143.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 43.54 GB/s (RDMA), 142.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 43.29 GB/s (RDMA), 141.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 43.02 GB/s (RDMA), 140.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 35.72 GB/s (RDMA), 116.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 44.68 GB/s (RDMA), 146.32 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 43.73 GB/s (RDMA), 143.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 44.18 GB/s (RDMA), 144.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 43.82 GB/s (RDMA), 143.50 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 43.50 GB/s (RDMA), 142.47 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 43.24 GB/s (RDMA), 141.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 43.05 GB/s (RDMA), 140.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 33.22 GB/s (RDMA), 108.80 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 44.63 GB/s (RDMA), 146.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 44.19 GB/s (RDMA), 144.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 44.13 GB/s (RDMA), 144.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 43.79 GB/s (RDMA), 143.41 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 43.51 GB/s (RDMA), 142.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 43.23 GB/s (RDMA), 141.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 43.23 GB/s (RDMA), 141.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 34.83 GB/s (RDMA), 114.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 44.66 GB/s (RDMA), 146.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 44.20 GB/s (RDMA), 144.76 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 44.16 GB/s (RDMA), 144.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 43.83 GB/s (RDMA), 143.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 43.55 GB/s (RDMA), 142.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 43.25 GB/s (RDMA), 141.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 43.27 GB/s (RDMA), 141.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 34.91 GB/s (RDMA), 114.34 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 44.66 GB/s (RDMA), 146.24 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 44.26 GB/s (RDMA), 144.93 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 44.16 GB/s (RDMA), 144.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 43.83 GB/s (RDMA), 143.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 43.50 GB/s (RDMA), 142.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 43.23 GB/s (RDMA), 141.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 43.17 GB/s (RDMA), 141.38 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 16, RDMA chunk 8: 44.70 GB/s (RDMA), 146.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 42.17 GB/s (RDMA), 138.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 43.48 GB/s (RDMA), 142.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 43.14 GB/s (RDMA), 141.27 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 42.82 GB/s (RDMA), 140.22 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 42.39 GB/s (RDMA), 138.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 42.03 GB/s (RDMA), 137.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 32: 41.81 GB/s (RDMA), 136.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 8: 43.48 GB/s (RDMA), 142.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 12: 43.45 GB/s (RDMA), 142.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 16: 43.25 GB/s (RDMA), 141.62 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 20: 42.93 GB/s (RDMA), 140.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 24: 42.56 GB/s (RDMA), 139.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 28: 42.23 GB/s (RDMA), 138.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 32: 42.02 GB/s (RDMA), 137.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 43.52 GB/s (RDMA), 142.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 16: 43.22 GB/s (RDMA), 141.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 20: 42.87 GB/s (RDMA), 140.38 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 24: 42.42 GB/s (RDMA), 138.90 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 28: 42.17 GB/s (RDMA), 138.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 32: 42.05 GB/s (RDMA), 137.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 43.08 GB/s (RDMA), 141.07 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 43.42 GB/s (RDMA), 142.18 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 43.08 GB/s (RDMA), 141.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 42.79 GB/s (RDMA), 140.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 42.47 GB/s (RDMA), 139.06 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 42.08 GB/s (RDMA), 137.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 42.00 GB/s (RDMA), 137.53 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 3, RDMA chunk 8: 43.66 GB/s (RDMA), 142.96 GB/s (NVL)
[rank 0] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.74 us, min_t=168.26 us, max_t=179.33 us
[rank 5] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.72 us, min_t=167.87 us, max_t=176.86 us
[rank 3] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.94 us, min_t=167.78 us, max_t=177.95 us
[rank 1] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.97 us, min_t=166.18 us, max_t=178.37 us
[rank 6] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.74 us, min_t=167.10 us, max_t=177.66 us
[rank 7] Dispatch + combine bandwidth: 12.14 GB/s, avg_t=172.73 us, min_t=165.98 us, max_t=179.81 us
[rank 4] Dispatch + combine bandwidth: 12.05 GB/s, avg_t=172.83 us, min_t=166.40 us, max_t=176.74 us
[rank 2] Dispatch + combine bandwidth: 12.04 GB/s, avg_t=172.85 us, min_t=167.78 us, max_t=176.80 us
[rank 6] Dispatch bandwidth: 9.99 GB/s, avg_t=71.06 us | Combine bandwidth: 14.03 GB/s, avg_t=97.83 us
[rank 4] Dispatch bandwidth: 9.32 GB/s, avg_t=76.18 us | Combine bandwidth: 13.88 GB/s, avg_t=98.85 us
[rank 0] Dispatch bandwidth: 10.51 GB/s, avg_t=67.51 us | Combine bandwidth: 13.71 GB/s, avg_t=100.06 us
[rank 7] Dispatch bandwidth: 10.29 GB/s, avg_t=69.47 us | Combine bandwidth: 14.18 GB/s, avg_t=97.50 us
[rank 3] Dispatch bandwidth: 9.00 GB/s, avg_t=78.86 us | Combine bandwidth: 13.90 GB/s, avg_t=98.72 us
[rank 1] Dispatch bandwidth: 8.79 GB/s, avg_t=80.75 us | Combine bandwidth: 13.66 GB/s, avg_t=100.48 us
[rank 5] Dispatch bandwidth: 10.21 GB/s, avg_t=69.53 us | Combine bandwidth: 13.83 GB/s, avg_t=99.24 us
[rank 2] Dispatch bandwidth: 10.52 GB/s, avg_t=67.44 us | Combine bandwidth: 13.55 GB/s, avg_t=101.29 us
[rank 0] Dispatch send/recv time: 17.49 us | Combine send/recv time: 19.97 us
[rank 7] Dispatch send/recv time: 18.52 us | Combine send/recv time: 19.97 us
[rank 2] Dispatch send/recv time: 18.38 us | Combine send/recv time: 20.40 us
[rank 4] Dispatch send/recv time: 18.54 us | Combine send/recv time: 20.16 us
[rank 1] Dispatch send/recv time: 18.32 us | Combine send/recv time: 19.88 us
[rank 5] Dispatch send/recv time: 18.56 us | Combine send/recv time: 20.14 us
[rank 3] Dispatch send/recv time: 18.68 us | Combine send/recv time: 19.77 us
[rank 6] Dispatch send/recv time: 18.79 us | Combine send/recv time: 19.95 us
config: test_ll_compatibility=True export NVSHMEM_HCA_LIST="mlx5_bond_0:1,mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1"
test result:
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 11.47 GB/s (RDMA), 37.83 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 16.59 GB/s (RDMA), 54.72 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 19.27 GB/s (RDMA), 63.56 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 20.84 GB/s (RDMA), 68.76 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 22.06 GB/s (RDMA), 72.76 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 23.05 GB/s (RDMA), 76.04 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 23.40 GB/s (RDMA), 77.20 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 23.92 GB/s (RDMA), 78.89 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 11.48 GB/s (RDMA), 37.88 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 16.56 GB/s (RDMA), 54.63 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 19.19 GB/s (RDMA), 63.31 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 20.73 GB/s (RDMA), 68.39 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 21.96 GB/s (RDMA), 72.45 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 23.11 GB/s (RDMA), 76.24 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 23.56 GB/s (RDMA), 77.73 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 23.86 GB/s (RDMA), 78.72 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 16.53 GB/s (RDMA), 54.54 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 19.16 GB/s (RDMA), 63.22 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 20.56 GB/s (RDMA), 67.82 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 21.98 GB/s (RDMA), 72.50 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 23.15 GB/s (RDMA), 76.37 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 23.53 GB/s (RDMA), 77.61 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 23.87 GB/s (RDMA), 78.76 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 11.48 
GB/s (RDMA), 37.86 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 16.59 GB/s (RDMA), 54.72 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 19.19 GB/s (RDMA), 63.30 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 20.57 GB/s (RDMA), 67.85 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 22.00 GB/s (RDMA), 72.59 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 23.07 GB/s (RDMA), 76.11 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 23.51 GB/s (RDMA), 77.54 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 23.85 GB/s (RDMA), 78.67 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 11.48 GB/s (RDMA), 37.88 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 16.53 GB/s (RDMA), 54.53 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 19.18 GB/s (RDMA), 63.27 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 20.54 GB/s (RDMA), 67.75 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 22.02 GB/s (RDMA), 72.65 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 23.06 GB/s (RDMA), 76.08 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 23.53 GB/s (RDMA), 77.62 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 11.45 GB/s (RDMA), 37.77 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 16.54 GB/s (RDMA), 54.58 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 19.18 GB/s (RDMA), 63.28 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 20.53 GB/s (RDMA), 67.73 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 21.79 GB/s (RDMA), 71.89 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 23.06 GB/s (RDMA), 76.07 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 23.51 GB/s (RDMA), 77.57 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 23.78 GB/s (RDMA), 78.46 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 11.46 GB/s (RDMA), 37.81 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 16.54 GB/s (RDMA), 54.58 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 19.18 GB/s (RDMA), 63.27 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 20.57 GB/s (RDMA), 67.86 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 22.07 GB/s (RDMA), 72.81 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 23.00 GB/s (RDMA), 75.87 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 23.59 GB/s (RDMA), 77.81 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 23.88 GB/s (RDMA), 78.78 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 11.48 GB/s (RDMA), 37.86 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 16.57 GB/s (RDMA), 54.66 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 19.20 GB/s (RDMA), 63.34 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 20.54 GB/s (RDMA), 67.75 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 22.03 GB/s (RDMA), 72.68 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 23.04 GB/s (RDMA), 75.99 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 23.51 GB/s (RDMA), 77.56 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 23.84 GB/s (RDMA), 78.64 GB/s (NVL) [tuning] Best dispatch (FP8): SMs 24, NVL chunk 20, RDMA chunk 32: 23.97 GB/s (RDMA), 79.07 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 16.40 GB/s (RDMA), 54.11 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 20.66 GB/s (RDMA), 68.16 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 22.89 GB/s (RDMA), 75.51 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 24.60 GB/s (RDMA), 81.16 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 24.64 GB/s (RDMA), 81.30 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 24.58 GB/s (RDMA), 81.08 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 24.53 GB/s (RDMA), 80.93 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 24.50 GB/s (RDMA), 80.82 GB/s (NVL) [tuning] SMs 24, NVL 
chunk 8, RDMA chunk 4: 16.39 GB/s (RDMA), 54.08 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 20.72 GB/s (RDMA), 68.36 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 22.86 GB/s (RDMA), 75.40 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 24.58 GB/s (RDMA), 81.07 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 24.64 GB/s (RDMA), 81.27 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 24.61 GB/s (RDMA), 81.20 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 24.52 GB/s (RDMA), 80.89 GB/s (NVL) [tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 24.45 GB/s (RDMA), 80.65 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 16.37 GB/s (RDMA), 54.00 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 20.71 GB/s (RDMA), 68.33 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 22.86 GB/s (RDMA), 75.43 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 24.60 GB/s (RDMA), 81.15 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 24.58 GB/s (RDMA), 81.10 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 24.48 GB/s (RDMA), 80.76 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 24.51 GB/s (RDMA), 80.87 GB/s (NVL) [tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 24.44 GB/s (RDMA), 80.62 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 16.38 GB/s (RDMA), 54.03 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 20.71 GB/s (RDMA), 68.33 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 22.90 GB/s (RDMA), 75.53 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 24.62 GB/s (RDMA), 81.23 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 24.55 GB/s (RDMA), 81.00 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 24.48 GB/s (RDMA), 80.75 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 24.48 GB/s (RDMA), 80.75 GB/s (NVL) [tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 24.45 GB/s (RDMA), 80.67 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 16.38 GB/s (RDMA), 
54.04 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 20.70 GB/s (RDMA), 68.28 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 22.86 GB/s (RDMA), 75.42 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 24.58 GB/s (RDMA), 81.08 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 24.59 GB/s (RDMA), 81.13 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 24.44 GB/s (RDMA), 80.61 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 24.46 GB/s (RDMA), 80.67 GB/s (NVL) [tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 24.41 GB/s (RDMA), 80.52 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 16.38 GB/s (RDMA), 54.03 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 20.72 GB/s (RDMA), 68.36 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 22.87 GB/s (RDMA), 75.46 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 24.63 GB/s (RDMA), 81.25 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 24.52 GB/s (RDMA), 80.90 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 24.50 GB/s (RDMA), 80.81 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 24.46 GB/s (RDMA), 80.71 GB/s (NVL) [tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 24.42 GB/s (RDMA), 80.55 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 16.36 GB/s (RDMA), 53.98 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 20.69 GB/s (RDMA), 68.26 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 22.88 GB/s (RDMA), 75.47 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 24.63 GB/s (RDMA), 81.25 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 24.52 GB/s (RDMA), 80.87 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 24.39 GB/s (RDMA), 80.46 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 24.50 GB/s (RDMA), 80.82 GB/s (NVL) [tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 24.42 GB/s (RDMA), 80.56 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 16.43 GB/s (RDMA), 54.19 GB/s (NVL) [tuning] SMs 24, NVL 
chunk 32, RDMA chunk 8: 20.72 GB/s (RDMA), 68.34 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 22.88 GB/s (RDMA), 75.47 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 24.63 GB/s (RDMA), 81.27 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 24.62 GB/s (RDMA), 81.20 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 24.47 GB/s (RDMA), 80.71 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 24.42 GB/s (RDMA), 80.57 GB/s (NVL) [tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 24.35 GB/s (RDMA), 80.34 GB/s (NVL) [tuning] Best dispatch (BF16): SMs 24, NVL chunk 4, RDMA chunk 20: 24.64 GB/s (RDMA), 81.30 GB/s (NVL) [tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 20.47 GB/s (RDMA), 67.54 GB/s (NVL) [tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 23.02 GB/s (RDMA), 75.95 GB/s (NVL) [tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 24.48 GB/s (RDMA), 80.77 GB/s (NVL) [tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 24.41 GB/s (RDMA), 80.52 GB/s (NVL) [tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 24.30 GB/s (RDMA), 80.17 GB/s (NVL) [tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 24.20 GB/s (RDMA), 79.84 GB/s (NVL) [tuning] SMs 24, NVL chunk 1, RDMA chunk 32: 24.16 GB/s (RDMA), 79.69 GB/s (NVL) [tuning] SMs 24, NVL chunk 2, RDMA chunk 8: 20.81 GB/s (RDMA), 68.65 GB/s (NVL) [tuning] SMs 24, NVL chunk 2, RDMA chunk 12: 23.00 GB/s (RDMA), 75.87 GB/s (NVL) [tuning] SMs 24, NVL chunk 2, RDMA chunk 16: 24.48 GB/s (RDMA), 80.76 GB/s (NVL) [tuning] SMs 24, NVL chunk 2, RDMA chunk 20: 24.44 GB/s (RDMA), 80.63 GB/s (NVL) [tuning] SMs 24, NVL chunk 2, RDMA chunk 24: 24.36 GB/s (RDMA), 80.36 GB/s (NVL) [tuning] SMs 24, NVL chunk 2, RDMA chunk 28: 24.27 GB/s (RDMA), 80.05 GB/s (NVL) [tuning] SMs 24, NVL chunk 2, RDMA chunk 32: 24.24 GB/s (RDMA), 79.96 GB/s (NVL) [tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 20.81 GB/s (RDMA), 68.66 GB/s (NVL) [tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 22.98 GB/s (RDMA), 75.79 GB/s (NVL) [tuning] SMs 24, NVL chunk 3, RDMA chunk 16: 24.43 
GB/s (RDMA), 80.58 GB/s (NVL) [tuning] SMs 24, NVL chunk 3, RDMA chunk 20: 24.43 GB/s (RDMA), 80.59 GB/s (NVL) [tuning] SMs 24, NVL chunk 3, RDMA chunk 24: 24.35 GB/s (RDMA), 80.33 GB/s (NVL) [tuning] SMs 24, NVL chunk 3, RDMA chunk 28: 24.24 GB/s (RDMA), 79.98 GB/s (NVL) [tuning] SMs 24, NVL chunk 3, RDMA chunk 32: 24.22 GB/s (RDMA), 79.89 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 20.85 GB/s (RDMA), 68.78 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 22.96 GB/s (RDMA), 75.74 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 24.42 GB/s (RDMA), 80.55 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 24.44 GB/s (RDMA), 80.62 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 24.32 GB/s (RDMA), 80.23 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 24.22 GB/s (RDMA), 79.91 GB/s (NVL) [tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 24.20 GB/s (RDMA), 79.84 GB/s (NVL) [tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 16: 24.48 GB/s (RDMA), 80.77 GB/s (NVL)
@whybeyoung Have you also configured these two environment variables?
NVSHMEM_ENABLE_NIC_PE_MAPPING=1
NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
Can you show me your machine environment? Please run these two commands:
nvidia-smi topo -m
ibv_devinfo
[root@maas-h20-007 ~]# nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  SYS   SYS   0-47,96-143    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  SYS   SYS   0-47,96-143    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  NODE  SYS   SYS   0-47,96-143    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  PIX   SYS   SYS   0-47,96-143    0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   PIX   NODE  48-95,144-191  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   NODE  NODE  48-95,144-191  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   NODE  PIX   48-95,144-191  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   NODE  NODE  48-95,144-191  1              N/A
NIC0  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  SYS   SYS
NIC1  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  X     SYS   SYS
NIC2  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   X     NODE
NIC3  SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   NODE  X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
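To make the matrix easier to read, here is a minimal Python sketch that ranks the link types using the legend (PIX closest, SYS farthest) and picks the closest NIC for each GPU. The GPU-to-NIC entries are copied from the nvidia-smi output above; the helper names (`LINK_RANK`, `nearest_nic`, `nearest_nics`) are made up for illustration, and this is not how NVSHMEM itself selects devices.

```python
# Link types from the nvidia-smi legend, ordered from closest to farthest.
LINK_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}

# GPU-to-NIC columns copied from the topology matrix above
# (rows GPU0..GPU7, columns NIC0..NIC3).
GPU_TO_NIC = [
    ["NODE", "NODE", "SYS", "SYS"],   # GPU0
    ["PIX", "NODE", "SYS", "SYS"],    # GPU1
    ["NODE", "NODE", "SYS", "SYS"],   # GPU2
    ["NODE", "PIX", "SYS", "SYS"],    # GPU3
    ["SYS", "SYS", "PIX", "NODE"],    # GPU4
    ["SYS", "SYS", "NODE", "NODE"],   # GPU5
    ["SYS", "SYS", "NODE", "PIX"],    # GPU6
    ["SYS", "SYS", "NODE", "NODE"],   # GPU7
]


def nearest_nic(links):
    """Return the index of the closest NIC; ties go to the lowest index."""
    return min(range(len(links)), key=lambda j: LINK_RANK[links[j]])


def nearest_nics(matrix):
    """Closest NIC index for every GPU row of the matrix."""
    return [nearest_nic(row) for row in matrix]


if __name__ == "__main__":
    for gpu, nic in enumerate(nearest_nics(GPU_TO_NIC)):
        print(f"GPU{gpu} -> NIC{nic} ({GPU_TO_NIC[gpu][nic]})")
```

On this machine only four of the eight GPUs have a PIX link to a NIC; the other four reach their closest NIC through a NODE link, so two GPUs per NUMA node have no NIC on their own PCIe switch.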
[root@maas-h20-007 ~]# ibv_devinfo
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 32.39.3920
node_guid: e09d:7303:0074:6f94
sys_image_guid: e09d:7303:0074:6f94
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000001093
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_1
transport: InfiniBand (0)
fw_ver: 32.39.3920
node_guid: e09d:7303:0074:038c
sys_image_guid: e09d:7303:0074:038c
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000001093
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_2
transport: InfiniBand (0)
fw_ver: 32.39.3920
node_guid: e09d:7303:0095:135e
sys_image_guid: e09d:7303:0095:135e
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000001093
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_3
transport: InfiniBand (0)
fw_ver: 32.39.3920
node_guid: e09d:7303:0074:03b8
sys_image_guid: e09d:7303:0074:03b8
vendor_id: 0x02c9
vendor_part_id: 41692
hw_ver: 0x1
board_id: MT_0000001093
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
I found a way to fix the problem.
Set these environment variables:
NVSHMEM_ENABLE_NIC_PE_MAPPING=1
NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
Then apply this hack in NVSHMEM:
int nvshmemi_setup_connections(nvshmemi_state_t *state) {
    int status = 0;
    nvshmem_transport_t *transports = (nvshmem_transport_t *)state->transports;
    nvshmem_transport_t tcurr;
    int savedDev = 0;
    cudaError_t ret = cudaSuccess;

    for (int i = 0; i < state->num_initialized_transports; i++) {
        if (!((state->transport_bitmap) & (1 << i))) continue;
        tcurr = transports[i];
        if (!(tcurr->attr & NVSHMEM_TRANSPORT_ATTR_CONNECTED)) {
            continue;
        }

        int devices_temp = tcurr->n_devices / state->npes_node;
        if (devices_temp == 0) devices_temp = 1;
        const int max_devices_per_pe = devices_temp;
        int selected_devices[max_devices_per_pe];
        int found_devices = 0;
        for (int j = 0; j < max_devices_per_pe; j++) {
            selected_devices[j] = -1;
        }

        // assumes symmetry of transport list at all PEs
        if (tcurr->n_devices <= 1) {
            /* return the index of the first available device.
             * -1 if no devices found.
             */
            selected_devices[0] = tcurr->n_devices - 1;
            found_devices++;
        } else if (nvshmemi_options.ENABLE_NIC_PE_MAPPING) {
            selected_devices[0] =
                nvshmemi_state->mype_node % (tcurr->n_devices > 0 ? tcurr->n_devices : 1);
            ret = cudaGetDevice(&savedDev);
            if (ret != cudaSuccess) {
                status = -3;
                goto out;
            }
            selected_devices[0] = savedDev % (tcurr->n_devices > 0 ? tcurr->n_devices : 1); // fix in here
            INFO(NVSHMEM_INIT,
                 "pid:[%d] NVSHMEM_ENABLE_NIC_PE_MAPPING = 1, savedDev: %d, setting dev_id = %d",
                 getpid(), savedDev, selected_devices[0]);
            INFO(NVSHMEM_INIT, "NVSHMEM_ENABLE_NIC_PE_MAPPING = 1, setting dev_id = %d",
                 selected_devices[0]);
            found_devices++;
        } else {
            nvshmemi_get_devices_by_distance(selected_devices, max_devices_per_pe, tcurr);
            for (int i = 0; i < max_devices_per_pe; i++) {
                if (selected_devices[i] == -1) {
                    break;
                }
                found_devices++;
                INFO(NVSHMEM_INIT,
                     "NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device %d setting dev_id = %d", i,
                     selected_devices[i]);
            }
        }

        /* setting n_devices to 0 is the transports way of
         * letting us know it's managing devices internally.
         */
        if (tcurr->n_devices > 0 && selected_devices[0] == -1) {
            NVSHMEMI_ERROR_JMP(status, NVSHMEMX_ERROR_INTERNAL, out, "No devices selected.\n");
        }

        status = tcurr->host_ops.connect_endpoints(tcurr, selected_devices, found_devices);
        NVSHMEMI_NZ_ERROR_JMP(status, NVSHMEMX_ERROR_INTERNAL, out, "connect EPS failed \n");

        status = nvshmemi_boot_handle.barrier(&nvshmemi_boot_handle);
        NVSHMEMI_NZ_ERROR_JMP(status, NVSHMEMX_ERROR_INTERNAL, out, "barrier failed \n");

        status = nvshmemi_update_device_state();
    }

out:
    return status;
}
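The only behavioural change in the function above is the line marked "fix in here": the NIC index is taken from the current CUDA device instead of from mype_node. A small Python sketch of that arithmetic follows. The function name `nic_by_index` and the example values are hypothetical, and the case where every process reports the same node-local PE index is an assumption used for illustration, not something confirmed in this thread.

```python
def nic_by_index(index: int, n_nics: int) -> int:
    """Same arithmetic as `x % (n_devices > 0 ? n_devices : 1)` in the C code."""
    return index % (n_nics if n_nics > 0 else 1)


if __name__ == "__main__":
    n_nics = 4  # mlx5_bond_0 .. mlx5_bond_3 on the machine above

    # Patched behaviour: index by CUDA device id, here 0..7 for 8 GPUs.
    by_cuda_device = [nic_by_index(dev, n_nics) for dev in range(8)]
    print("by CUDA device:", by_cuda_device)

    # Hypothetical case: every process sees the same node-local PE index
    # (assumed to be 0 here), so the unpatched formula picks the same NIC.
    by_local_pe = [nic_by_index(0, n_nics) for _ in range(8)]
    print("by local PE (all 0):", by_local_pe)
```

With eight GPUs and four NICs the device-based formula spreads the processes as two per NIC, which is consistent with the ":2" in the NVSHMEM_HCA_PE_MAPPING value used above. It does not follow the PIX links in the topology matrix, so it balances load rather than minimising PCIe distance.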
@alpha-baby Hi, I'm not very familiar with bonded NICs. Would it be possible to add me on WeChat: Sphizzz? I have a few questions I'd like to ask.
Your solution does not work for me.
Hi there, could you please try this branch https://github.com/deepseek-ai/DeepEP/tree/try_fix_roce_mqp and see if it resolves the issue?
Yes, this resolves my test timeout problem.
https://github.com/deepseek-ai/DeepEP/tree/try_fix_roce_mqp
@sphish May I ask if this change will be incorporated into the main branch?
@polarstormx Have you tested this modification and confirmed its effectiveness? Since I haven't been able to reproduce this issue internally, I need to collect some feedback.
This does not work on H20. Tested with the latest DeepEP: bb393e7760f94eb93878f4d62d967a58bd2d777d
@cscyuge Can you show me your machine environment? Please run these two commands:
nvidia-smi topo -m
ibv_devinfo
nvidia-smi topo -m:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  NODE  PIX   PHB   NODE  SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  PHB   PIX   NODE  SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  96-191,288-383  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  96-191,288-383  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   SYS   NODE  NODE  PIX   PHB   96-191,288-383  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   SYS   NODE  NODE  PHB   PIX   96-191,288-383  1              N/A
NIC0  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1  NODE  PIX   PHB   NODE  SYS   SYS   SYS   SYS   NODE  X     PHB   NODE  SYS   SYS   SYS   SYS
NIC2  NODE  PHB   PIX   NODE  SYS   SYS   SYS   SYS   NODE  PHB   X     NODE  SYS   SYS   SYS   SYS
NIC3  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     SYS   SYS   SYS   SYS
NIC4  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE
NIC5  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE
NIC6  SYS   SYS   SYS   SYS   NODE  NODE  PIX   PHB   SYS   SYS   SYS   SYS   NODE  NODE  X     PHB
NIC7  SYS   SYS   SYS   SYS   NODE  NODE  PHB   PIX   SYS   SYS   SYS   SYS   NODE  NODE  PHB   X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
NIC4: mlx5_bond_4
NIC5: mlx5_bond_5
NIC6: mlx5_bond_6
NIC7: mlx5_bond_7
ibv_devinfo:
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 28.39.1002
node_guid: 5c25:7303:0094:212a
sys_image_guid: 5c25:7303:0094:212a
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000834
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_1
transport: InfiniBand (0)
fw_ver: 28.39.1002
node_guid: 5c25:7303:0094:3566
sys_image_guid: 5c25:7303:0094:3566
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000834
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_2
transport: InfiniBand (0)
fw_ver: 28.39.1002
node_guid: 5c25:7303:0094:4256
sys_image_guid: 5c25:7303:0094:4256
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000834
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_3
transport: InfiniBand (0)
fw_ver: 28.39.1002
node_guid: 5c25:7303:0094:214a
sys_image_guid: 5c25:7303:0094:214a
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000834
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_4
transport: InfiniBand (0)
fw_ver: 28.39.1002
node_guid: 5c25:7303:0094:284a
sys_image_guid: 5c25:7303:0094:284a
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000834
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_5
transport: InfiniBand (0)
fw_ver: 28.39.1002
node_guid: 5c25:7303:0094:213a
sys_image_guid: 5c25:7303:0094:213a
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000834
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_6
transport: InfiniBand (0)
fw_ver: 28.39.1002
node_guid: 5c25:7303:0094:23ba
sys_image_guid: 5c25:7303:0094:23ba
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000834
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_bond_7
transport: InfiniBand (0)
fw_ver: 28.39.1002
node_guid: 5c25:7303:0094:288a
sys_image_guid: 5c25:7303:0094:288a
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000834
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
@cscyuge Your machine topology is different from mine: mine has only four network cards, while yours has eight, one per GPU.
My patch should not apply to you, so you don't need to enable NVSHMEM_ENABLE_NIC_PE_MAPPING.
You should set: NVSHMEM_ENABLE_NIC_PE_MAPPING=0
I have applied the change to nvshmemi_setup_connections and tried four 8*H20 nodes with:
# node 0
NVSHMEM_ENABLE_NIC_PE_MAPPING=0 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2,mlx5_bond_4:1:2,mlx5_bond_5:1:2,mlx5_bond_6:1:2,mlx5_bond_7:1:2" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=0 python test_internode.py
# node 1
NVSHMEM_ENABLE_NIC_PE_MAPPING=0 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2,mlx5_bond_4:1:2,mlx5_bond_5:1:2,mlx5_bond_6:1:2,mlx5_bond_7:1:2" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=1 python test_internode.py
# node 2
NVSHMEM_ENABLE_NIC_PE_MAPPING=0 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2,mlx5_bond_4:1:2,mlx5_bond_5:1:2,mlx5_bond_6:1:2,mlx5_bond_7:1:2" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=2 python test_internode.py
# node 3
NVSHMEM_ENABLE_NIC_PE_MAPPING=0 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2,mlx5_bond_4:1:2,mlx5_bond_5:1:2,mlx5_bond_6:1:2,mlx5_bond_7:1:2" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=3 python test_internode.py
I got the same timeout error when running the commands above.
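The four launch commands differ only in RANK, so they are easy to generate. Below is a hedged Python sketch that rebuilds the same environment for any rank. The helper names `build_hca_pe_mapping` and `node_env` are invented for illustration, the `<ip>` placeholder is kept exactly as in the commands, and the third field of each mapping entry is passed through unchanged because its meaning is not spelled out in this thread.

```python
def build_hca_pe_mapping(n_nics, port=1, count=2, prefix="mlx5_bond_"):
    """Build the comma-separated NVSHMEM_HCA_PE_MAPPING value used above."""
    return ",".join(f"{prefix}{i}:{port}:{count}" for i in range(n_nics))


def node_env(rank, world_size, n_nics=8, master_addr="<ip>"):
    """Environment for one node, mirroring the launch commands above."""
    return {
        "NVSHMEM_ENABLE_NIC_PE_MAPPING": "0",
        "NVSHMEM_HCA_PE_MAPPING": build_hca_pe_mapping(n_nics),
        "NCCL_SOCKET_IFNAME": "eth0",
        "NCCL_IB_GID_INDEX": "3",
        "MASTER_ADDR": master_addr,
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
    }


if __name__ == "__main__":
    # Print one shell command per node; nothing is executed here.
    for rank in range(4):
        env = node_env(rank, world_size=4)
        prefix = " ".join(f'{key}="{value}"' for key, value in env.items())
        print(f"# node {rank}")
        print(f"{prefix} python test_internode.py")
```

Generating the commands from one function keeps the NIC list identical on every node, which rules out a per-node typo as the cause of the timeout.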
And I tried 2 8*H20 nodes, got another error message:
...
[rank 3] Dispatch send/recv time: 1001.35 us | Combine send/recv time: 1182.14 us
[rank 0] Dispatch send/recv time: 153.66 us | Combine send/recv time: 195.56 us
[rank 5] Dispatch send/recv time: 96.00 us | Combine send/recv time: 104.76 us
[rank 1] Dispatch send/recv time: 18.15 us | Combine send/recv time: 20.66 us
[rank 7] Dispatch send/recv time: 268.17 us | Combine send/recv time: 308.93 us
[rank 6] Dispatch send/recv time: 19.01 us | Combine send/recv time: 20.84 us
[rank 2] Dispatch send/recv time: 269.79 us | Combine send/recv time: 337.55 us
[rank 4] Dispatch send/recv time: 18.34 us | Combine send/recv time: 20.84 us
[VM-12-10-centos:87868:0:87868] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x561aef362)
[VM-12-10-centos:87866:0:87866] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55ae9bb36)
[VM-12-10-centos:87867:0:87867] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x559eddbd4)
[VM-12-10-centos:87863:0:87863] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55ff4cc52)
[VM-12-10-centos:87869:0:87869] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x5628dadb0)
[VM-12-10-centos:87865:0:87865] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x56549fda5)
[VM-12-10-centos:87864:0:87864] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55613f4f7)
[VM-12-10-centos:87862:0:87862] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x5577c60e0)
==== backtrace (tid: 87863) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000019d57 ibv_dealloc_pd() ???:0
2 0x000000000000ce6d nvshmemt_ibrc_finalize() :0
3 0x0000000000220ab2 nvshmemi_transport_finalize() ???:0
4 0x00000000000b49f9 nvshmemid_hostlib_finalize() ???:0
5 0x00000000001b301f nvshmemi_finalize() ???:0
6 0x0000000000055252 deep_ep::Buffer::~Buffer() /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()() /usr/include/c++/11/bits/unique_ptr.h:85
8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr() /usr/include/c++/11/bits/unique_ptr.h:361
9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc() /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance() :0
11 0x00000000005174d1 pybind11_object_dealloc() :0
12 0x0000000000169b93 _Py_CheckFunctionResult() ???:0
13 0x00000000001a2407 PyObject_DelItem() ???:0
14 0x0000000000181370 PyMapping_Check() ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall() ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
17 0x000000000018b66c _PyFunction_Vectorcall() ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
19 0x000000000018b66c _PyFunction_Vectorcall() ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
21 0x000000000018b66c _PyFunction_Vectorcall() ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
23 0x000000000018b66c _PyFunction_Vectorcall() ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault() ???:0
25 0x000000000018b66c _PyFunction_Vectorcall() ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault() ???:0
27 0x0000000000259f56 PyEval_EvalCode() ???:0
28 0x0000000000259e26 PyEval_EvalCode() ???:0
29 0x0000000000280808 PyUnicode_Tailmatch() ???:0
30 0x000000000027b00f PyInit__collections() ???:0
31 0x0000000000274d91 PyRun_StringFlags() ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags() ???:0
33 0x0000000000273f70 Py_RunMain() ???:0
34 0x000000000024de6d Py_BytesMain() ???:0
35 0x0000000000029d90 __libc_init_first() ???:0
36 0x0000000000029e40 __libc_start_main() ???:0
37 0x000000000024dd65 _start() ???:0
=================================
==== backtrace (tid: 87866) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000019d57 ibv_dealloc_pd() ???:0
2 0x000000000000ce6d nvshmemt_ibrc_finalize() :0
3 0x0000000000220ab2 nvshmemi_transport_finalize() ???:0
4 0x00000000000b49f9 nvshmemid_hostlib_finalize() ???:0
5 0x00000000001b301f nvshmemi_finalize() ???:0
6 0x0000000000055252 deep_ep::Buffer::~Buffer() /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()() /usr/include/c++/11/bits/unique_ptr.h:85
8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr() /usr/include/c++/11/bits/unique_ptr.h:361
9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc() /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance() :0
11 0x00000000005174d1 pybind11_object_dealloc() :0
12 0x0000000000169b93 _Py_CheckFunctionResult() ???:0
13 0x00000000001a2407 PyObject_DelItem() ???:0
14 0x0000000000181370 PyMapping_Check() ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall() ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
17 0x000000000018b66c _PyFunction_Vectorcall() ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
19 0x000000000018b66c _PyFunction_Vectorcall() ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
21 0x000000000018b66c _PyFunction_Vectorcall() ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
23 0x000000000018b66c _PyFunction_Vectorcall() ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault() ???:0
25 0x000000000018b66c _PyFunction_Vectorcall() ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault() ???:0
27 0x0000000000259f56 PyEval_EvalCode() ???:0
28 0x0000000000259e26 PyEval_EvalCode() ???:0
29 0x0000000000280808 PyUnicode_Tailmatch() ???:0
30 0x000000000027b00f PyInit__collections() ???:0
31 0x0000000000274d91 PyRun_StringFlags() ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags() ???:0
33 0x0000000000273f70 Py_RunMain() ???:0
34 0x000000000024de6d Py_BytesMain() ???:0
35 0x0000000000029d90 __libc_init_first() ???:0
36 0x0000000000029e40 __libc_start_main() ???:0
37 0x000000000024dd65 _start() ???:0
=================================
==== backtrace (tid: 87868) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000019d57 ibv_dealloc_pd() ???:0
2 0x000000000000ce6d nvshmemt_ibrc_finalize() :0
3 0x0000000000220ab2 nvshmemi_transport_finalize() ???:0
4 0x00000000000b49f9 nvshmemid_hostlib_finalize() ???:0
5 0x00000000001b301f nvshmemi_finalize() ???:0
6 0x0000000000055252 deep_ep::Buffer::~Buffer() /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()() /usr/include/c++/11/bits/unique_ptr.h:85
8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr() /usr/include/c++/11/bits/unique_ptr.h:361
9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc() /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance() :0
11 0x00000000005174d1 pybind11_object_dealloc() :0
12 0x0000000000169b93 _Py_CheckFunctionResult() ???:0
13 0x00000000001a2407 PyObject_DelItem() ???:0
14 0x0000000000181370 PyMapping_Check() ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall() ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
17 0x000000000018b66c _PyFunction_Vectorcall() ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
19 0x000000000018b66c _PyFunction_Vectorcall() ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
21 0x000000000018b66c _PyFunction_Vectorcall() ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
23 0x000000000018b66c _PyFunction_Vectorcall() ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault() ???:0
25 0x000000000018b66c _PyFunction_Vectorcall() ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault() ???:0
27 0x0000000000259f56 PyEval_EvalCode() ???:0
28 0x0000000000259e26 PyEval_EvalCode() ???:0
29 0x0000000000280808 PyUnicode_Tailmatch() ???:0
30 0x000000000027b00f PyInit__collections() ???:0
31 0x0000000000274d91 PyRun_StringFlags() ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags() ???:0
33 0x0000000000273f70 Py_RunMain() ???:0
34 0x000000000024de6d Py_BytesMain() ???:0
35 0x0000000000029d90 __libc_init_first() ???:0
36 0x0000000000029e40 __libc_start_main() ???:0
37 0x000000000024dd65 _start() ???:0
=================================
==== backtrace (tid: 87865) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000019d57 ibv_dealloc_pd() ???:0
2 0x000000000000ce6d nvshmemt_ibrc_finalize() :0
3 0x0000000000220ab2 nvshmemi_transport_finalize() ???:0
4 0x00000000000b49f9 nvshmemid_hostlib_finalize() ???:0
5 0x00000000001b301f nvshmemi_finalize() ???:0
6 0x0000000000055252 deep_ep::Buffer::~Buffer() /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()() /usr/include/c++/11/bits/unique_ptr.h:85
8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr() /usr/include/c++/11/bits/unique_ptr.h:361
9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc() /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance() :0
11 0x00000000005174d1 pybind11_object_dealloc() :0
12 0x0000000000169b93 _Py_CheckFunctionResult() ???:0
13 0x00000000001a2407 PyObject_DelItem() ???:0
14 0x0000000000181370 PyMapping_Check() ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall() ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
17 0x000000000018b66c _PyFunction_Vectorcall() ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
19 0x000000000018b66c _PyFunction_Vectorcall() ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
21 0x000000000018b66c _PyFunction_Vectorcall() ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
23 0x000000000018b66c _PyFunction_Vectorcall() ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault() ???:0
25 0x000000000018b66c _PyFunction_Vectorcall() ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault() ???:0
27 0x0000000000259f56 PyEval_EvalCode() ???:0
28 0x0000000000259e26 PyEval_EvalCode() ???:0
29 0x0000000000280808 PyUnicode_Tailmatch() ???:0
30 0x000000000027b00f PyInit__collections() ???:0
31 0x0000000000274d91 PyRun_StringFlags() ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags() ???:0
33 0x0000000000273f70 Py_RunMain() ???:0
34 0x000000000024de6d Py_BytesMain() ???:0
35 0x0000000000029d90 __libc_init_first() ???:0
36 0x0000000000029e40 __libc_start_main() ???:0
37 0x000000000024dd65 _start() ???:0
=================================
==== backtrace (tid: 87862) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000001a0d6 ibv_dereg_mr() ???:0
2 0x000000000000cddc nvshmemt_ibrc_finalize() :0
3 0x0000000000220ab2 nvshmemi_transport_finalize() ???:0
4 0x00000000000b49f9 nvshmemid_hostlib_finalize() ???:0
5 0x00000000001b301f nvshmemi_finalize() ???:0
6 0x0000000000055252 deep_ep::Buffer::~Buffer() /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()() /usr/include/c++/11/bits/unique_ptr.h:85
8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr() /usr/include/c++/11/bits/unique_ptr.h:361
9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc() /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance() :0
11 0x00000000005174d1 pybind11_object_dealloc() :0
12 0x0000000000169b93 _Py_CheckFunctionResult() ???:0
13 0x00000000001a2407 PyObject_DelItem() ???:0
14 0x0000000000181370 PyMapping_Check() ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall() ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
17 0x000000000018b66c _PyFunction_Vectorcall() ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
19 0x000000000018b66c _PyFunction_Vectorcall() ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
21 0x000000000018b66c _PyFunction_Vectorcall() ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
23 0x000000000018b66c _PyFunction_Vectorcall() ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault() ???:0
25 0x000000000018b66c _PyFunction_Vectorcall() ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault() ???:0
27 0x0000000000259f56 PyEval_EvalCode() ???:0
28 0x0000000000259e26 PyEval_EvalCode() ???:0
29 0x0000000000280808 PyUnicode_Tailmatch() ???:0
30 0x000000000027b00f PyInit__collections() ???:0
31 0x0000000000274d91 PyRun_StringFlags() ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags() ???:0
33 0x0000000000273f70 Py_RunMain() ???:0
34 0x000000000024de6d Py_BytesMain() ???:0
35 0x0000000000029d90 __libc_init_first() ???:0
36 0x0000000000029e40 __libc_start_main() ???:0
37 0x000000000024dd65 _start() ???:0
=================================
==== backtrace (tid: 87869) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000019d57 ibv_dealloc_pd() ???:0
2 0x000000000000ce6d nvshmemt_ibrc_finalize() :0
3 0x0000000000220ab2 nvshmemi_transport_finalize() ???:0
4 0x00000000000b49f9 nvshmemid_hostlib_finalize() ???:0
5 0x00000000001b301f nvshmemi_finalize() ???:0
6 0x0000000000055252 deep_ep::Buffer::~Buffer() /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()() /usr/include/c++/11/bits/unique_ptr.h:85
8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr() /usr/include/c++/11/bits/unique_ptr.h:361
9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc() /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance() :0
11 0x00000000005174d1 pybind11_object_dealloc() :0
12 0x0000000000169b93 _Py_CheckFunctionResult() ???:0
13 0x00000000001a2407 PyObject_DelItem() ???:0
14 0x0000000000181370 PyMapping_Check() ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall() ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
17 0x000000000018b66c _PyFunction_Vectorcall() ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
19 0x000000000018b66c _PyFunction_Vectorcall() ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
21 0x000000000018b66c _PyFunction_Vectorcall() ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
23 0x000000000018b66c _PyFunction_Vectorcall() ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault() ???:0
25 0x000000000018b66c _PyFunction_Vectorcall() ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault() ???:0
27 0x0000000000259f56 PyEval_EvalCode() ???:0
28 0x0000000000259e26 PyEval_EvalCode() ???:0
29 0x0000000000280808 PyUnicode_Tailmatch() ???:0
30 0x000000000027b00f PyInit__collections() ???:0
31 0x0000000000274d91 PyRun_StringFlags() ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags() ???:0
33 0x0000000000273f70 Py_RunMain() ???:0
34 0x000000000024de6d Py_BytesMain() ???:0
35 0x0000000000029d90 __libc_init_first() ???:0
36 0x0000000000029e40 __libc_start_main() ???:0
37 0x000000000024dd65 _start() ???:0
=================================
==== backtrace (tid: 87864) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000019d57 ibv_dealloc_pd() ???:0
2 0x000000000000ce6d nvshmemt_ibrc_finalize() :0
3 0x0000000000220ab2 nvshmemi_transport_finalize() ???:0
4 0x00000000000b49f9 nvshmemid_hostlib_finalize() ???:0
5 0x00000000001b301f nvshmemi_finalize() ???:0
6 0x0000000000055252 deep_ep::Buffer::~Buffer() /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()() /usr/include/c++/11/bits/unique_ptr.h:85
8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr() /usr/include/c++/11/bits/unique_ptr.h:361
9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc() /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance() :0
11 0x00000000005174d1 pybind11_object_dealloc() :0
12 0x0000000000169b93 _Py_CheckFunctionResult() ???:0
13 0x00000000001a2407 PyObject_DelItem() ???:0
14 0x0000000000181370 PyMapping_Check() ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall() ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
17 0x000000000018b66c _PyFunction_Vectorcall() ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
19 0x000000000018b66c _PyFunction_Vectorcall() ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
21 0x000000000018b66c _PyFunction_Vectorcall() ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
23 0x000000000018b66c _PyFunction_Vectorcall() ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault() ???:0
25 0x000000000018b66c _PyFunction_Vectorcall() ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault() ???:0
27 0x0000000000259f56 PyEval_EvalCode() ???:0
28 0x0000000000259e26 PyEval_EvalCode() ???:0
29 0x0000000000280808 PyUnicode_Tailmatch() ???:0
30 0x000000000027b00f PyInit__collections() ???:0
31 0x0000000000274d91 PyRun_StringFlags() ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags() ???:0
33 0x0000000000273f70 Py_RunMain() ???:0
34 0x000000000024de6d Py_BytesMain() ???:0
35 0x0000000000029d90 __libc_init_first() ???:0
36 0x0000000000029e40 __libc_start_main() ???:0
37 0x000000000024dd65 _start() ???:0
=================================
==== backtrace (tid: 87867) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000019d57 ibv_dealloc_pd() ???:0
2 0x000000000000ce6d nvshmemt_ibrc_finalize() :0
3 0x0000000000220ab2 nvshmemi_transport_finalize() ???:0
4 0x00000000000b49f9 nvshmemid_hostlib_finalize() ???:0
5 0x00000000001b301f nvshmemi_finalize() ???:0
6 0x0000000000055252 deep_ep::Buffer::~Buffer() /mnt/yscfs/linjunxian/DeepEP/csrc/deep_ep.cpp:106
7 0x0000000000068f86 std::default_delete<deep_ep::Buffer>::operator()() /usr/include/c++/11/bits/unique_ptr.h:85
8 0x0000000000068f86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr() /usr/include/c++/11/bits/unique_ptr.h:361
9 0x0000000000068f86 pybind11::class_<deep_ep::Buffer>::dealloc() /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance() :0
11 0x00000000005174d1 pybind11_object_dealloc() :0
12 0x0000000000169b93 _Py_CheckFunctionResult() ???:0
13 0x00000000001a2407 PyObject_DelItem() ???:0
14 0x0000000000181370 PyMapping_Check() ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall() ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
17 0x000000000018b66c _PyFunction_Vectorcall() ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
19 0x000000000018b66c _PyFunction_Vectorcall() ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
21 0x000000000018b66c _PyFunction_Vectorcall() ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
23 0x000000000018b66c _PyFunction_Vectorcall() ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault() ???:0
25 0x000000000018b66c _PyFunction_Vectorcall() ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault() ???:0
27 0x0000000000259f56 PyEval_EvalCode() ???:0
28 0x0000000000259e26 PyEval_EvalCode() ???:0
29 0x0000000000280808 PyUnicode_Tailmatch() ???:0
30 0x000000000027b00f PyInit__collections() ???:0
31 0x0000000000274d91 PyRun_StringFlags() ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags() ???:0
33 0x0000000000273f70 Py_RunMain() ???:0
34 0x000000000024de6d Py_BytesMain() ???:0
35 0x0000000000029d90 __libc_init_first() ???:0
36 0x0000000000029e40 __libc_start_main() ???:0
37 0x000000000024dd65 _start() ???:0
=================================
W0514 11:36:50.744000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87862 via signal SIGTERM
W0514 11:36:50.744000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87863 via signal SIGTERM
W0514 11:36:50.744000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87864 via signal SIGTERM
W0514 11:36:50.745000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87865 via signal SIGTERM
W0514 11:36:50.745000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87866 via signal SIGTERM
W0514 11:36:50.745000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87867 via signal SIGTERM
W0514 11:36:50.745000 87797 torch/multiprocessing/spawn.py:169] Terminating process 87869 via signal SIGTERM
Traceback (most recent call last):
File "/mnt/yscfs/linjunxian/DeepEP/tests/test_internode.py", line 247, in <module>
torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 340, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
while not context.join():
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 196, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 6 terminated with signal SIGSEGV
If I remove NVSHMEM_ENABLE_NIC_PE_MAPPING and NVSHMEM_HCA_PE_MAPPING, 2 nodes (8*H20 each) pass test_internode.py (they also pass without the change to nvshmemi_setup_connections),
but 4 nodes still time out.
You should configure:
# node 0
NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:1,mlx5_bond_1:1:1,mlx5_bond_2:1:1,mlx5_bond_3:1:1,mlx5_bond_4:1:1,mlx5_bond_5:1:1,mlx5_bond_6:1:1,mlx5_bond_7:1:1" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=0 python test_internode.py
# node 1
NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:1,mlx5_bond_1:1:1,mlx5_bond_2:1:1,mlx5_bond_3:1:1,mlx5_bond_4:1:1,mlx5_bond_5:1:1,mlx5_bond_6:1:1,mlx5_bond_7:1:1" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=1 python test_internode.py
# node 2
NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:1,mlx5_bond_1:1:1,mlx5_bond_2:1:1,mlx5_bond_3:1:1,mlx5_bond_4:1:1,mlx5_bond_5:1:1,mlx5_bond_6:1:1,mlx5_bond_7:1:1" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=2 python test_internode.py
# node 3
NVSHMEM_ENABLE_NIC_PE_MAPPING=1 NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:1,mlx5_bond_1:1:1,mlx5_bond_2:1:1,mlx5_bond_3:1:1,mlx5_bond_4:1:1,mlx5_bond_5:1:1,mlx5_bond_6:1:1,mlx5_bond_7:1:1" NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 MASTER_ADDR=<ip> WORLD_SIZE=4 RANK=3 python test_internode.py
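For anyone building this string by hand, here is a minimal Python sketch that generates the NVSHMEM_HCA_PE_MAPPING value and checks it. It assumes each entry has the form <hca_name>:<port>:<number of PEs>, which is my reading of the NVSHMEM documentation, so that 8 NICs at one PE each cover the 8 GPUs of a node. The helper names `build_hca_pe_mapping` and `total_pes` are hypothetical and not part of NVSHMEM or DeepEP. Under the same assumption, the earlier `:1:2` setting would describe 16 PEs per node; whether that mismatch explains the crash above is not confirmed in this thread.

```python
def build_hca_pe_mapping(hca_names, port=1, pes_per_hca=1):
    """Join entries of the assumed form <hca_name>:<port>:<number of PEs>."""
    return ",".join(f"{name}:{port}:{pes_per_hca}" for name in hca_names)


def total_pes(mapping):
    """Sum the last field (assumed to be the PE count) over all entries."""
    total = 0
    for entry in mapping.split(","):
        # rsplit keeps any ':' inside the device name intact.
        _name, _port, count = entry.rsplit(":", 2)
        total += int(count)
    return total


# Device names taken from the commands above; 8 GPUs per node as in this thread.
hcas = [f"mlx5_bond_{i}" for i in range(8)]
local_gpus = 8

mapping = build_hca_pe_mapping(hcas, port=1, pes_per_hca=1)
if total_pes(mapping) != local_gpus:
    raise ValueError(
        f"mapping covers {total_pes(mapping)} PEs but the node has {local_gpus} GPUs"
    )

# Paste the printed assignment in front of the launch command.
print(f'NVSHMEM_HCA_PE_MAPPING="{mapping}"')
```

Running the same check on the previous `:1:2` string would raise the ValueError, which is a cheap sanity test before launching on all nodes.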