
Crash when test DeepEP over 16 H100 Servers

Open · yanminjia opened this issue 8 months ago

When we run the DeepEP test (test_internode.py) across 16 H100 servers, the dispatch phase finishes successfully, but DeepEP crashes in the combine phase, apparently hitting an assertion. Any clue would be highly appreciated. Thanks.

File "/usr/local/lib/python3.10/dist-packages/deep_ep-1.0.0+84d3d6f-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 426, in internode_combine
    combined_x, combined_topk_weights, event = self.runtime.internode_combine(
RuntimeError: Failed: Assertion error /workspace/DeepEP_splitChannel/csrc/kernels/internode.cu:1713 'num_max_nvl_chunked_recv_tokens / num_rdma_ranks > std::max(num_max_rdma_chunked_send_tokens, num_max_nvl_chunked_send_tokens)'

The detailed error message is as follows:

[tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 29.99 GB/s (RDMA), 57.96 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 31.43 GB/s (RDMA), 60.75 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 32.22 GB/s (RDMA), 62.27 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 32.74 GB/s (RDMA), 63.27 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 29.50 GB/s (RDMA), 57.00 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 32.32 GB/s (RDMA), 62.46 GB/s (NVL) 
su14-gpu14:61577:61577 [2] NVSHMEM INFO [61577] in nvshmem_finalize:
su14-gpu14:61578:61578 [3] NVSHMEM INFO [61578] in nvshmem_finalize:
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 9 nranks 16 tag 0 - ENTER
su14-gpu14:61575:61575 [0] NVSHMEM INFO [61575] in nvshmem_finalize:
su14-gpu14:61582:61582 [7] NVSHMEM INFO [61582] in nvshmem_finalize:
su14-gpu14:61581:61581 [6] NVSHMEM INFO [61581] in nvshmem_finalize:
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 9 nranks 16 tag 0 - ENTER
su14-gpu14:61580:61580 [5] NVSHMEM INFO [61580] in nvshmem_finalize:
su14-gpu14:61579:61579 [4] NVSHMEM INFO [61579] in nvshmem_finalize:
su14-gpu14:61576:61576 [1] NVSHMEM INFO [61576] in nvshmem_finalize:
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 9 nranks 16 tag 0 - ENTER
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 9 nranks 16 tag 0 - ENTER
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 9 nranks 16 tag 0 - ENTER
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 9 nranks 16 tag 0 - ENTER
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 9 nranks 16 tag 4 - DONE
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 9 nranks 16 tag 0 - ENTER
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 9 nranks 16 tag 0 - ENTER
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 9 nranks 16 tag 4 - DONE
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 9 nranks 16 tag 4 - DONE
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 9 nranks 16 tag 4 - DONE
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 9 nranks 16 tag 4 - DONE
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 9 nranks 16 tag 4 - DONE
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 9 nranks 16 tag 4 - DONE
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 9 nranks 16 tag 4 - DONE
su14-gpu14:61577:61577 [2] NVSHMEM INFO In nvshmemi_proxy_finalize
su14-gpu14:61578:61578 [3] NVSHMEM INFO In nvshmemi_proxy_finalize
su14-gpu14:61582:61582 [7] NVSHMEM INFO In nvshmemi_proxy_finalize
su14-gpu14:61575:61575 [0] NVSHMEM INFO In nvshmemi_proxy_finalize
su14-gpu14:61581:61581 [6] NVSHMEM INFO In nvshmemi_proxy_finalize
su14-gpu14:61577:61577 [2] NVSHMEM INFO In nvshmemi_teardown_handles
su14-gpu14:61579:61579 [4] NVSHMEM INFO In nvshmemi_proxy_finalize
su14-gpu14:61576:61576 [1] NVSHMEM INFO In nvshmemi_proxy_finalize
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x564a0f9a0770 handle->mr 0x564a0b232f80
su14-gpu14:61580:61580 [5] NVSHMEM INFO In nvshmemi_proxy_finalize
su14-gpu14:61578:61578 [3] NVSHMEM INFO In nvshmemi_teardown_handles
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x55fa92e76660 handle->mr 0x55fa8ed46f90
su14-gpu14:61582:61582 [7] NVSHMEM INFO In nvshmemi_teardown_handles
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x5642ba151660 handle->mr 0x5642b6592db0
su14-gpu14:61575:61575 [0] NVSHMEM INFO In nvshmemi_teardown_handles
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x556865f58770 handle->mr 0x556861a3fec0
su14-gpu14:61581:61581 [6] NVSHMEM INFO In nvshmemi_teardown_handles
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x55db3d3a9770 handle->mr 0x55db38c3bf80
su14-gpu14:61579:61579 [4] NVSHMEM INFO In nvshmemi_teardown_handles
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x560dec32e770 handle->mr 0x560de806af60
su14-gpu14:61580:61580 [5] NVSHMEM INFO In nvshmemi_teardown_handles
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x55e05c7cb770 handle->mr 0x55e0585ced70
su14-gpu14:61576:61576 [1] NVSHMEM INFO In nvshmemi_teardown_handles
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x561437a8c770 handle->mr 0x5614333e5f90
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x564a0f9a4780 handle->mr 0x564a0b9f8fb0
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x5642ba155670 handle->mr 0x5642b68aefb0
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x564a0f9a8790 handle->mr 0x564a08d44600
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x5642ba159680 handle->mr 0x5642b34f42f0
su14-gpu14:61582:61582 [7] NVSHMEM INFO In nvshmemi_transport_finalize
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x55e05c7cf780 handle->mr 0x55e0588eafb0
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x55fa92e7a670 handle->mr 0x55fa8f50cfb0
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x55e05c7d3790 handle->mr 0x55e055b6f790
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x556865f5c780 handle->mr 0x556861ee9fb0
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x55db3d3ad780 handle->mr 0x55db39401fb0
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x55fa92e7e680 handle->mr 0x55fa8c2194b0
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x556865f60790 handle->mr 0x55685f2fc250
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x55db3d3b1790 handle->mr 0x55db3674d7f0
su14-gpu14:61577:61577 [2] NVSHMEM INFO In nvshmemi_transport_finalize
su14-gpu14:61581:61581 [6] NVSHMEM INFO In nvshmemi_transport_finalize
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x560dec332780 handle->mr 0x560de844dfb0
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x560dec336790 handle->mr 0x560de56d2900
su14-gpu14:61578:61578 [3] NVSHMEM INFO In nvshmemi_transport_finalize
su14-gpu14:61575:61575 [0] NVSHMEM INFO In nvshmemi_transport_finalize
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x561437a90780 handle->mr 0x561433babfb0
/workspace/nvshmem_3.2.5_single_ports_multiQP/src/modules/transport/common/transport_ib_common.cpp 117 ibv_dereg_mr handle 0x561437a94790 handle->mr 0x561430e30ba0
su14-gpu14:61579:61579 [4] NVSHMEM INFO In nvshmemi_transport_finalize
su14-gpu14:61580:61580 [5] NVSHMEM INFO In nvshmemi_transport_finalize
su14-gpu14:61576:61576 [1] NVSHMEM INFO In nvshmemi_transport_finalize
[rank72]:[W408 12:42:38.879825368 ProcessGroupNCCL.cpp:1262] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W0408 12:42:39.285000 61509 torch/multiprocessing/spawn.py:160] Terminating process 61575 via signal SIGTERM
W0408 12:42:39.286000 61509 torch/multiprocessing/spawn.py:160] Terminating process 61576 via signal SIGTERM
W0408 12:42:39.286000 61509 torch/multiprocessing/spawn.py:160] Terminating process 61577 via signal SIGTERM
W0408 12:42:39.287000 61509 torch/multiprocessing/spawn.py:160] Terminating process 61578 via signal SIGTERM
W0408 12:42:39.287000 61509 torch/multiprocessing/spawn.py:160] Terminating process 61579 via signal SIGTERM
W0408 12:42:39.287000 61509 torch/multiprocessing/spawn.py:160] Terminating process 61581 via signal SIGTERM
W0408 12:42:39.287000 61509 torch/multiprocessing/spawn.py:160] Terminating process 61582 via signal SIGTERM
Traceback (most recent call last):
  File "/workspace/DeepEP_splitChannel/tests/test_internode.py", line 247, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 203, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/workspace/DeepEP_splitChannel/tests/test_internode.py", line 235, in test_loop
    test_main(i, local_rank, num_local_ranks, num_ranks, num_nodes, rank, buffer, group)
  File "/workspace/DeepEP_splitChannel/tests/test_internode.py", line 210, in test_main
    t = bench(lambda: buffer.combine(**tune_args))[0]
  File "/workspace/DeepEP_splitChannel/tests/utils.py", line 81, in bench
    fn()
  File "/workspace/DeepEP_splitChannel/tests/test_internode.py", line 210, in <lambda>
    t = bench(lambda: buffer.combine(**tune_args))[0]
  File "/usr/local/lib/python3.10/dist-packages/deep_ep-1.0.0+84d3d6f-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 343, in combine
    return self.internode_combine(x, handle, topk_weights, config, previous_event, async_finish, allocate_on_comm_stream)
  File "/usr/local/lib/python3.10/dist-packages/deep_ep-1.0.0+84d3d6f-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 426, in internode_combine
    combined_x, combined_topk_weights, event = self.runtime.internode_combine(
RuntimeError: Failed: Assertion error /workspace/DeepEP_splitChannel/csrc/kernels/internode.cu:1713 'num_max_nvl_chunked_recv_tokens / num_rdma_ranks > std::max(num_max_rdma_chunked_send_tokens, num_max_nvl_chunked_send_tokens)'

yanminjia avatar Apr 09 '25 02:04 yanminjia

The config you passed is invalid. deep_ep_cpp.Config(num_sms, num_max_nvl_chunked_send_tokens, num_max_nvl_chunked_recv_tokens, num_max_rdma_chunked_send_tokens, num_max_rdma_chunked_recv_tokens) requires num_max_nvl_chunked_recv_tokens / num_rdma_ranks (i.e., total number of GPUs / 8) > max(num_max_rdma_chunked_send_tokens, num_max_nvl_chunked_send_tokens), which is exactly the assertion that fires at internode.cu:1713.

Changing your config so that it satisfies this constraint should work.
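
For reference, a minimal sketch of a config that satisfies that check for this 16-node (128-GPU) setup. The Config argument order is taken from the reply above; the specific token counts below are illustrative assumptions, not recommended defaults:

```python
# Minimal sketch (illustrative values, not recommended defaults).
# Setup: 16 nodes x 8 GPUs per node = 128 ranks, so num_rdma_ranks = 128 / 8 = 16.
import deep_ep_cpp  # module name as cited above; adjust to your install if it differs

num_rdma_ranks = 128 // 8                 # 16 RDMA ranks (one per node)
num_max_nvl_chunked_send_tokens = 8       # NVL chunk size being tuned
num_max_nvl_chunked_recv_tokens = 512     # must be large enough for the assertion below
num_max_rdma_chunked_send_tokens = 28     # largest RDMA chunk size you intend to try
num_max_rdma_chunked_recv_tokens = 128

# The check that fires at internode.cu:1713:
assert num_max_nvl_chunked_recv_tokens // num_rdma_ranks > \
       max(num_max_rdma_chunked_send_tokens, num_max_nvl_chunked_send_tokens)

config = deep_ep_cpp.Config(24,  # num_sms, as in the tuning log above
                            num_max_nvl_chunked_send_tokens,
                            num_max_nvl_chunked_recv_tokens,
                            num_max_rdma_chunked_send_tokens,
                            num_max_rdma_chunked_recv_tokens)
```

Note that the tuning loop in test_internode.py sweeps increasing RDMA chunk sizes (8, 12, 16, ... in the log above), so num_max_nvl_chunked_recv_tokens has to be sized against the largest chunk size the sweep will reach, multiplied by the number of RDMA ranks.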

LyricZhao avatar Apr 10 '25 01:04 LyricZhao

@LyricZhao could you please help with a similar assertion error over here: #237? Thanks :)

aahouzi avatar Jul 10 '25 07:07 aahouzi