
Crash when NVSHMEM_IBGDA_ENABLE_MULTI_PORT is set to 1 on a dual-port RNIC

Open yanminjia opened this issue 6 months ago • 1 comments

When I ran test_internode.py on a dual-port RNIC with the environment variable NVSHMEM_IBGDA_ENABLE_MULTI_PORT set to 1, DeepEP crashed while creating the RDMA team through nvshmem_team_split_strided(...) in the following code snippet (runtime.cu:init(...)):

    if (low_latency_mode and num_ranks > NUM_MAX_NVL_PEERS) {
        EP_HOST_ASSERT(cpu_rdma_team == NVSHMEM_TEAM_INVALID);
        EP_HOST_ASSERT(num_ranks % NUM_MAX_NVL_PEERS == 0);
        EP_HOST_ASSERT(nvshmem_team_split_strided(NVSHMEM_TEAM_WORLD, rank % NUM_MAX_NVL_PEERS, NUM_MAX_NVL_PEERS,
                                                  num_ranks / NUM_MAX_NVL_PEERS, &cpu_rdma_team_config, 0, &cpu_rdma_team) == 0);
        EP_HOST_ASSERT(cpu_rdma_team != NVSHMEM_TEAM_INVALID);
    }
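
For reference, the strided split in the snippet above groups each rank with the ranks that share its value of rank % NUM_MAX_NVL_PEERS. Below is a minimal host-only sketch of that arithmetic. It does not call NVSHMEM, the helper names are hypothetical (they are not part of DeepEP or NVSHMEM), and the rank counts used with it are illustrative rather than taken from the failing run.

```cpp
#include <vector>

// Hypothetical helper (not a DeepEP or NVSHMEM function): lists the world
// ranks selected by a strided split with the given start, stride and size.
std::vector<int> strided_team_members(int start, int stride, int size) {
    std::vector<int> members;
    members.reserve(size);
    for (int i = 0; i < size; ++i) {
        members.push_back(start + i * stride);
    }
    return members;
}

// Mirrors the start/stride/size arguments passed to
// nvshmem_team_split_strided(...) in the snippet above:
//   start  = rank % num_max_nvl_peers
//   stride = num_max_nvl_peers
//   size   = num_ranks / num_max_nvl_peers
std::vector<int> rdma_team_members(int rank, int num_ranks, int num_max_nvl_peers) {
    return strided_team_members(rank % num_max_nvl_peers,
                                num_max_nvl_peers,
                                num_ranks / num_max_nvl_peers);
}
```

For example, assuming 16 ranks and 8 NVLink peers per node, rdma_team_members(3, 16, 8) yields ranks 3 and 11, so every RDMA team would contain one rank per node.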

It printed the following error message:

[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered

Any suggestions would be highly appreciated. Many thanks.

yanminjia, Jun 10 '25 06:06

We haven’t used this environment variable before and are not familiar with how it works. I think you should consult the NVSHMEM team for more information.

sphish, Jun 10 '25 10:06