Crash when NVSHMEM_IBGDA_ENABLE_MULTI_PORT is set to 1 on a dual-port RNIC
I ran test_internode.py on a dual-port RNIC setup with the environment variable NVSHMEM_IBGDA_ENABLE_MULTI_PORT set to 1. DeepEP crashed while creating the RDMA team, in the call to nvshmem_team_split_strided(...) in the following code snippet (runtime.cu:init(...)):
    if (low_latency_mode and num_ranks > NUM_MAX_NVL_PEERS) {
        EP_HOST_ASSERT(cpu_rdma_team == NVSHMEM_TEAM_INVALID);
        EP_HOST_ASSERT(num_ranks % NUM_MAX_NVL_PEERS == 0);
        EP_HOST_ASSERT(nvshmem_team_split_strided(NVSHMEM_TEAM_WORLD, rank % NUM_MAX_NVL_PEERS, NUM_MAX_NVL_PEERS,
                                                  num_ranks / NUM_MAX_NVL_PEERS, &cpu_rdma_team_config, 0, &cpu_rdma_team) == 0);
        EP_HOST_ASSERT(cpu_rdma_team != NVSHMEM_TEAM_INVALID);
    }
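For anyone reproducing this, it may help to see which ranks the split is supposed to group together. The sketch below only mirrors the arithmetic of the arguments passed to nvshmem_team_split_strided above (start = rank % NUM_MAX_NVL_PEERS, stride = NUM_MAX_NVL_PEERS, size = num_ranks / NUM_MAX_NVL_PEERS). The helper name rdma_team_members and the parameter num_nvl_peers are hypothetical and not part of DeepEP or NVSHMEM, and the code does not call NVSHMEM at all, so it says nothing about the cause of the crash.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a DeepEP or NVSHMEM function): lists the world
// ranks that a strided split with the arguments used in the snippet above
// would place in the calling rank's team.
std::vector<int> rdma_team_members(int rank, int num_ranks, int num_nvl_peers) {
    // Same precondition as the EP_HOST_ASSERT in the snippet.
    assert(num_nvl_peers > 0 && num_ranks % num_nvl_peers == 0);

    const int start = rank % num_nvl_peers;      // first member of the team
    const int stride = num_nvl_peers;            // distance between members
    const int size = num_ranks / num_nvl_peers;  // number of members

    std::vector<int> members;
    members.reserve(size);
    for (int i = 0; i < size; ++i) {
        members.push_back(start + i * stride);
    }
    return members;
}
```

Assuming 16 ranks and 8 ranks per node, rdma_team_members(9, 16, 8) yields ranks 1 and 9, i.e. the ranks with the same local index on each node.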
It printed the following error message:
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
Any suggestions would be highly appreciated. Many thanks.
We haven’t used this environment variable before and are not familiar with how it works. I think you should consult the NVSHMEM team for more information.