
test_inter_node.py failed: an illegal memory access was encountered

josephydu opened this issue 8 months ago · 4 comments

Description:

I am testing DeepEP on 2 nodes with 8 H20 GPUs each. test_intranode.py passes, but test_internode.py fails. Could you please help me with that?
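For context, this is how the internode test is typically launched across the two nodes (the master address below is illustrative, not the one actually used; adjust the NIC interface name for your cluster):

```shell
# Illustrative two-node launch of DeepEP's internode test.
# Both nodes must agree on MASTER_ADDR/MASTER_PORT; only RANK differs.
#
# On node 0:
#   NCCL_SOCKET_IFNAME=bond1 MASTER_ADDR=<node0-ip> MASTER_PORT=4567 \
#   WORLD_SIZE=2 RANK=0 python3 tests/test_internode.py
#
# On node 1:
#   NCCL_SOCKET_IFNAME=bond1 MASTER_ADDR=<node0-ip> MASTER_PORT=4567 \
#   WORLD_SIZE=2 RANK=1 python3 tests/test_internode.py
```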

Error output:

command: NCCL_SOCKET_IFNAME=bond1 WORLD_SIZE=2 MASTER_PORT=4567 RANK=0 python3 tests/test_internode.py

output:
[TENCENT64:5489 :0:5489] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfeffffff95)
/sgl-workspace/nvshmem/src/modules/transport/ibrc/ibrc.cpp:304: NULL value cq creation failed 

/sgl-workspace/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1395: non-zero status: 7 ep_create failed

[TENCENT64:5490 :0:5490] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x38)
[TENCENT64:5493 :0:5493] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10102464c450f)
/sgl-workspace/nvshmem/src/modules/transport/ibrc/ibrc.cpp:304: NULL value cq creation failed 

/sgl-workspace/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1395: non-zero status: 7 ep_create failed

[TENCENT64:5491 :0:5491] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x38)
[TENCENT64:5492 :0:5492] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2000000000008)
[TENCENT64:5487 :0:5487] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2000000000008)
/sgl-workspace/nvshmem/src/modules/transport/ibrc/ibrc.cpp:304: NULL value cq creation failed 

/sgl-workspace/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1395: non-zero status: 7 ep_create failed

[TENCENT64:5486 :0:5486] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x38)
W0416 12:19:50.088000 5421 torch/multiprocessing/spawn.py:160] Terminating process 5486 via signal SIGTERM
W0416 12:19:50.088000 5421 torch/multiprocessing/spawn.py:160] Terminating process 5487 via signal SIGTERM
W0416 12:19:50.088000 5421 torch/multiprocessing/spawn.py:160] Terminating process 5488 via signal SIGTERM
W0416 12:19:50.089000 5421 torch/multiprocessing/spawn.py:160] Terminating process 5489 via signal SIGTERM
W0416 12:19:50.089000 5421 torch/multiprocessing/spawn.py:160] Terminating process 5491 via signal SIGTERM
W0416 12:19:50.090000 5421 torch/multiprocessing/spawn.py:160] Terminating process 5492 via signal SIGTERM
W0416 12:19:50.090000 5421 torch/multiprocessing/spawn.py:160] Terminating process 5493 via signal SIGTERM
Traceback (most recent call last):
  File "/sgl-workspace/DeepEP/tests/test_internode.py", line 244, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 184, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 4 terminated with signal SIGSEGV
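The `cq creation failed` / `ep_create failed` messages above come from NVSHMEM's ibrc transport while it sets up InfiniBand resources. A few read-only checks that are often useful when ibrc fails this way (assuming the rdma-core tools may or may not be installed; these commands only report state and change nothing):

```shell
# Locked-memory limit: RDMA verbs need a high (usually unlimited) memlock limit.
ulimit -l

# Are the IB character devices visible (e.g. inside a container)?
ls /dev/infiniband 2>/dev/null || echo "no /dev/infiniband (IB devices not visible here)"

# Basic HCA state, if rdma-core / infiniband-diags is installed.
command -v ibv_devinfo >/dev/null && ibv_devinfo | head -n 20 \
    || echo "ibv_devinfo not found (install rdma-core)"
```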

Environment:

  • nvidia-smi
Wed Apr 16 12:38:56 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H20                     On  | 00000000:03:00.0 Off |                    0 |
| N/A   38C    P0             124W / 500W |  83026MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H20                     On  | 00000000:16:00.0 Off |                    0 |
| N/A   38C    P0             123W / 500W |  83026MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H20                     On  | 00000000:1C:00.0 Off |                    0 |
| N/A   33C    P0             113W / 500W |  83026MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H20                     On  | 00000000:2E:00.0 Off |                    0 |
| N/A   33C    P0             115W / 500W |  83026MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H20                     On  | 00000000:84:00.0 Off |                    0 |
| N/A   37C    P0             121W / 500W |  83026MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H20                     On  | 00000000:9C:00.0 Off |                    0 |
| N/A   39C    P0             122W / 500W |  83026MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H20                     On  | 00000000:B6:00.0 Off |                    0 |
| N/A   32C    P0             117W / 500W |  83026MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H20                     On  | 00000000:BB:00.0 Off |                    0 |
| N/A   33C    P0             119W / 500W |  83026MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
  • nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
  • nvshmem-info -a
NVSHMEM v3.2.5

Build Information:
 CUDA API                     12040
 CUDA Driver                  12040
 Build Timestamp              Apr 16 2025 06:59:41
 Build Variables             
	NVSHMEM_DEBUG=OFF NVSHMEM_DEVEL=OFF NVSHMEM_DEFAULT_PMI2=OFF
	NVSHMEM_DEFAULT_PMIX=OFF NVSHMEM_DEFAULT_UCX=OFF
	NVSHMEM_ENABLE_ALL_DEVICE_INLINING=OFF NVSHMEM_GPU_COLL_USE_LDST=OFF
	NVSHMEM_IBGDA_SUPPORT=ON NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=OFF
	NVSHMEM_IBDEVX_SUPPORT=OFF NVSHMEM_IBRC_SUPPORT=ON
	NVSHMEM_MPI_SUPPORT=OFF NVSHMEM_NVTX=ON NVSHMEM_PMIX_SUPPORT=OFF
	NVSHMEM_SHMEM_SUPPORT=OFF NVSHMEM_TEST_STATIC_LIB=OFF
	NVSHMEM_TIMEOUT_DEVICE_POLLING=OFF NVSHMEM_TRACE=OFF NVSHMEM_UCX_SUPPORT=OFF
	NVSHMEM_USE_DLMALLOC=OFF NVSHMEM_USE_NCCL=OFF NVSHMEM_USE_GDRCOPY=ON
	NVSHMEM_VERBOSE=OFF CUDA_HOME=/usr/local/cuda
	LIBFABRIC_HOME=/usr/local/libfabric MPI_HOME=/usr/local/ompi
	NCCL_HOME=/usr/local/nccl NVSHMEM_PREFIX=/usr/local/nvshmem PMIX_HOME=/usr
	SHMEM_HOME=/usr/local/ompi UCX_HOME=/usr/local/ucx

Standard options:
 NVSHMEM_VERSION              false (type: bool, default: false)
	Print library version at startup
 NVSHMEM_INFO                 false (type: bool, default: false)
	Print environment variable options at startup
 NVSHMEM_DISABLE_NVLS         false (type: bool, default: false)
	Disable NVLS SHARP resources for collectives, even if available for platform
 NVSHMEM_SYMMETRIC_SIZE       1073741824 (type: size, default: 1073741824)
	Specifies the size (in bytes) of the symmetric heap memory per PE. The
	size is implementation-defined and must be at least as large as the integer
	ceiling of the product of the numeric prefix and the scaling factor. The
	character suffixes for the scaling factor are as follows:
	
	  *  k or K multiplies by 2^10 (kibibytes)
	  *  m or M multiplies by 2^20 (mebibytes)
	  *  g or G multiplies by 2^30 (gibibytes)
	  *  t or T multiplies by 2^40 (tebibytes)
	
	For example, string '20m' is equivalent to the integer value 20971520, or 20
	mebibytes. Similarly the string '3.1M' is equivalent to the integer value
	3250586. Only one multiplier is recognized and any characters following the
	multiplier are ignored, so '20kk' will not produce the same result as '20m'.
	Usage of string '.5m' will yield the same result as the string '0.5m'.
	An invalid value for NVSHMEM_SYMMETRIC_SIZE is an error, which the NVSHMEM
	library shall report by either returning a nonzero value from
	nvshmem_init_thread or causing program termination.
 NVSHMEM_DEBUG                "" (type: string, default: "")
	Set to enable debugging messages.
	Optional values: VERSION, WARN, INFO, ABORT, TRACE

Bootstrap options:
 NVSHMEM_BOOTSTRAP            "PMI" (type: string, default: "PMI")
	Name of the default bootstrap that should be used to initialize NVSHMEM.
	Allowed values: PMI, MPI, SHMEM, plugin, UID
 NVSHMEM_BOOTSTRAP_PMI        "PMI" (type: string, default: "PMI")
	Name of the PMI bootstrap that should be used to initialize NVSHMEM.
	Allowed values: PMI, PMI-2, PMIX
 NVSHMEM_BOOTSTRAP_PLUGIN     "" (type: string, default: "")
	Absolute path to or name of the bootstrap plugin file to load when
	NVSHMEM_BOOTSTRAP=plugin is specified
 NVSHMEM_BOOTSTRAP_MPI_PLUGIN "nvshmem_bootstrap_mpi.so.3" (type: string, default: "nvshmem_bootstrap_mpi.so.3")
	Absolute path to or name of the MPI bootstrap plugin file. 
	NVSHMEM will search for the plugin based on linux linker priorities. See man
	dlopen
 NVSHMEM_BOOTSTRAP_SHMEM_PLUGIN "nvshmem_bootstrap_shmem.so.3" (type: string, default: "nvshmem_bootstrap_shmem.so.3")
	Absolute path to or name of the SHMEM bootstrap plugin file. 
	NVSHMEM will search for the plugin based on linux linker priorities. See man
	dlopen
 NVSHMEM_BOOTSTRAP_PMI_PLUGIN "nvshmem_bootstrap_pmi.so.3" (type: string, default: "nvshmem_bootstrap_pmi.so.3")
	Absolute path to or name of the PMI bootstrap plugin file. 
	NVSHMEM will search for the plugin based on linux linker priorities. See man
	dlopen
 NVSHMEM_BOOTSTRAP_PMI2_PLUGIN "nvshmem_bootstrap_pmi2.so.3" (type: string, default: "nvshmem_bootstrap_pmi2.so.3")
	Absolute path to or name of the PMI-2 bootstrap plugin file. 
	NVSHMEM will search for the plugin based on linux linker priorities. See man
	dlopen
 NVSHMEM_BOOTSTRAP_PMIX_PLUGIN "nvshmem_bootstrap_pmix.so.3" (type: string, default: "nvshmem_bootstrap_pmix.so.3")
	Absolute path to or name of the PMIx bootstrap plugin file. 
	NVSHMEM will search for the plugin based on linux linker priorities. See man
	dlopen
 NVSHMEM_BOOTSTRAP_UID_PLUGIN "nvshmem_bootstrap_uid.so.3" (type: string, default: "nvshmem_bootstrap_uid.so.3")
	Absolute path to or name of the UID bootstrap plugin file. 
	NVSHMEM will search for the plugin based on linux linker priorities. See man
	dlopen

Additional options:
 NVSHMEM_CUDA_PATH            "" (type: string, default: "")
	Path to directory containing libcuda.so (for use when not in default location)
 NVSHMEM_DEBUG_ATTACH_DELAY   0 (type: int, default: 0)
	Delay (in seconds) during the first call to NVSHMEM_INIT to allow for attaching
	a debuggger (Default 0)
 NVSHMEM_DEBUG_FILE           "" (type: string, default: "")
	Debugging output filename, may contain %h for hostname and %p for pid
 NVSHMEM_MAX_TEAMS            32 (type: long, default: 32)
	Maximum number of simultaneous teams allowed
 NVSHMEM_MAX_MEMORY_PER_GPU   137438953472 (type: size, default: 137438953472)
	Maximum memory per GPU
 NVSHMEM_DISABLE_CUDA_VMM     false (type: bool, default: false)
	Disable use of CUDA VMM for P2P memory mapping. By default, CUDA VMM is enabled
	on x86 and disabled on P9. CUDA VMM feature in NVSHMEM requires CUDA RT version
	and CUDA Driver version to be greater than or equal to 11.3.
 NVSHMEM_DISABLE_P2P          false (type: bool, default: false)
	Disable P2P connectivity of GPUs even when available
 NVSHMEM_IGNORE_CUDA_MPS_ACTIVE_THREAD_PERCENTAGE false (type: bool, default: false)
	When doing Multi-Process Per GPU (MPG) run, full API support is available only
	if sum of CUDA_MPS_ACTIVE_THREAD_PERCENTAGE of processes running on a GPU is <=
	100%. Through this variable, user can request NVSHMEM runtime to ignore the
	active thread percentage and allow full MPG support. Users enable it at their
	own risk as NVSHMEM might deadlock.
 NVSHMEM_CUMEM_GRANULARITY    536870912 (type: size, default: 536870912)
	Granularity for cuMemAlloc/cuMemCreate
 NVSHMEM_PROXY_REQUEST_BATCH_MAX 32 (type: int, default: 32)
	Maxmum number of requests that the proxy thread processes in a single iteration
	of the progress loop.

Collectives options:
 NVSHMEM_DISABLE_NCCL         false (type: bool, default: false)
	Disable use of NCCL for collective operations
 NVSHMEM_BARRIER_DISSEM_KVAL  2 (type: int, default: 2)
	Radix of the dissemination algorithm used for barriers
 NVSHMEM_BARRIER_TG_DISSEM_KVAL 2 (type: int, default: 2)
	Radix of the dissemination algorithm used for thread group barriers
 NVSHMEM_FCOLLECT_LL_THRESHOLD 2048 (type: size, default: 2048)
	Message size threshold up to which fcollect LL algo will be used
	
 NVSHMEM_REDUCE_SCRATCH_SIZE  524288 (type: size, default: 524288)
	Amount of symmetric heap memory (minimum 16B, multiple of 8B) reserved by
	runtime for every team to implement reduce and reducescatter collectives
	
 NVSHMEM_BCAST_ALGO           0 (type: int, default: 0)
	Broadcast algorithm to be used.
	  * 0 - use default algorithm selection strategy
	
 NVSHMEM_REDMAXLOC_ALGO       1 (type: int, default: 1)
	Reduction algorithm to be used for MAXLOC operation.
	  * 1 - default, flag alltoall algorithm
	  * 2 - flat reduce + flat bcast
	  * 3 - topo-aware two-level reduce + topo-aware bcast
	

Transport options:
 NVSHMEM_REMOTE_TRANSPORT     "ibrc" (type: string, default: "ibrc")
	Selected transport for remote operations: ibrc, ucx, libfabric, ibdevx, none
 NVSHMEM_ENABLE_NIC_PE_MAPPING false (type: bool, default: false)
	When not set or set to 0, a PE is assigned the NIC on the node that is closest
	to it by distance. When set to 1, NVSHMEM either assigns NICs to PEs on a
	round-robin basis or uses NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST when they
	are specified.
 NVSHMEM_DISABLE_LOCAL_ONLY_PROXY false (type: bool, default: false)
	When running on an NVLink-only configuaration (No-IB, No-UCX), completely
	disable the proxy thread. This will disable device side global exit and device
	side wait timeout polling (enabled by NVSHMEM_TIMEOUT_DEVICE_POLLING build-time
	variable) because these are processed by the proxy thread.
 NVSHMEM_IB_ENABLE_IBGDA      false (type: bool, default: false)
	Set to enable GPU-initiated communication transport.

Hidden options:
 NVSHMEM_INFO_HIDDEN          true (type: bool, default: false)
	Print hidden environment variable options at startup
 NVSHMEM_DISABLE_NVLS_SHARING true (type: bool, default: true)
	Disable NVLS SHARP resource sharing for user-defined teams
 NVSHMEM_HEAP_KIND            "DEVICE" (type: string, default: "DEVICE")
	Specify the memory kind used by the NVSHMEM symmetric heap.
	Allowed values: VIDMEM, SYSMEM
 NVSHMEM_ENABLE_RAIL_OPT      false (type: bool, default: false)
	Enable Rail Optimization when heap is in SYSMEM
 NVSHMEM_BOOTSTRAP_TWO_STAGE  false (type: bool, default: false)
	Ignore CUDA device setting during initialization,forcing two-stage
	initialization
 NVSHMEM_DEBUG_SUBSYS         "" (type: string, default: "")
	Comma separated list of debugging message sources. Prefix with '^' to exclude.
	Values: INIT, COLL, P2P, PROXY, TRANSPORT, MEM, BOOTSTRAP, TOPO, UTIL, ALL
 NVSHMEM_ENABLE_ERROR_CHECKS  false (type: bool, default: false)
	Enable error checks
 NVSHMEM_DISABLE_MNNVL        false (type: bool, default: false)
	Disable MNNVL connectivity for GPUs even when available
 NVSHMEM_CUMEM_HANDLE_TYPE    "FILE_DESCRIPTOR" (type: string, default: "FILE_DESCRIPTOR")
	Handle type for cuMemCreate. Supported are - FABRIC or FILE_DESCRIPTOR
 NVSHMEM_BYPASS_ACCESSIBILITY_CHECK false (type: bool, default: false)
	Bypass peer GPU accessbility checks
 NVSHMEM_FCOLLECT_NTHREADS    512 (type: int, default: 512)
	Sets number of threads per block for fcollect collective.
	By default, if no env is set, default value is min(max_occupancy per CTA, msg
	size per PE).
	If env is specified, value overrides the default irrespective of max occupancy
	per CTA
	
 NVSHMEM_REDUCESCATTER_NTHREADS 512 (type: int, default: 512)
	Sets number of threads per block for reducescatter collective.
	By default, if no env is set, default value is min(max_occupancy per CTA, msg
	size per PE).
	If env is specified, value overrides the default irrespective of max occupancy
	per CTA
	
 NVSHMEM_MAX_CTAS             1 (type: int, default: 1)
	Sets number of blocks per grid for host onstream collective.
	By default, if no env is set, default value to 1 CTA
	If env is specified, value overrides the default value
	
 NVSHMEM_REDUCE_RECEXCH_KVAL  2 (type: int, default: 2)
	Radix of the recursive exchange reduction algorithm
 NVSHMEM_FCOLLECT_LL128_THRESHOLD 0 (type: size, default: 0)
	Message size threshold up to which the fcollect LL128 algo will be used.
	LL128 will be used only when FCOLLECT_LL_THRESHOLD < size
 NVSHMEM_FCOLLECT_NVLS_THRESHOLD 16777216 (type: size, default: 16777216)
	Message size threshold up to which fcollect NVLS algo will be used
	
 NVSHMEM_REDUCESCATTER_NVLS_THRESHOLD 16777216 (type: size, default: 16777216)
	Message size threshold up to which reducescatter NVLS algo will be used
	
 NVSHMEM_BCAST_TREE_KVAL      2 (type: int, default: 2)
	Radix of the broadcast tree algorithm
 NVSHMEM_FCOLLECT_ALGO        0 (type: int, default: 0)
	Fcollect algorithm to be used. 
	  * 0 - use default algorithm selection strategy
	
 NVSHMEM_REDUCE_ALGO          0 (type: int, default: 0)
	Allreduce algorithm to be used. 
	   * 0/1 - use default algorithm selection strategy
	
 NVSHMEM_REDUCE_NVLS_THRESHOLD 2048 (type: int, default: 2048)
	Message size threshold up to which allreduce one-shot algo will be used
	
 NVSHMEM_REDUCESCATTER_ALGO   0 (type: int, default: 0)
	Reduce Scatter algorithm to be used. 
	  * 0 - use default algorithm selection strategy
	
 NVSHMEM_ASSERT_ATOMICS_SYNC  false (type: bool, default: false)
	Bypass flush on wait_until at target
 NVSHMEM_BYPASS_FLUSH         false (type: bool, default: false)
	Bypass flush in proxy when enforcing consistency

NVTX options:
 NVSHMEM_NVTX                 "off" (type: string, default: "off")
	Set to enable NVTX instrumentation. Accepts a comma separated list of
	instrumentation groups. By default the NVTX instrumentation is disabled.
	  init                : library setup
	  alloc               : memory management
	  launch              : kernel launch routines
	  coll                : collective communications
	  wait                : blocking point-to-point synchronization
	  wait_on_stream      : point-to-point synchronization (on stream)
	  test                : non-blocking point-to-point synchronization
	  memorder            : memory ordering (quiet, fence)
	  quiet_on_stream     : nvshmemx_quiet_on_stream
	  atomic_fetch        : fetching atomic memory operations
	  atomic_set          : non-fetchong atomic memory operations
	  rma_blocking        : blocking remote memory access operations
	  rma_nonblocking     : non-blocking remote memory access operations
	  proxy               : activity of the proxy thread
	  common              : init,alloc,launch,coll,memorder,wait,atomic_fetch,rma_blocking,proxy
	  all                 : all groups
	  off                 : disable all NVTX instrumentation

josephydu avatar Apr 16 '25 12:04 josephydu

Can you check if nvidia-peermem is correctly installed?
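A quick way to check, assuming the standard module name (this only reports state; loading the module requires root):

```shell
# Check whether the nvidia-peermem kernel module is loaded. It enables
# GPUDirect RDMA between the NIC and GPU memory, which the ibrc transport
# relies on for internode transfers.
if lsmod 2>/dev/null | grep -q nvidia_peermem; then
    echo "nvidia-peermem is loaded"
else
    echo "nvidia-peermem is NOT loaded; try: sudo modprobe nvidia-peermem"
fi
```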

sphish avatar Apr 22 '25 03:04 sphish

same error, any solution?

MarsMeng1994 avatar Apr 28 '25 10:04 MarsMeng1994

same error, any solution?

grglym avatar May 08 '25 11:05 grglym