test_low_latency.py fails on H20 over RoCE
Issue Description
Running test_low_latency.py fails, while test_intranode.py and test_internode.py (across 2 H20 nodes) run normally.
I am not familiar with NVSHMEM; thanks a lot for your help.
Actual Result
root@iv-ydpnyxaxa85i3z3gisyp:~/DeepEP/tests# python3 test_low_latency.py
Allocating buffer size: 2116.2912 MB ...
/root/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp:1345: non-zero status: 121 Error in mlx5dv_devx_obj_modify for INIT2RTR_QP with syndrome 29e0df
/root/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp:3005: non-zero status: 7 ibgda_rc_init2rtr failed on RC #0.
WARN: Failed to initialize all selected devices. Perf may be limited.
/root/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/root/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 nvshmem setup connections failed
(the block above is emitted once per process; the output of the 8 processes is interleaved, and the failing RC index varies, e.g. RC #0, RC #1, RC #37)
[/root/nvshmem_src/src/host/init/init.cu:536] cuda failed with an illegal memory access was encountered
(repeated once per process, 8 times)
W0225 23:03:30.060000 246380 torch/multiprocessing/spawn.py:169] Terminating process 246446 via signal SIGTERM
W0225 23:03:30.060000 246380 torch/multiprocessing/spawn.py:169] Terminating process 246447 via signal SIGTERM
W0225 23:03:30.060000 246380 torch/multiprocessing/spawn.py:169] Terminating process 246448 via signal SIGTERM
W0225 23:03:30.061000 246380 torch/multiprocessing/spawn.py:169] Terminating process 246450 via signal SIGTERM
W0225 23:03:30.061000 246380 torch/multiprocessing/spawn.py:169] Terminating process 246451 via signal SIGTERM
W0225 23:03:30.061000 246380 torch/multiprocessing/spawn.py:169] Terminating process 246452 via signal SIGTERM
W0225 23:03:30.061000 246380 torch/multiprocessing/spawn.py:169] Terminating process 246453 via signal SIGTERM
Traceback (most recent call last):
  File "/root/DeepEP/tests/test_low_latency.py", line 160, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 204, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with exit code 255
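To get more detail out of the transport initialization, NVSHMEM's debug output can be enabled before the test spawns its workers. A minimal sketch; the variable names and values are taken from the nvshmem-info dump below:

```python
import os

# Enable NVSHMEM init/transport debug logging; both variables and their
# accepted values are listed in the nvshmem-info output below.
os.environ["NVSHMEM_DEBUG"] = "INFO"
os.environ["NVSHMEM_DEBUG_SUBSYS"] = "INIT,TRANSPORT"

# These must be set before deep_ep initializes NVSHMEM, e.g. at the top
# of tests/test_low_latency.py, or exported in the shell instead.
```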
Environment
- Ubuntu 24.04
Linux 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- H20 × 8
- BlueField-3 × 4; mlx5_0 is an OCP/VPC NIC.
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:0216:3eff:fe43:ac58
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:5e25:73ff:febc:a460
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: Ethernet
Infiniband device 'mlx5_2' port 1 status:
default gid: fe80:0000:0000:0000:5e25:73ff:fec1:0b24
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: Ethernet
Infiniband device 'mlx5_3' port 1 status:
default gid: fe80:0000:0000:0000:5e25:73ff:feb9:d7da
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: Ethernet
Infiniband device 'mlx5_4' port 1 status:
default gid: fe80:0000:0000:0000:5e25:73ff:fece:bef0
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: Ethernet
- CUDA 12.3
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
- nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H20 On | 00000000:65:02.0 Off | 0 |
| N/A 27C P0 74W / 500W | 0MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H20 On | 00000000:65:03.0 Off | 0 |
| N/A 30C P0 72W / 500W | 0MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H20 On | 00000000:67:02.0 Off | 0 |
| N/A 30C P0 71W / 500W | 0MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H20 On | 00000000:67:03.0 Off | 0 |
| N/A 28C P0 74W / 500W | 0MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H20 On | 00000000:69:02.0 Off | 0 |
| N/A 28C P0 70W / 500W | 0MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H20 On | 00000000:69:03.0 Off | 0 |
| N/A 30C P0 73W / 500W | 0MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H20 On | 00000000:6B:02.0 Off | 0 |
| N/A 30C P0 71W / 500W | 0MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H20 On | 00000000:6B:03.0 Off | 0 |
| N/A 27C P0 72W / 500W | 0MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX NODE SYS SYS 90-179 1 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX NODE SYS SYS 90-179 1 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS NODE PIX SYS SYS 90-179 1 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS NODE PIX SYS SYS 90-179 1 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS PIX NODE 0-89 0 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS PIX NODE 0-89 0 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS NODE PIX 0-89 0 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS NODE PIX 0-89 0 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS
NIC1 PIX PIX NODE NODE SYS SYS SYS SYS SYS X NODE SYS SYS
NIC2 NODE NODE PIX PIX SYS SYS SYS SYS SYS NODE X SYS SYS
NIC3 SYS SYS SYS SYS PIX PIX NODE NODE SYS SYS SYS X NODE
NIC4 SYS SYS SYS SYS NODE NODE PIX PIX SYS SYS SYS NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
- nvshmem-info -a
NVSHMEM v3.1.7
Build Information:
CUDA API 12030
CUDA Driver 12020
Build Timestamp Feb 25 2025 22:10:09
Build Variables
NVSHMEM_DEBUG=OFF NVSHMEM_DEVEL=OFF NVSHMEM_DEFAULT_PMI2=OFF
NVSHMEM_DEFAULT_PMIX=OFF NVSHMEM_DEFAULT_UCX=OFF NVSHMEM_DISABLE_COLL_POLL=ON
NVSHMEM_ENABLE_ALL_DEVICE_INLINING=OFF NVSHMEM_GPU_COLL_USE_LDST=OFF
NVSHMEM_IBGDA_SUPPORT=ON NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=OFF
NVSHMEM_IBDEVX_SUPPORT=OFF NVSHMEM_IBRC_SUPPORT=ON
NVSHMEM_MPI_SUPPORT=ON NVSHMEM_NVTX=ON NVSHMEM_PMIX_SUPPORT=OFF
NVSHMEM_SHMEM_SUPPORT=OFF NVSHMEM_TEST_STATIC_LIB=OFF
NVSHMEM_TIMEOUT_DEVICE_POLLING=OFF NVSHMEM_TRACE=OFF NVSHMEM_UCX_SUPPORT=OFF
NVSHMEM_USE_DLMALLOC=OFF NVSHMEM_USE_NCCL=OFF NVSHMEM_USE_GDRCOPY=ON
NVSHMEM_VERBOSE=OFF CUDA_HOME=/usr/local/cuda GDRCOPY_HOME=/usr/local/gdrdrv
LIBFABRIC_HOME=/usr/local/libfabric MPI_HOME=/usr/local/ompi
NCCL_HOME=/usr/local/nccl NVSHMEM_PREFIX=/usr/local/nvshmem PMIX_HOME=/usr
SHMEM_HOME=/usr/local/ompi UCX_HOME=/usr/local/ucx
Standard options:
NVSHMEM_VERSION false (type: bool, default: false)
Print library version at startup
NVSHMEM_INFO false (type: bool, default: false)
Print environment variable options at startup
NVSHMEM_DISABLE_NVLS false (type: bool, default: false)
Disable NVLS SHARP resources for collectives, even if available for platform
NVSHMEM_SYMMETRIC_SIZE 1073741824 (type: size, default: 1073741824)
Specifies the size (in bytes) of the symmetric heap memory per PE. The
size is implementation-defined and must be at least as large as the integer
ceiling of the product of the numeric prefix and the scaling factor. The
character suffixes for the scaling factor are as follows:
* k or K multiplies by 2^10 (kibibytes)
* m or M multiplies by 2^20 (mebibytes)
* g or G multiplies by 2^30 (gibibytes)
* t or T multiplies by 2^40 (tebibytes)
For example, string '20m' is equivalent to the integer value 20971520, or 20
mebibytes. Similarly the string '3.1M' is equivalent to the integer value
3250586. Only one multiplier is recognized and any characters following the
multiplier are ignored, so '20kk' will not produce the same result as '20m'.
Usage of string '.5m' will yield the same result as the string '0.5m'.
An invalid value for NVSHMEM_SYMMETRIC_SIZE is an error, which the NVSHMEM
library shall report by either returning a nonzero value from
nvshmem_init_thread or causing program termination.
NVSHMEM_DEBUG "" (type: string, default: "")
Set to enable debugging messages.
Optional values: VERSION, WARN, INFO, ABORT, TRACE
Bootstrap options:
NVSHMEM_BOOTSTRAP "PMI" (type: string, default: "PMI")
Name of the default bootstrap that should be used to initialize NVSHMEM.
Allowed values: PMI, MPI, SHMEM, plugin, UID
NVSHMEM_BOOTSTRAP_PMI "PMI" (type: string, default: "PMI")
Name of the PMI bootstrap that should be used to initialize NVSHMEM.
Allowed values: PMI, PMI-2, PMIX
NVSHMEM_BOOTSTRAP_PLUGIN "" (type: string, default: "")
Absolute path to or name of the bootstrap plugin file to load when
NVSHMEM_BOOTSTRAP=plugin is specified
NVSHMEM_BOOTSTRAP_MPI_PLUGIN "nvshmem_bootstrap_mpi.so" (type: string, default: "nvshmem_bootstrap_mpi.so")
Absolute path to or name of the MPI bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_SHMEM_PLUGIN "nvshmem_bootstrap_shmem.so" (type: string, default: "nvshmem_bootstrap_shmem.so")
Absolute path to or name of the SHMEM bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_PMI_PLUGIN "nvshmem_bootstrap_pmi.so" (type: string, default: "nvshmem_bootstrap_pmi.so")
Absolute path to or name of the PMI bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_PMI2_PLUGIN "nvshmem_bootstrap_pmi2.so" (type: string, default: "nvshmem_bootstrap_pmi2.so")
Absolute path to or name of the PMI-2 bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_PMIX_PLUGIN "nvshmem_bootstrap_pmix.so" (type: string, default: "nvshmem_bootstrap_pmix.so")
Absolute path to or name of the PMIx bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_UID_PLUGIN "nvshmem_bootstrap_uid.so" (type: string, default: "nvshmem_bootstrap_uid.so")
Absolute path to or name of the UID bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
Additional options:
NVSHMEM_CUDA_PATH "" (type: string, default: "")
Path to directory containing libcuda.so (for use when not in default location)
NVSHMEM_DEBUG_ATTACH_DELAY 0 (type: int, default: 0)
Delay (in seconds) during the first call to NVSHMEM_INIT to allow for attaching
a debuggger (Default 0)
NVSHMEM_DEBUG_FILE "" (type: string, default: "")
Debugging output filename, may contain %h for hostname and %p for pid
NVSHMEM_MAX_TEAMS 32 (type: long, default: 32)
Maximum number of simultaneous teams allowed
NVSHMEM_MAX_P2P_GPUS 128 (type: int, default: 128)
Maximum number of P2P GPUs
NVSHMEM_MAX_MEMORY_PER_GPU 137438953472 (type: size, default: 137438953472)
Maximum memory per GPU
NVSHMEM_DISABLE_CUDA_VMM false (type: bool, default: false)
Disable use of CUDA VMM for P2P memory mapping. By default, CUDA VMM is enabled
on x86 and disabled on P9. CUDA VMM feature in NVSHMEM requires CUDA RT version
and CUDA Driver version to be greater than or equal to 11.3.
NVSHMEM_DISABLE_P2P false (type: bool, default: false)
Disable P2P connectivity of GPUs even when available
NVSHMEM_IGNORE_CUDA_MPS_ACTIVE_THREAD_PERCENTAGE false (type: bool, default: false)
When doing Multi-Process Per GPU (MPG) run, full API support is available only
if sum of CUDA_MPS_ACTIVE_THREAD_PERCENTAGE of processes running on a GPU is <=
100%. Through this variable, user can request NVSHMEM runtime to ignore the
active thread percentage and allow full MPG support. Users enable it at their
own risk as NVSHMEM might deadlock.
NVSHMEM_CUMEM_GRANULARITY 536870912 (type: size, default: 536870912)
Granularity for cuMemAlloc/cuMemCreate
NVSHMEM_PROXY_REQUEST_BATCH_MAX 32 (type: int, default: 32)
Maxmum number of requests that the proxy thread processes in a single iteration
of the progress loop.
Collectives options:
NVSHMEM_DISABLE_NCCL false (type: bool, default: false)
Disable use of NCCL for collective operations
NVSHMEM_BARRIER_DISSEM_KVAL 2 (type: int, default: 2)
Radix of the dissemination algorithm used for barriers
NVSHMEM_BARRIER_TG_DISSEM_KVAL 2 (type: int, default: 2)
Radix of the dissemination algorithm used for thread group barriers
NVSHMEM_FCOLLECT_LL_THRESHOLD 2048 (type: size, default: 2048)
Message size threshold up to which fcollect LL algo will be used
NVSHMEM_REDUCE_SCRATCH_SIZE 524288 (type: size, default: 524288)
Amount of symmetric heap memory (minimum 16B, multiple of 8B) reserved by
runtime for every team to implement reduce and reducescatter collectives
NVSHMEM_BCAST_ALGO 0 (type: int, default: 0)
Broadcast algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_REDMAXLOC_ALGO 1 (type: int, default: 1)
Reduction algorithm to be used for MAXLOC operation.
* 1 - default, flag alltoall algorithm
* 2 - flat reduce + flat bcast
* 3 - topo-aware two-level reduce + topo-aware bcast
Transport options:
NVSHMEM_REMOTE_TRANSPORT "ibrc" (type: string, default: "ibrc")
Selected transport for remote operations: ibrc, ucx, libfabric, ibdevx, none
NVSHMEM_ENABLE_NIC_PE_MAPPING false (type: bool, default: false)
When not set or set to 0, a PE is assigned the NIC on the node that is closest
to it by distance. When set to 1, NVSHMEM either assigns NICs to PEs on a
round-robin basis or uses NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST when they
are specified.
NVSHMEM_DISABLE_LOCAL_ONLY_PROXY false (type: bool, default: false)
When running on an NVLink-only configuaration (No-IB, No-UCX), completely
disable the proxy thread. This will disable device side global exit and device
side wait timeout polling (enabled by NVSHMEM_TIMEOUT_DEVICE_POLLING build-time
variable) because these are processed by the proxy thread.
NVSHMEM_IB_ENABLE_IBGDA false (type: bool, default: false)
Set to enable GPU-initiated communication transport.
Hidden options:
NVSHMEM_INFO_HIDDEN true (type: bool, default: false)
Print hidden environment variable options at startup
NVSHMEM_DISABLE_NVLS_SHARING true (type: bool, default: true)
Disable NVLS SHARP resource sharing for user-defined teams
NVSHMEM_HEAP_KIND "DEVICE" (type: string, default: "DEVICE")
Specify the memory kind used by the NVSHMEM symmetric heap.
Allowed values: VIDMEM, SYSMEM
NVSHMEM_ENABLE_RAIL_OPT false (type: bool, default: false)
Enable Rail Optimization when heap is in SYSMEM
NVSHMEM_BOOTSTRAP_TWO_STAGE false (type: bool, default: false)
Ignore CUDA device setting during initialization,forcing two-stage
initialization
NVSHMEM_DEBUG_SUBSYS "" (type: string, default: "")
Comma separated list of debugging message sources. Prefix with '^' to exclude.
Values: INIT, COLL, P2P, PROXY, TRANSPORT, MEM, BOOTSTRAP, TOPO, UTIL, ALL
NVSHMEM_ENABLE_ERROR_CHECKS false (type: bool, default: false)
Enable error checks
NVSHMEM_DISABLE_MNNVL false (type: bool, default: false)
Disable MNNVL connectivity for GPUs even when available
NVSHMEM_CUMEM_HANDLE_TYPE "FILE_DESCRIPTOR" (type: string, default: "FILE_DESCRIPTOR")
Handle type for cuMemCreate. Supported are - FABRIC or FILE_DESCRIPTOR
NVSHMEM_BYPASS_ACCESSIBILITY_CHECK false (type: bool, default: false)
Bypass peer GPU accessbility checks
NVSHMEM_FCOLLECT_NTHREADS 512 (type: int, default: 512)
Sets number of threads per block for fcollect collective.
By default, if no env is set, default value is min(max_occupancy per CTA, msg
size per PE).
If env is specified, value overrides the default irrespective of max occupancy
per CTA
NVSHMEM_REDUCESCATTER_NTHREADS 512 (type: int, default: 512)
Sets number of threads per block for reducescatter collective.
By default, if no env is set, default value is min(max_occupancy per CTA, msg
size per PE).
If env is specified, value overrides the default irrespective of max occupancy
per CTA
NVSHMEM_MAX_CTAS 1 (type: int, default: 1)
Sets number of blocks per grid for host onstream collective.
By default, if no env is set, default value to 1 CTA
If env is specified, value overrides the default value
NVSHMEM_REDUCE_RECEXCH_KVAL 2 (type: int, default: 2)
Radix of the recursive exchange reduction algorithm
NVSHMEM_FCOLLECT_LL128_THRESHOLD 0 (type: size, default: 0)
Message size threshold up to which the fcollect LL128 algo will be used.
LL128 will be used only when FCOLLECT_LL_THRESHOLD < size
NVSHMEM_FCOLLECT_NVLS_THRESHOLD 16777216 (type: size, default: 16777216)
Message size threshold up to which fcollect NVLS algo will be used
NVSHMEM_REDUCESCATTER_NVLS_THRESHOLD 16777216 (type: size, default: 16777216)
Message size threshold up to which reducescatter NVLS algo will be used
NVSHMEM_BCAST_TREE_KVAL 2 (type: int, default: 2)
Radix of the broadcast tree algorithm
NVSHMEM_FCOLLECT_ALGO 0 (type: int, default: 0)
Fcollect algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_REDUCE_ALGO 0 (type: int, default: 0)
Allreduce algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_REDUCESCATTER_ALGO 0 (type: int, default: 0)
Reduce Scatter algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_ASSERT_ATOMICS_SYNC false (type: bool, default: false)
Bypass flush on wait_until at target
NVSHMEM_BYPASS_FLUSH false (type: bool, default: false)
Bypass flush in proxy when enforcing consistency
NVTX options:
NVSHMEM_NVTX "off" (type: string, default: "off")
Set to enable NVTX instrumentation. Accepts a comma separated list of
instrumentation groups. By default the NVTX instrumentation is disabled.
init : library setup
alloc : memory management
launch : kernel launch routines
coll : collective communications
wait : blocking point-to-point synchronization
wait_on_stream : point-to-point synchronization (on stream)
test : non-blocking point-to-point synchronization
memorder : memory ordering (quiet, fence)
quiet_on_stream : nvshmemx_quiet_on_stream
atomic_fetch : fetching atomic memory operations
atomic_set : non-fetchong atomic memory operations
rma_blocking : blocking remote memory access operations
rma_nonblocking : non-blocking remote memory access operations
proxy : activity of the proxy thread
common : init,alloc,launch,coll,memorder,wait,atomic_fetch,rma_blocking,proxy
all : all groups
off : disable all NVTX instrumentation
Additional Information
I have the same issue as @HanHan009527. How can this problem be solved?
The error logs indicate an NVSHMEM issue during queue pair (QP) initialization. We recommend contacting NVIDIA technical support for help with RoCE devices, particularly with enabling IBGDA on BlueField-3 (BF3) adapters.
If you're experiencing the same issue, please share your feedback with the NVSHMEM development team to help investigate and resolve the problem.
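As a starting point for RoCE setups, note that IBGDA is off by default (see NVSHMEM_IB_ENABLE_IBGDA in the nvshmem-info dump above). A hedged sketch, assuming the four 400G ports (mlx5_1 through mlx5_4) should carry the traffic rather than the OCP/VPC NIC mlx5_0:

```python
import os

# IBGDA (GPU-initiated transport) is disabled by default.
os.environ["NVSHMEM_IB_ENABLE_IBGDA"] = "1"
# Per the nvshmem-info text above, NVSHMEM_HCA_LIST is honored when
# NVSHMEM_ENABLE_NIC_PE_MAPPING is set; the device:port syntax below is
# an assumption for this host's four 400G RoCE ports.
os.environ["NVSHMEM_ENABLE_NIC_PE_MAPPING"] = "1"
os.environ["NVSHMEM_HCA_LIST"] = "mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1"
```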
Thanks, let me try the NVSHMEM examples on BlueField-3 and contact NVIDIA technical support.
I ran into this problem while running the NVSHMEM perftest; it was resolved by upgrading to nvshmem_src_3.2.5-1.
@WJTian Thank you for your feedback! At the time of our release, the source code for NVSHMEM v3.2.5 was not yet available. I will test it soon!
I tried this solution and luckily the patch file could be applied directly; it really works, LOL. Thanks a lot @WJTian @sphish.
What is going wrong here? I got these errors when applying the patch to nvshmem_src_3.2.5-1:
error: patch failed: CMakeLists.txt:140
error: CMakeLists.txt: patch does not apply
error: patch failed: src/modules/transport/ibgda/ibgda.cpp:2405
error: src/modules/transport/ibgda/ibgda.cpp: patch does not apply
error: patch failed: src/modules/transport/ibgda/ibgda.cpp:2401
error: src/modules/transport/ibgda/ibgda.cpp: patch does not apply
Besides the patch problem, I got another error while compiling DeepEP with nvshmem_src_3.2.5-1.
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/6] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/runtime.o.d -Icsrc/ -I/opt/nvshmem/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /root/DeepEP/csrc/kernels/runtime.cu -o /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/runtime.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -Xcompiler -O3 -rdc=true --ptxas-options=--register-usage-level=10 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=deep_ep_cpp -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_90,code=sm_90 -std=c++17
FAILED: /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/runtime.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/runtime.o.d -Icsrc/ -I/opt/nvshmem/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /root/DeepEP/csrc/kernels/runtime.cu -o /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/runtime.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -Xcompiler -O3 -rdc=true --ptxas-options=--register-usage-level=10 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=deep_ep_cpp -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_90,code=sm_90 -std=c++17
/root/DeepEP/csrc/kernels/ibgda_device.cuh(258): error: class "nvshmemi_ibgda_device_qp" has no member "rx_wq"
auto cq = qp->rx_wq.cq;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(261): error: cannot deduce "auto" type
auto *cqe64 = reinterpret_cast<struct mlx5_cqe64*>(cq->cqe);
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(345): error: class "nvshmemi_ibgda_device_qp_management_v1" has no member "rx_wq"
auto resv_head = mvars->rx_wq.resv_head;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(346): error: class "nvshmemi_ibgda_device_qp_management_v1" has no member "rx_wq"
auto num_valid_slots = resv_head - mvars->rx_wq.cons_idx;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(348): error: class "nvshmemi_ibgda_device_qp_management_v1" has no member "rx_wq"
resv_head = mvars->rx_wq.cons_idx + qp->rx_wq.nwqes;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(348): error: class "nvshmemi_ibgda_device_qp" has no member "rx_wq"
resv_head = mvars->rx_wq.cons_idx + qp->rx_wq.nwqes;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(349): error: class "nvshmemi_ibgda_device_qp_management_v1" has no member "rx_wq"
mvars->rx_wq.resv_head = resv_head;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(353): error: class "nvshmemi_ibgda_device_qp" has no member "rx_wq"
__be32 *dbrec_ptr = qp->rx_wq.dbrec;
^
/root/DeepEP/csrc/kernels/runtime.cu(54): error: class "nvshmemi_ibgda_device_qp" has no member "rx_wq"
for (int i = 0; i < qp->rx_wq.nwqes; ++ i)
^
/root/DeepEP/csrc/kernels/runtime.cu(56): error: class "nvshmemi_ibgda_device_qp_management_v1" has no member "rx_wq"
qp->mvars.rx_wq.resv_head = 0;
^
/root/DeepEP/csrc/kernels/runtime.cu(57): error: class "nvshmemi_ibgda_device_qp_management_v1" has no member "rx_wq"
qp->mvars.rx_wq.cons_idx = 0;
^
11 errors detected in the compilation of "/root/DeepEP/csrc/kernels/runtime.cu".
[2/6] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/internode_ll.o.d -Icsrc/ -I/opt/nvshmem/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /root/DeepEP/csrc/kernels/internode_ll.cu -o /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/internode_ll.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -Xcompiler -O3 -rdc=true --ptxas-options=--register-usage-level=10 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=deep_ep_cpp -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_90,code=sm_90 -std=c++17
FAILED: /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/internode_ll.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/internode_ll.o.d -Icsrc/ -I/opt/nvshmem/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /root/DeepEP/csrc/kernels/internode_ll.cu -o /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/internode_ll.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -Xcompiler -O3 -rdc=true --ptxas-options=--register-usage-level=10 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=deep_ep_cpp -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_90,code=sm_90 -std=c++17
/root/DeepEP/csrc/kernels/ibgda_device.cuh(258): error: class "nvshmemi_ibgda_device_qp" has no member "rx_wq"
auto cq = qp->rx_wq.cq;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(261): error: cannot deduce "auto" type
auto *cqe64 = reinterpret_cast<struct mlx5_cqe64*>(cq->cqe);
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(345): error: class "nvshmemi_ibgda_device_qp_management_v1" has no member "rx_wq"
auto resv_head = mvars->rx_wq.resv_head;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(346): error: class "nvshmemi_ibgda_device_qp_management_v1" has no member "rx_wq"
auto num_valid_slots = resv_head - mvars->rx_wq.cons_idx;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(348): error: class "nvshmemi_ibgda_device_qp_management_v1" has no member "rx_wq"
resv_head = mvars->rx_wq.cons_idx + qp->rx_wq.nwqes;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(348): error: class "nvshmemi_ibgda_device_qp" has no member "rx_wq"
resv_head = mvars->rx_wq.cons_idx + qp->rx_wq.nwqes;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(349): error: class "nvshmemi_ibgda_device_qp_management_v1" has no member "rx_wq"
mvars->rx_wq.resv_head = resv_head;
^
/root/DeepEP/csrc/kernels/ibgda_device.cuh(353): error: class "nvshmemi_ibgda_device_qp" has no member "rx_wq"
__be32 *dbrec_ptr = qp->rx_wq.dbrec;
^
8 errors detected in the compilation of "/root/DeepEP/csrc/kernels/internode_ll.cu".
[3/6] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/intranode.o.d -Icsrc/ -I/opt/nvshmem/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /root/DeepEP/csrc/kernels/intranode.cu -o /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/intranode.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -Xcompiler -O3 -rdc=true --ptxas-options=--register-usage-level=10 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=deep_ep_cpp -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_90,code=sm_90 -std=c++17
[4/6] c++ -MMD -MF /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/deep_ep.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -Icsrc/ -I/opt/nvshmem/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /root/DeepEP/csrc/deep_ep.cpp -o /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/deep_ep.o -O3 -Wno-deprecated-declarations -Wno-unused-variable -Wno-sign-compare -Wno-reorder -Wno-attributes -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=deep_ep_cpp -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
[5/6] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/internode.o.d -Icsrc/ -I/opt/nvshmem/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /root/DeepEP/csrc/kernels/internode.cu -o /root/DeepEP/build/temp.linux-x86_64-cpython-310/csrc/kernels/internode.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -Xcompiler -O3 -rdc=true --ptxas-options=--register-usage-level=10 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=deep_ep_cpp -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_90,code=sm_90 -std=c++17
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/DeepEP/setup.py", line 44, in <module>
    setuptools.setup(
  File "/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py", line 117, in setup
    return distutils.core.setup(**attrs)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 186, in setup
    return run_commands(dist)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 202, in run_commands
    dist.run_commands()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 983, in run_commands
    self.run_command(cmd)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 999, in run_command
    super().run_command(command)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 1002, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install.py", line 109, in run
    self.do_egg_install()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install.py", line 167, in do_egg_install
    self.run_command('bdist_egg')
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 339, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 999, in run_command
    super().run_command(command)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 1002, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/bdist_egg.py", line 177, in run
    cmd = self.call_command('install_lib', warn_dir=False)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/bdist_egg.py", line 163, in call_command
    self.run_command(cmdname)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 339, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 999, in run_command
    super().run_command(command)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 1002, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/install_lib.py", line 19, in run
    self.build()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/install_lib.py", line 110, in build
    self.run_command('build_ext')
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 339, in run_command
    self.distribution.run_command(command)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 999, in run_command
    super().run_command(command)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 1002, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 99, in run
    _build_ext.run(self)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 365, in run
    self.build_extensions()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 868, in build_extensions
    build_ext.build_extensions(self)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 481, in build_extensions
    self._build_extensions_serial()
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 507, in _build_extensions_serial
    self.build_extension(ext)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 264, in build_extension
    _build_ext.build_extension(self, ext)
  File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 562, in build_extension
    objects = self.compiler.compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 681, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1784, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
I have the same problem.
I think it was caused by the missing patch.
Yes, if you use `git apply` with the patch file, you will get that "patch does not apply" error. I used PyCharm to apply the patch file instead; I am not sure what the difference between the two tools is.
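For anyone hitting the same "patch does not apply" error, a sketch for inspecting the rejection before applying anything (nvshmem.patch is a placeholder for the actual DeepEP patch file):

```python
import subprocess

PATCH = "nvshmem.patch"  # placeholder: substitute the real patch path

# Dry-run with git: --check reports why each hunk is rejected
# without modifying the tree.
subprocess.run(["git", "apply", "--check", "--verbose", PATCH])

# GNU patch tolerates shifted context better; preview with --dry-run.
subprocess.run(["patch", "-p1", "--dry-run", "-i", PATCH])
```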
Is there a solution for 3.1.7?
I ran into this problem while doing nvshmem perftest and it was resolved by upgrading to nvshmem_src_3.2.5-1.
Hello, have you tried running the low-latency test on H20? I was able to apply the patch to NVSHMEM 3.2.5, which indeed resolved the QP issue. However, I still encountered other problems while running the low-latency tests.
We haven't tested on the H20 machine, but theoretically it should work. What issues are you encountering? @Dixeran
[nvshmem/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch
Seems like a hardware resource restriction compared to the H800?
Yes, the H20 has fewer SMs compared to the H800. Please refer to https://github.com/deepseek-ai/DeepEP/issues/15#issuecomment-2682343603.
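A quick way to compare the SM budget on your own machine (a sketch; the SM counts in the comment are from public spec sheets, not verified here):

```python
import torch

# Cooperative launches fail when the grid asks for more resident blocks
# than the device can co-schedule; the H20 has fewer SMs than the H800
# (public specs: 78 vs. 132), so grids tuned for the H800 can exceed it.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.multi_processor_count} SMs")
```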
Had you already applied the patch? This "patch does not apply" problem appears when I try to apply the patch a second time.
@xing0821 Have you pulled the newest code? The latest patch should have solved this problem.