DeepEP
Internode low-latency test: dispatch & combine finish correctly, but the test exits with an error
When I ran the internode low-latency test, dispatch and combine finished correctly, but the test dumped an error at the end.
Node 1:
root@1e604ac9d157:/home/dpsk_a2a/DeepEP# NCCL_DEBUG=WARN MASTER_ADDR=<MASTER_IP> WORLD_SIZE=2 RANK=0 python3 tests/test_low_latency.py
Allocating buffer size: 2115.111296 MB ...
......
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 0 nranks 16 tag 4 - DONE
[rank 0] Dispatch + combine bandwidth: 22.65 GB/s, avg_t=973.58 us, min_t=962.43 us, max_t=994.94 us
[rank 1] Dispatch + combine bandwidth: 22.64 GB/s, avg_t=973.81 us, min_t=961.82 us, max_t=995.49 us
[rank 6] Dispatch + combine bandwidth: 22.64 GB/s, avg_t=973.93 us, min_t=961.12 us, max_t=997.79 us
[rank 2] Dispatch + combine bandwidth: 22.64 GB/s, avg_t=973.71 us, min_t=964.48 us, max_t=987.07 us
[rank 4] Dispatch + combine bandwidth: 22.64 GB/s, avg_t=973.75 us, min_t=960.96 us, max_t=996.83 us
[rank 7] Dispatch + combine bandwidth: 22.64 GB/s, avg_t=973.71 us, min_t=964.35 us, max_t=993.25 us
[rank 3] Dispatch + combine bandwidth: 22.64 GB/s, avg_t=973.98 us, min_t=956.42 us, max_t=992.29 us
[rank 5] Dispatch + combine bandwidth: 22.65 GB/s, avg_t=973.53 us, min_t=954.53 us, max_t=996.86 us
[rank 1] Dispatch bandwidth: 21.22 GB/s, avg_t=353.99 us | Combine bandwidth: 22.93 GB/s, avg_t=634.05 us
[rank 4] Dispatch bandwidth: 21.76 GB/s, avg_t=345.21 us | Combine bandwidth: 22.92 GB/s, avg_t=634.31 us
[rank 0] Dispatch bandwidth: 22.30 GB/s, avg_t=336.84 us | Combine bandwidth: 23.01 GB/s, avg_t=631.76 us
[rank 2] Dispatch bandwidth: 21.51 GB/s, avg_t=349.20 us | Combine bandwidth: 23.01 GB/s, avg_t=631.73 us
[rank 3] Dispatch bandwidth: 21.41 GB/s, avg_t=350.90 us | Combine bandwidth: 22.91 GB/s, avg_t=634.48 us
[rank 6] Dispatch bandwidth: 22.45 GB/s, avg_t=334.54 us | Combine bandwidth: 22.94 GB/s, avg_t=633.66 us
[rank 5] Dispatch bandwidth: 22.08 GB/s, avg_t=340.25 us | Combine bandwidth: 22.92 GB/s, avg_t=634.31 us
[rank 7] Dispatch bandwidth: 22.18 GB/s, avg_t=338.64 us | Combine bandwidth: 23.41 GB/s, avg_t=621.07 us
[rank 1] Dispatch send/recv time: 28.29 us | Combine send/recv time: 28.58 us
[rank 4] Dispatch send/recv time: 28.25 us | Combine send/recv time: 28.81 us
[rank 2] Dispatch send/recv time: 28.45 us | Combine send/recv time: 28.79 us
[rank 3] Dispatch send/recv time: 28.26 us | Combine send/recv time: 28.85 us
[rank 0] Dispatch send/recv time: 27.99 us | Combine send/recv time: 28.84 us
[rank 6] Dispatch send/recv time: 27.97 us | Combine send/recv time: 28.56 us
[rank 5] Dispatch send/recv time: 28.12 us | Combine send/recv time: 28.94 us
[rank 7] Dispatch send/recv time: 28.24 us | Combine send/recv time: 29.03 us
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 4 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 0 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 2 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 5 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 7 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 1 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 6 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 3 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 6 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 7 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 5 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 3 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 1 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 2 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 4 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 0 nranks 16 tag 4 - DONE
W0513 15:52:40.025000 7102 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 7230 via signal SIGTERM
W0513 15:52:40.026000 7102 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 7231 via signal SIGTERM
W0513 15:52:40.026000 7102 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 7233 via signal SIGTERM
W0513 15:52:40.027000 7102 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 7234 via signal SIGTERM
W0513 15:52:40.027000 7102 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 7235 via signal SIGTERM
W0513 15:52:40.027000 7102 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 7236 via signal SIGTERM
W0513 15:52:40.027000 7102 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 7237 via signal SIGTERM
Traceback (most recent call last):
  File "/home/dpsk_a2a/DeepEP/tests/test_low_latency.py", line 172, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/opt/ac2/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 184, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGSEGV
Node 2:
root@0540de603a8e:/home/dpsk_a2a/DeepEP# NCCL_DEBUG=WARN MASTER_ADDR=<MASTER_IP> WORLD_SIZE=2 RANK=1 python3 tests/test_low_latency.py
Allocating buffer size: 2115.111296 MB ...
......
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 8 nranks 16 tag 4 - DONE
[rank 14] Dispatch + combine bandwidth: 22.65 GB/s, avg_t=973.49 us, min_t=958.24 us, max_t=993.06 us
[rank 10] Dispatch + combine bandwidth: 22.65 GB/s, avg_t=973.50 us, min_t=955.65 us, max_t=992.26 us
[rank 15] Dispatch + combine bandwidth: 22.65 GB/s, avg_t=973.40 us, min_t=956.29 us, max_t=994.27 us
[rank 8] Dispatch + combine bandwidth: 22.65 GB/s, avg_t=973.64 us, min_t=961.98 us, max_t=993.25 us
[rank 13] Dispatch + combine bandwidth: 22.65 GB/s, avg_t=973.62 us, min_t=959.10 us, max_t=995.14 us
[rank 12] Dispatch + combine bandwidth: 22.65 GB/s, avg_t=973.30 us, min_t=955.07 us, max_t=994.40 us
[rank 11] Dispatch + combine bandwidth: 22.65 GB/s, avg_t=973.62 us, min_t=961.54 us, max_t=991.52 us
[rank 9] Dispatch + combine bandwidth: 22.64 GB/s, avg_t=973.71 us, min_t=964.22 us, max_t=988.48 us
[rank 14] Dispatch bandwidth: 22.24 GB/s, avg_t=337.70 us | Combine bandwidth: 22.75 GB/s, avg_t=639.08 us
[rank 10] Dispatch bandwidth: 21.51 GB/s, avg_t=349.25 us | Combine bandwidth: 23.00 GB/s, avg_t=631.95 us
[rank 15] Dispatch bandwidth: 22.62 GB/s, avg_t=332.15 us | Combine bandwidth: 22.74 GB/s, avg_t=639.28 us
[rank 11] Dispatch bandwidth: 22.15 GB/s, avg_t=339.13 us | Combine bandwidth: 23.00 GB/s, avg_t=632.14 us
[rank 8] Dispatch bandwidth: 23.14 GB/s, avg_t=324.60 us | Combine bandwidth: 22.64 GB/s, avg_t=642.14 us
[rank 9] Dispatch bandwidth: 21.42 GB/s, avg_t=350.74 us | Combine bandwidth: 22.83 GB/s, avg_t=636.74 us
[rank 13] Dispatch bandwidth: 22.47 GB/s, avg_t=334.25 us | Combine bandwidth: 22.79 GB/s, avg_t=637.89 us
[rank 12] Dispatch bandwidth: 21.82 GB/s, avg_t=344.33 us | Combine bandwidth: 23.08 GB/s, avg_t=629.81 us
[rank 9] Dispatch send/recv time: 28.80 us | Combine send/recv time: 28.67 us
[rank 13] Dispatch send/recv time: 28.36 us | Combine send/recv time: 28.63 us
[rank 8] Dispatch send/recv time: 28.82 us | Combine send/recv time: 28.52 us
[rank 11] Dispatch send/recv time: 27.96 us | Combine send/recv time: 28.54 us
[rank 14] Dispatch send/recv time: 27.76 us | Combine send/recv time: 28.58 us
[rank 12] Dispatch send/recv time: 28.77 us | Combine send/recv time: 29.07 us
[rank 15] Dispatch send/recv time: 28.35 us | Combine send/recv time: 28.78 us
[rank 10] Dispatch send/recv time: 28.70 us | Combine send/recv time: 28.78 us
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 8 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 13 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 10 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 9 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 11 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 14 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 15 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:559: rank 12 nranks 16 tag 0 - ENTER
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 11 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 14 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 15 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 9 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 13 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 10 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 12 nranks 16 tag 4 - DONE
/home/dpsk_a2a/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_uid_barrier:575: rank 8 nranks 16 tag 4 - DONE
W0513 15:52:34.833000 5810 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 5939 via signal SIGTERM
W0513 15:52:34.833000 5810 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 5940 via signal SIGTERM
W0513 15:52:34.834000 5810 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 5941 via signal SIGTERM
W0513 15:52:34.834000 5810 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 5942 via signal SIGTERM
W0513 15:52:34.834000 5810 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 5943 via signal SIGTERM
W0513 15:52:34.834000 5810 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 5944 via signal SIGTERM
W0513 15:52:34.834000 5810 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 5945 via signal SIGTERM
Traceback (most recent call last):
  File "/home/dpsk_a2a/DeepEP/tests/test_low_latency.py", line 172, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/opt/ac2/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 184, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
The intranode test finishes with no errors. The internode test with normal (non-low-latency) kernels fails with the same error. The failure reproduces 100% of the time.
Env: 2 x 8 x H20.
root@0540de603a8e:/home/dpsk_a2a/DeepEP# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE SYS SYS 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE SYS SYS 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE SYS SYS 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE PIX SYS SYS 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS PIX NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS NODE NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS NODE PIX 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS NODE NODE 48-95,144-191 1 N/A
NIC0 NODE PIX NODE NODE SYS SYS SYS SYS X NODE SYS SYS
NIC1 NODE NODE NODE PIX SYS SYS SYS SYS NODE X SYS SYS
NIC2 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS X NODE
NIC3 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
With this env configured:
export NVSHMEM_ENABLE_NIC_PE_MAPPING=1
export NVSHMEM_HCA_PE_MAPPING="mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
NVSHMEM will SIGSEGV. I guess this is because an IB object is released repeatedly (a double free) inside NVSHMEM.
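A minimal way to test this hypothesis (a sketch, not a confirmed fix): drop the explicit PE mapping so NVSHMEM falls back to its default distance-based NIC selection, then re-run the same test. If the teardown SIGSEGV disappears, the crash is tied to the explicit NVSHMEM_HCA_PE_MAPPING.

# Hypothesis check: let NVSHMEM pick the closest NIC per PE (its default
# behavior when NVSHMEM_ENABLE_NIC_PE_MAPPING is unset).
unset NVSHMEM_ENABLE_NIC_PE_MAPPING
unset NVSHMEM_HCA_PE_MAPPING
# Re-run the failing test with everything else unchanged:
NCCL_DEBUG=WARN MASTER_ADDR=<MASTER_IP> WORLD_SIZE=2 RANK=0 python3 tests/test_low_latency.py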
root@1e604ac9d157:/home/dpsk_a2a/DeepEP# /opt/nvshmem/bin/nvshmem-info -a
NVSHMEM v3.2.5
Build Information:
CUDA API 12080
CUDA Driver 12080
Build Timestamp May 13 2025 22:28:19
Build Variables
NVSHMEM_DEBUG=OFF NVSHMEM_DEVEL=OFF NVSHMEM_DEFAULT_PMI2=OFF
NVSHMEM_DEFAULT_PMIX=OFF NVSHMEM_DEFAULT_UCX=OFF
NVSHMEM_ENABLE_ALL_DEVICE_INLINING=OFF NVSHMEM_GPU_COLL_USE_LDST=OFF
NVSHMEM_IBGDA_SUPPORT=ON NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=OFF
NVSHMEM_IBDEVX_SUPPORT=OFF NVSHMEM_IBRC_SUPPORT=ON
NVSHMEM_MPI_SUPPORT=OFF NVSHMEM_NVTX=ON NVSHMEM_PMIX_SUPPORT=OFF
NVSHMEM_SHMEM_SUPPORT=OFF NVSHMEM_TEST_STATIC_LIB=OFF
NVSHMEM_TIMEOUT_DEVICE_POLLING=OFF NVSHMEM_TRACE=OFF NVSHMEM_UCX_SUPPORT=OFF
NVSHMEM_USE_DLMALLOC=OFF NVSHMEM_USE_NCCL=OFF NVSHMEM_USE_GDRCOPY=ON
NVSHMEM_VERBOSE=OFF CUDA_HOME=/usr/local/cuda GDRCOPY_HOME=/opt/gdrcopy
SHMEM_HOME=/usr/local/ompi UCX_HOME=/usr/local/ucx
Standard options:
NVSHMEM_VERSION false (type: bool, default: false)
Print library version at startup
NVSHMEM_INFO false (type: bool, default: false)
Print environment variable options at startup
NVSHMEM_DISABLE_NVLS false (type: bool, default: false)
Disable NVLS SHARP resources for collectives, even if available for platform
NVSHMEM_SYMMETRIC_SIZE 1073741824 (type: size, default: 1073741824)
Specifies the size (in bytes) of the symmetric heap memory per PE. The
size is implementation-defined and must be at least as large as the integer
ceiling of the product of the numeric prefix and the scaling factor. The
character suffixes for the scaling factor are as follows:
* k or K multiplies by 2^10 (kibibytes)
* m or M multiplies by 2^20 (mebibytes)
* g or G multiplies by 2^30 (gibibytes)
* t or T multiplies by 2^40 (tebibytes)
For example, string '20m' is equivalent to the integer value 20971520, or 20
mebibytes. Similarly the string '3.1M' is equivalent to the integer value
3250586. Only one multiplier is recognized and any characters following the
multiplier are ignored, so '20kk' will not produce the same result as '20m'.
Usage of string '.5m' will yield the same result as the string '0.5m'.
An invalid value for NVSHMEM_SYMMETRIC_SIZE is an error, which the NVSHMEM
library shall report by either returning a nonzero value from
nvshmem_init_thread or causing program termination.
NVSHMEM_DEBUG "WARN" (type: string, default: "")
Set to enable debugging messages.
Optional values: VERSION, WARN, INFO, ABORT, TRACE
Bootstrap options:
NVSHMEM_BOOTSTRAP "PMI" (type: string, default: "PMI")
Name of the default bootstrap that should be used to initialize NVSHMEM.
Allowed values: PMI, MPI, SHMEM, plugin, UID
NVSHMEM_BOOTSTRAP_PMI "PMI" (type: string, default: "PMI")
Name of the PMI bootstrap that should be used to initialize NVSHMEM.
Allowed values: PMI, PMI-2, PMIX
NVSHMEM_BOOTSTRAP_PLUGIN "" (type: string, default: "")
Absolute path to or name of the bootstrap plugin file to load when
NVSHMEM_BOOTSTRAP=plugin is specified
NVSHMEM_BOOTSTRAP_MPI_PLUGIN "nvshmem_bootstrap_mpi.so.3" (type: string, default: "nvshmem_bootstrap_mpi.so.3")
Absolute path to or name of the MPI bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_SHMEM_PLUGIN "nvshmem_bootstrap_shmem.so.3" (type: string, default: "nvshmem_bootstrap_shmem.so.3")
Absolute path to or name of the SHMEM bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_PMI_PLUGIN "nvshmem_bootstrap_pmi.so.3" (type: string, default: "nvshmem_bootstrap_pmi.so.3")
Absolute path to or name of the PMI bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_PMI2_PLUGIN "nvshmem_bootstrap_pmi2.so.3" (type: string, default: "nvshmem_bootstrap_pmi2.so.3")
Absolute path to or name of the PMI-2 bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_PMIX_PLUGIN "nvshmem_bootstrap_pmix.so.3" (type: string, default: "nvshmem_bootstrap_pmix.so.3")
Absolute path to or name of the PMIx bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_UID_PLUGIN "nvshmem_bootstrap_uid.so.3" (type: string, default: "nvshmem_bootstrap_uid.so.3")
Absolute path to or name of the UID bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
Additional options:
NVSHMEM_CUDA_PATH "" (type: string, default: "")
Path to directory containing libcuda.so (for use when not in default location)
NVSHMEM_DEBUG_ATTACH_DELAY 0 (type: int, default: 0)
Delay (in seconds) during the first call to NVSHMEM_INIT to allow for attaching
a debuggger (Default 0)
NVSHMEM_DEBUG_FILE "" (type: string, default: "")
Debugging output filename, may contain %h for hostname and %p for pid
NVSHMEM_MAX_TEAMS 32 (type: long, default: 32)
Maximum number of simultaneous teams allowed
NVSHMEM_MAX_MEMORY_PER_GPU 137438953472 (type: size, default: 137438953472)
Maximum memory per GPU
NVSHMEM_DISABLE_CUDA_VMM false (type: bool, default: false)
Disable use of CUDA VMM for P2P memory mapping. By default, CUDA VMM is enabled
on x86 and disabled on P9. CUDA VMM feature in NVSHMEM requires CUDA RT version
and CUDA Driver version to be greater than or equal to 11.3.
NVSHMEM_DISABLE_P2P false (type: bool, default: false)
Disable P2P connectivity of GPUs even when available
NVSHMEM_IGNORE_CUDA_MPS_ACTIVE_THREAD_PERCENTAGE false (type: bool, default: false)
When doing Multi-Process Per GPU (MPG) run, full API support is available only
if sum of CUDA_MPS_ACTIVE_THREAD_PERCENTAGE of processes running on a GPU is <=
100%. Through this variable, user can request NVSHMEM runtime to ignore the
active thread percentage and allow full MPG support. Users enable it at their
own risk as NVSHMEM might deadlock.
NVSHMEM_CUMEM_GRANULARITY 536870912 (type: size, default: 536870912)
Granularity for cuMemAlloc/cuMemCreate
NVSHMEM_PROXY_REQUEST_BATCH_MAX 32 (type: int, default: 32)
Maxmum number of requests that the proxy thread processes in a single iteration
of the progress loop.
Collectives options:
NVSHMEM_DISABLE_NCCL false (type: bool, default: false)
Disable use of NCCL for collective operations
NVSHMEM_BARRIER_DISSEM_KVAL 2 (type: int, default: 2)
Radix of the dissemination algorithm used for barriers
NVSHMEM_BARRIER_TG_DISSEM_KVAL 2 (type: int, default: 2)
Radix of the dissemination algorithm used for thread group barriers
NVSHMEM_FCOLLECT_LL_THRESHOLD 2048 (type: size, default: 2048)
Message size threshold up to which fcollect LL algo will be used
NVSHMEM_REDUCE_SCRATCH_SIZE 524288 (type: size, default: 524288)
Amount of symmetric heap memory (minimum 16B, multiple of 8B) reserved by
runtime for every team to implement reduce and reducescatter collectives
NVSHMEM_BCAST_ALGO 0 (type: int, default: 0)
Broadcast algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_REDMAXLOC_ALGO 1 (type: int, default: 1)
Reduction algorithm to be used for MAXLOC operation.
* 1 - default, flag alltoall algorithm
* 2 - flat reduce + flat bcast
* 3 - topo-aware two-level reduce + topo-aware bcast
Transport options:
NVSHMEM_REMOTE_TRANSPORT "ibrc" (type: string, default: "ibrc")
Selected transport for remote operations: ibrc, ucx, libfabric, ibdevx, none
NVSHMEM_ENABLE_NIC_PE_MAPPING true (type: bool, default: false)
When not set or set to 0, a PE is assigned the NIC on the node that is closest
to it by distance. When set to 1, NVSHMEM either assigns NICs to PEs on a
round-robin basis or uses NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST when they
are specified.
NVSHMEM_DISABLE_LOCAL_ONLY_PROXY false (type: bool, default: false)
When running on an NVLink-only configuaration (No-IB, No-UCX), completely
disable the proxy thread. This will disable device side global exit and device
side wait timeout polling (enabled by NVSHMEM_TIMEOUT_DEVICE_POLLING build-time
variable) because these are processed by the proxy thread.
NVSHMEM_IB_ENABLE_IBGDA false (type: bool, default: false)
Set to enable GPU-initiated communication transport.
Hidden options:
NVSHMEM_INFO_HIDDEN true (type: bool, default: false)
Print hidden environment variable options at startup
NVSHMEM_DISABLE_NVLS_SHARING true (type: bool, default: true)
Disable NVLS SHARP resource sharing for user-defined teams
NVSHMEM_HEAP_KIND "DEVICE" (type: string, default: "DEVICE")
Specify the memory kind used by the NVSHMEM symmetric heap.
Allowed values: VIDMEM, SYSMEM
NVSHMEM_ENABLE_RAIL_OPT false (type: bool, default: false)
Enable Rail Optimization when heap is in SYSMEM
NVSHMEM_BOOTSTRAP_TWO_STAGE false (type: bool, default: false)
Ignore CUDA device setting during initialization,forcing two-stage
initialization
NVSHMEM_DEBUG_SUBSYS "" (type: string, default: "")
Comma separated list of debugging message sources. Prefix with '^' to exclude.
Values: INIT, COLL, P2P, PROXY, TRANSPORT, MEM, BOOTSTRAP, TOPO, UTIL, ALL
NVSHMEM_ENABLE_ERROR_CHECKS false (type: bool, default: false)
Enable error checks
NVSHMEM_DISABLE_MNNVL false (type: bool, default: false)
Disable MNNVL connectivity for GPUs even when available
NVSHMEM_CUMEM_HANDLE_TYPE "FILE_DESCRIPTOR" (type: string, default: "FILE_DESCRIPTOR")
Handle type for cuMemCreate. Supported are - FABRIC or FILE_DESCRIPTOR
NVSHMEM_BYPASS_ACCESSIBILITY_CHECK false (type: bool, default: false)
Bypass peer GPU accessbility checks
NVSHMEM_FCOLLECT_NTHREADS 512 (type: int, default: 512)
Sets number of threads per block for fcollect collective.
By default, if no env is set, default value is min(max_occupancy per CTA, msg
size per PE).
If env is specified, value overrides the default irrespective of max occupancy
per CTA
NVSHMEM_REDUCESCATTER_NTHREADS 512 (type: int, default: 512)
Sets number of threads per block for reducescatter collective.
By default, if no env is set, default value is min(max_occupancy per CTA, msg
size per PE).
If env is specified, value overrides the default irrespective of max occupancy
per CTA
NVSHMEM_MAX_CTAS 1 (type: int, default: 1)
Sets number of blocks per grid for host onstream collective.
By default, if no env is set, default value to 1 CTA
If env is specified, value overrides the default value
NVSHMEM_REDUCE_RECEXCH_KVAL 2 (type: int, default: 2)
Radix of the recursive exchange reduction algorithm
NVSHMEM_FCOLLECT_LL128_THRESHOLD 0 (type: size, default: 0)
Message size threshold up to which the fcollect LL128 algo will be used.
LL128 will be used only when FCOLLECT_LL_THRESHOLD < size
NVSHMEM_FCOLLECT_NVLS_THRESHOLD 16777216 (type: size, default: 16777216)
Message size threshold up to which fcollect NVLS algo will be used
NVSHMEM_REDUCESCATTER_NVLS_THRESHOLD 16777216 (type: size, default: 16777216)
Message size threshold up to which reducescatter NVLS algo will be used
NVSHMEM_BCAST_TREE_KVAL 2 (type: int, default: 2)
Radix of the broadcast tree algorithm
NVSHMEM_FCOLLECT_ALGO 0 (type: int, default: 0)
Fcollect algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_REDUCE_ALGO 0 (type: int, default: 0)
Allreduce algorithm to be used.
* 0/1 - use default algorithm selection strategy
NVSHMEM_REDUCE_NVLS_THRESHOLD 2048 (type: int, default: 2048)
Message size threshold up to which allreduce one-shot algo will be used
NVSHMEM_REDUCESCATTER_ALGO 0 (type: int, default: 0)
Reduce Scatter algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_ASSERT_ATOMICS_SYNC false (type: bool, default: false)
Bypass flush on wait_until at target
NVSHMEM_BYPASS_FLUSH false (type: bool, default: false)
Bypass flush in proxy when enforcing consistency
NVTX options:
NVSHMEM_NVTX "off" (type: string, default: "off")
Set to enable NVTX instrumentation. Accepts a comma separated list of
instrumentation groups. By default the NVTX instrumentation is disabled.
init : library setup
alloc : memory management
launch : kernel launch routines
coll : collective communications
wait : blocking point-to-point synchronization
wait_on_stream : point-to-point synchronization (on stream)
test : non-blocking point-to-point synchronization
memorder : memory ordering (quiet, fence)
quiet_on_stream : nvshmemx_quiet_on_stream
atomic_fetch : fetching atomic memory operations
atomic_set : non-fetchong atomic memory operations
rma_blocking : blocking remote memory access operations
rma_nonblocking : non-blocking remote memory access operations
proxy : activity of the proxy thread
common : init,alloc,launch,coll,memorder,wait,atomic_fetch,rma_blocking,proxy
all : all groups
off : disable all NVTX instrumentation
@yuan-luo, may I ask if you have solved this problem? I encountered a similar one:
NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME=ens19f0 MASTER_ADDR=172.16.1.134 WORLD_SIZE=2 RANK=0 python DeepEP/tests/test_low_latency.py
Allocating buffer size: 2115.111296 MB ...
[rank 0] Dispatch + combine bandwidth: 23.77 GB/s, avg_t=927.67 us, min_t=790.27 us, max_t=946.85 us
[rank 6] Dispatch + combine bandwidth: 23.77 GB/s, avg_t=927.48 us, min_t=784.70 us, max_t=946.30 us
[rank 7] Dispatch + combine bandwidth: 23.77 GB/s, avg_t=927.54 us, min_t=784.58 us, max_t=946.40 us
[rank 3] Dispatch + combine bandwidth: 23.78 GB/s, avg_t=927.20 us, min_t=791.84 us, max_t=945.63 us
[rank 5] Dispatch + combine bandwidth: 23.77 GB/s, avg_t=927.46 us, min_t=791.14 us, max_t=944.67 us
[rank 2] Dispatch + combine bandwidth: 23.78 GB/s, avg_t=927.33 us, min_t=789.38 us, max_t=949.34 us
[rank 1] Dispatch + combine bandwidth: 23.77 GB/s, avg_t=927.61 us, min_t=784.67 us, max_t=946.27 us
[rank 4] Dispatch + combine bandwidth: 23.77 GB/s, avg_t=927.67 us, min_t=794.78 us, max_t=945.12 us
[rank 0] Dispatch bandwidth: 38.82 GB/s, avg_t=193.48 us | Combine bandwidth: 43.14 GB/s, avg_t=336.98 us
[rank 6] Dispatch bandwidth: 39.91 GB/s, avg_t=188.19 us | Combine bandwidth: 42.89 GB/s, avg_t=338.92 us
[rank 3] Dispatch bandwidth: 39.87 GB/s, avg_t=188.42 us | Combine bandwidth: 42.96 GB/s, avg_t=338.36 us
[rank 5] Dispatch bandwidth: 39.38 GB/s, avg_t=190.77 us | Combine bandwidth: 43.29 GB/s, avg_t=335.82 us
[rank 4] Dispatch bandwidth: 39.14 GB/s, avg_t=191.90 us | Combine bandwidth: 43.23 GB/s, avg_t=336.27 us
[rank 1] Dispatch bandwidth: 39.16 GB/s, avg_t=191.80 us | Combine bandwidth: 43.12 GB/s, avg_t=337.12 us
[rank 7] Dispatch bandwidth: 39.87 GB/s, avg_t=188.41 us | Combine bandwidth: 43.00 GB/s, avg_t=338.03 us
[rank 2] Dispatch bandwidth: 39.68 GB/s, avg_t=189.30 us | Combine bandwidth: 42.86 GB/s, avg_t=339.20 us
[rank 7] Dispatch send/recv time: 35.53 us | Combine send/recv time: 40.60 us
[rank 1] Dispatch send/recv time: 39.25 us | Combine send/recv time: 40.27 us
[rank 2] Dispatch send/recv time: 36.21 us | Combine send/recv time: 41.88 us
[rank 4] Dispatch send/recv time: 35.42 us | Combine send/recv time: 41.97 us
[rank 6] Dispatch send/recv time: 34.60 us | Combine send/recv time: 40.50 us
[rank 3] Dispatch send/recv time: 36.04 us | Combine send/recv time: 41.72 us
[rank 0] Dispatch send/recv time: 38.82 us | Combine send/recv time: 40.93 us
[rank 5] Dispatch send/recv time: 35.33 us | Combine send/recv time: 40.70 us
[h20ln1:208 :0:208] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55cff010a)
[h20ln1:205 :0:205] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x561bbbd70)
[h20ln1:209 :0:209] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55f575575)
[h20ln1:210 :0:210] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x5623a6133)
[h20ln1:206 :0:206] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x5556227d8)
[h20ln1:207 :0:207] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55e0b0c75)
[h20ln1:203 :0:203] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55c7b7acb)
[h20ln1:204 :0:204] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x55e823eb6)
==== backtrace (tid: 203) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000001a0d6 ibv_dereg_mr() ???:0
2 0x000000000000cddc nvshmemt_ibrc_finalize() :0
3 0x0000000000220912 nvshmemi_transport_finalize() ???:0
4 0x00000000000b4859 nvshmemid_hostlib_finalize() ???:0
5 0x00000000001b2e7f nvshmemi_finalize() ???:0
6 0x0000000000055132 deep_ep::Buffer::~Buffer() /sgl-workspace/DeepEP/csrc/deep_ep.cpp:106
7 0x0000000000068e86 std::default_delete<deep_ep::Buffer>::operator()() /usr/include/c++/11/bits/unique_ptr.h:85
8 0x0000000000068e86 std::unique_ptr<deep_ep::Buffer, std::default_delete<deep_ep::Buffer> >::~unique_ptr() /usr/include/c++/11/bits/unique_ptr.h:361
9 0x0000000000068e86 pybind11::class_<deep_ep::Buffer>::dealloc() /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1926
10 0x0000000000516907 pybind11::detail::clear_instance() :0
11 0x00000000005174d1 pybind11_object_dealloc() :0
12 0x0000000000169b93 _Py_CheckFunctionResult() ???:0
13 0x00000000001a2407 PyObject_DelItem() ???:0
14 0x0000000000181370 PyMapping_Check() ???:0
15 0x000000000018b6a3 _PyFunction_Vectorcall() ???:0
16 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
17 0x000000000018b66c _PyFunction_Vectorcall() ???:0
18 0x0000000000177cf3 _PyEval_EvalFrameDefault() ???:0
19 0x000000000018b66c _PyFunction_Vectorcall() ???:0
20 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
21 0x000000000018b66c _PyFunction_Vectorcall() ???:0
22 0x0000000000175a74 _PyEval_EvalFrameDefault() ???:0
23 0x000000000018b66c _PyFunction_Vectorcall() ???:0
24 0x000000000017592f _PyEval_EvalFrameDefault() ???:0
25 0x000000000018b66c _PyFunction_Vectorcall() ???:0
26 0x0000000000176b43 _PyEval_EvalFrameDefault() ???:0
27 0x0000000000259f56 PyEval_EvalCode() ???:0
28 0x0000000000259e26 PyEval_EvalCode() ???:0
29 0x0000000000280808 PyUnicode_Tailmatch() ???:0
30 0x000000000027b00f PyInit__collections() ???:0
31 0x0000000000274d91 PyRun_StringFlags() ???:0
32 0x0000000000274c41 PyRun_SimpleStringFlags() ???:0
33 0x0000000000273f70 Py_RunMain() ???:0
34 0x000000000024de6d Py_BytesMain() ???:0
35 0x0000000000029d90 __libc_init_first() ???:0
36 0x0000000000029e40 __libc_start_main() ???:0
37 0x000000000024dd65 _start() ???:0
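For what it's worth, the backtrace points at ibv_dereg_mr inside nvshmemt_ibrc_finalize, reached from deep_ep::Buffer::~Buffer() while the Python interpreter is already shutting down (the frames below it are Py_RunMain and friends). A quick way to separate "finalize crashes whenever it runs" from "finalize crashes only during interpreter teardown" is to force the buffer's destructor to run at a controlled point. A sketch, assuming `buffer` is the deep_ep.Buffer and `group` the process group created earlier in test_loop:

# Sketch, not a confirmed fix; add at the end of the test body, before return.
import gc
import torch.distributed as dist

dist.barrier(group)   # keep ranks in step before any rank starts finalizing
del buffer            # drop the last Python reference to the deep_ep.Buffer
gc.collect()          # force pybind11 to run ~Buffer() -> nvshmemi_finalize() here
# If the SIGSEGV still fires at this point, the double free is inside the IBRC
# transport finalize itself, not an artifact of interpreter-shutdown ordering.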