UCX using multiple interfaces causing performance to drop due to higher latency
Describe the bug
On a system whose nodes have four CPU sockets, each socket with its own dedicated NIC, UCX seems to be using more than one NIC for a given MPI rank. Due to the high latency of crossing sockets, this leads to performance degradation. The NICs on this system are NDR IB NICs operating at 200 Gbps, so the peak bidirectional bandwidth per NIC should be about 50 GB/s. When we run OSU's Bi-BW test, however, we see up to 90 GB/s of bandwidth. Investigation of the UCX logs (obtained with UCX_LOG_LEVEL=data) for this test suggests all 4 NICs are being used for transfers.
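The arithmetic behind that expectation (a rough check, ignoring protocol overhead): 200 Gbps / 8 = 25 GB/s per direction, and 25 GB/s x 2 directions = 50 GB/s bidirectional peak per NIC. A sustained ~90 GB/s is therefore only possible if more than one NIC is carrying traffic.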
When we force the use of a single NIC, e.g. with UCX_NET_DEVICES=mlx5_3:1 and proper pinning, the Bi-BW obtained from the OSU test peaks at 50 GB/s, and investigation of the logs suggests that only the specified NIC is being used for the transfers. However, using only one NIC adversely affects the scaling efficiency of workloads.
As an alternative mitigation we set UCX_MAX_RNDV_RAILS=1 (with no other UCX environment variables); this also brings the Bi-BW obtained with the OSU test back down to 50 GB/s. However, investigation of the logs suggests that all 4 NICs are still being used.
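For reference, a rough way to tally per-device activity in such a log (a sketch, assuming the mlx5_X:1 device names appear in the data-level log lines, and that UCX_LOG_FILE with its %h/%p substitutions is available to split the logs per host/process):
mpirun -np 2 --map-by ppr:1:node -H hpc-04:1,hpc-02:1 -x UCX_LOG_LEVEL=data -x UCX_LOG_FILE=ucx.%h.%p.log ./osu_bibw
grep -ohE 'mlx5_[0-9]+:1' ucx.*.log | sort | uniq -c
A device that never shows up in the counts is not being touched by UCX in that run.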
UCX logs are attached here for three runs:
- with -x UCX_MAX_RNDV_RAILS=1
- with -x UCX_NET_DEVICES=mlx5_3:1
- default, with no extra UCX env variables
Steps to Reproduce
Default case
- Command line
mpirun -np 2 --map-by ppr:1:node -x PATH -x LD_LIBRARY_PATH -H hpc-04:1,hpc-02:1 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 8.46
2 17.32
4 34.04
8 68.96
16 139.40
32 284.47
64 473.55
128 856.83
256 1480.08
512 2892.64
1024 5554.62
2048 9243.91
4096 16360.18
8192 25914.67
16384 35251.82
32768 61789.94
65536 68835.51
131072 83308.48
262144 90082.03
524288 91742.73
1048576 91501.88
2097152 91421.49
4194304 90710.37
Using UCX_NET_DEVICES=mlx5_3:1
mpirun -np 2 --map-by ppr:1:node -x PATH -x LD_LIBRARY_PATH -H hpc-04:1,hpc-02:1 -x UCX_NET_DEVICES=mlx5_3:1 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 8.82
2 18.07
4 36.88
8 71.19
16 144.49
32 284.73
64 464.56
128 846.72
256 1530.21
512 2991.00
1024 5977.21
2048 10069.56
4096 17430.70
8192 26506.61
16384 35462.24
32768 41263.31
65536 44711.44
131072 47028.48
262144 48088.30
524288 48895.34
1048576 49181.93
2097152 49328.65
4194304 49400.91
Using UCX_MAX_RNDV_RAILS=1
mpirun -np 2 --map-by ppr:1:node -x PATH -x LD_LIBRARY_PATH -H hpc-04:1,hpc-02:1 -x UCX_MAX_RNDV_RAILS=1 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 8.92
2 18.40
4 36.52
8 70.81
16 149.20
32 291.59
64 464.90
128 855.56
256 1497.61
512 2886.56
1024 6107.93
2048 9917.07
4096 17107.22
8192 26048.81
16384 35111.41
32768 41246.49
65536 43126.46
131072 47215.79
262144 48303.64
524288 48877.88
1048576 48830.47
2097152 49324.69
4194304 49400.70
- UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
ucx_info -v
# Library version: 1.18.0
# Library path: /opt/hpcx-v2.21.2-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx/mt/lib/libucs.so.0
# API headers version: 1.18.0
# Git branch '', revision 152bf42
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --with-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.6.1/redhat8 --with-gdrcopy --prefix=/build-result/hpcx-v2.21.2-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37/redhat7
- Any UCX environment variables used
Tested separately with default parameters (i.e. no UCX environment variables), with UCX_NET_DEVICES specified, and with UCX_MAX_RNDV_RAILS=1.
Setup and versions
- OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
AlmaLinux release 9.5 (Teal Serval)
NAME="AlmaLinux"
VERSION="9.5 (Teal Serval)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.5"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.5 (Teal Serval)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"
ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.5"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.5"
SUPPORT_END=2032-06-01
- For RDMA/IB/RoCE related issues:
- Driver version:
ofed_info -s
OFED-internal-25.01-0.6.0:
- HW information from ibstat or ibv_devinfo -vv command
ibstat
CA 'mlx5_0'
CA type: MT4129
Number of ports: 1
Firmware version: 28.44.1204
Hardware version: 0
Node GUID: 0xb83fd2030075786a
System image GUID: 0xb83fd2030075786a
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 13
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0xb83fd2030075786a
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4129
Number of ports: 1
Firmware version: 28.44.1204
Hardware version: 0
Node GUID: 0xb83fd2030085e31e
System image GUID: 0xb83fd2030085e31e
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 3
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0xb83fd2030085e31e
Link layer: InfiniBand
CA 'mlx5_2'
CA type: MT4129
Number of ports: 1
Firmware version: 28.44.1204
Hardware version: 0
Node GUID: 0xb83fd203008b8c5a
System image GUID: 0xb83fd203008b8c5a
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 11
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0xb83fd203008b8c5a
Link layer: InfiniBand
CA 'mlx5_3'
CA type: MT4129
Number of ports: 1
Firmware version: 28.44.1204
Hardware version: 0
Node GUID: 0xb83fd203008b8c58
System image GUID: 0xb83fd203008b8c58
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 10
LMC: 0
SM lid: 1
Capability mask: 0xa751e848
Port GUID: 0xb83fd203008b8c58
Link layer: InfiniBand
Additional information (depending on the issue)
- OpenMPI version
ompi_info
Package: Open MPI root@hpc-kernel-03 Distribution
Open MPI: 4.1.7rc1
- Output of ucx_info -d to show transports and devices recognized by UCX
ucx_info -d
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# rkey_ptr is supported
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: self
# Device: memory
# Type: loopback
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 19360.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: tcp
# Device: ens1f1
# Type: network
# System device: ens1f1 (0)
#
# capabilities:
# bandwidth: 11.32/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: lo
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: sysv
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: posix
# Component: posix
# allocate: <= 262919740K
# remote key: 24 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: posix
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: mlx5_0
# Component: ib
# allocate: <= 256K
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache), rdma (alloc,cache)
#
# Transport: rc_verbs
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (1)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 3 bytes
# ep address: 7 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (1)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 3992
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: dc_mlx5
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (1)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 660 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: rc_mlx5
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (1)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 3 bytes
# ep address: 10 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (1)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 132
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_1
# Component: ib
# allocate: <= 256K
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache), rdma (alloc,cache)
#
# Transport: rc_verbs
# Device: mlx5_1:1
# Type: network
# System device: mlx5_1 (2)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 3 bytes
# ep address: 7 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_1:1
# Type: network
# System device: mlx5_1 (2)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 3992
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: dc_mlx5
# Device: mlx5_1:1
# Type: network
# System device: mlx5_1 (2)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 660 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: rc_mlx5
# Device: mlx5_1:1
# Type: network
# System device: mlx5_1 (2)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 3 bytes
# ep address: 10 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_1:1
# Type: network
# System device: mlx5_1 (2)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 132
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_2
# Component: ib
# allocate: <= 256K
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache), rdma (alloc,cache)
#
# Transport: rc_verbs
# Device: mlx5_2:1
# Type: network
# System device: mlx5_2 (3)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 3 bytes
# ep address: 7 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_2:1
# Type: network
# System device: mlx5_2 (3)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 3992
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: dc_mlx5
# Device: mlx5_2:1
# Type: network
# System device: mlx5_2 (3)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 660 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: rc_mlx5
# Device: mlx5_2:1
# Type: network
# System device: mlx5_2 (3)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 3 bytes
# ep address: 10 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_2:1
# Type: network
# System device: mlx5_2 (3)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 132
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_3
# Component: ib
# allocate: <= 256K
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache), rdma (alloc,cache)
#
# Transport: rc_verbs
# Device: mlx5_3:1
# Type: network
# System device: mlx5_3 (4)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 3 bytes
# ep address: 7 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_3:1
# Type: network
# System device: mlx5_3 (4)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 3992
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: dc_mlx5
# Device: mlx5_3:1
# Type: network
# System device: mlx5_3 (4)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 660 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: rc_mlx5
# Device: mlx5_3:1
# Type: network
# System device: mlx5_3 (4)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 3 bytes
# ep address: 10 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_3:1
# Type: network
# System device: mlx5_3 (4)
#
# capabilities:
# bandwidth: 22873.66/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 132
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_0
# Component: gga
# allocate: <= 256K
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache), rdma (alloc,cache)
# < no supported devices found >
#
# Memory domain: mlx5_1
# Component: gga
# allocate: <= 256K
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache), rdma (alloc,cache)
# < no supported devices found >
#
# Memory domain: mlx5_2
# Component: gga
# allocate: <= 256K
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache), rdma (alloc,cache)
# < no supported devices found >
#
# Memory domain: mlx5_3
# Component: gga
# allocate: <= 256K
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache), rdma (alloc,cache)
# < no supported devices found >
#
# Connection manager: rdmacm
# max_conn_priv: 54 bytes
#
# Memory domain: cma
# Component: cma
# register: unlimited, cost: 9 nsec
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: cma
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
- Configure result - config.log N/A
- Log file - configure UCX with "--enable-logging" and run with "UCX_LOG_LEVEL=data": Attached
@yosefe for review and consideration. Many thanks in advance
Hi @arstgr
I'm not sure I understand which degradation you refer to. As you described, running osu_bibw with multiple NICs resulted in better performance compared to 1 NIC (50 GB/s => 90 GB/s). Crossing sockets may be the cause of not reaching full wire speed, but this is expected when running a simple point-to-point benchmark.
Can you please explain what your expected performance is in this scenario?
Hi @shasson5 Thank you for looking into this, we highly appreciate it.
We see a performance degradation (i.e. low scaling efficiency) when running MPI jobs. Each node of the system has 4 sockets and 4 NICs, with each NIC assigned to a single socket. When we try to troubleshoot, we see that limiting the number of rndv rails to 1 resolves the issue. We suspect each MPI rank is using more than 1 NIC, which is costly on this system because crossing a socket incurs high latency.
We tried to investigate this issue further using the OSU benchmarks. When running a p2p test like osu_bibw with only 1 rank per node, we see:
a) the bandwidth is higher than the theoretical limit for 1 NIC (we suspect this points to multiple NICs being used)
b) the UCX logs suggest QPs are attached to multiple NICs
c) limiting the MPI rank to use only 1 NIC resolves the issue, reducing the bandwidth to the 1-NIC limit
d) since crossing sockets incurs high latency, performance suffers when multiple NICs are used per MPI rank
We need help understanding the source of this issue further, and finding a mitigation that works even for applications that ship their own MPI wrappers.
Our current solution is to limit the number of rndv rails to 1 and to assign MPI ranks to their own NICs at launch time, as sketched below.
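A minimal sketch of that launch-time assignment (a hypothetical wrapper; it assumes Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK, that ranks fill sockets in order at 96 ranks per socket, and uses this system's skewed socket-to-NIC order):
#!/bin/bash
# bind_nic.sh (hypothetical sketch): give each rank the NIC of its own socket.
devices=(mlx5_3 mlx5_0 mlx5_1 mlx5_2)          # socket 0..3 -> NIC on our nodes
socket=$(( OMPI_COMM_WORLD_LOCAL_RANK / 96 ))  # 96 cores/ranks per socket
export UCX_NET_DEVICES="${devices[$socket]}:1"
exec "$@"
launched e.g. as mpirun ... -x UCX_MAX_RNDV_RAILS=1 ./bind_nic.sh ./app, so each rank stays on its socket-local NIC.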
Hi @arstgr Thanks for the detailed information.
Can you please try to simultaneously run the following commands on some node:
taskset -c 0 ib_write_bw -d mlx5_0 -b --run_infinitely -F -p 18514
taskset -c 1 ib_write_bw -d mlx5_1 -b --run_infinitely -F -p 18515
and the following commands on another node:
taskset -c 0 ib_write_bw -d mlx5_0 -b --run_infinitely -F -p 18514 <node1>
taskset -c 1 ib_write_bw -d mlx5_1 -b --run_infinitely -F -p 18515 <node1>
where <node1> should be replaced with the IP/hostname of the first node.
You can download ib_write_bw from https://github.com/linux-rdma/perftest if needed.
Also, can you share CPU arch information?
Hi @shasson5
Thank you so much for looking into this issue. This is a quad-socket system, with each socket having 96 x86 cores and 1 IB NIC assigned to it. The order is a bit skewed:
socket 0, cores 0-95 -> mlx5_3
socket 1, cores 96-191 -> mlx5_0
socket 2, cores 192-287 -> mlx5_1
socket 3, cores 288-383 -> mlx5_2
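As a cross-check, each HCA reports its NUMA attachment in sysfs (on this system NUMA nodes 0-3 belong to socket 0, 4-7 to socket 1, and so on; see the numactl output further below):
for d in /sys/class/infiniband/mlx5_*; do
  echo "$(basename "$d") -> NUMA node $(cat "$d"/device/numa_node)"
done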
I ran 2 sets of tests (I wasn't sure what the intention was, so I tried to cover it as broadly as possible). In the first set, the ib_write_bw test was pinned to core 0 (on socket 0, using mlx5_3) and core 96 (on socket 1, using mlx5_0); here are the results:
WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
RDMA_Write Bidirectional BW Test
Dual-port : OFF Device : mlx5_3
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x17 QPN 0x00b2 PSN 0x5dcd59 RKey 0x1fff0c VAddr 0x0014f8d6631000
remote address: LID 0x0e QPN 0x5f58 PSN 0x6c8156 RKey 0x00b1e9 VAddr 0x0014e31a47a000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 1889032 0.00 47074.70 0.753195
65536 1883135 0.00 47080.23 0.753284
65536 1883139 0.00 47080.30 0.753285
65536 1883111 0.00 47079.79 0.753277
65536 1883108 0.00 47080.02 0.753280
and
WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
RDMA_Write Bidirectional BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x0b QPN 0x00b2 PSN 0xd76860 RKey 0x1fff0c VAddr 0x00149901ed9000
remote address: LID 0x11 QPN 0x5fc5 PSN 0xa55ff8 RKey 0x0213f4 VAddr 0x00153e55c24000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 1887749 0.00 47080.65 0.753290
65536 1883341 0.00 47085.11 0.753362
65536 1883338 0.00 47084.90 0.753358
65536 1883334 0.00 47084.75 0.753356
65536 1883314 0.00 47084.31 0.753349
In the second set, the ib_write_bw test was pinned to core 0 (on socket 0, using mlx5_3) and core 1 (also on socket 0, but using mlx5_0); here are the results:
WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
RDMA_Write Bidirectional BW Test
Dual-port : OFF Device : mlx5_3
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x17 QPN 0x00b3 PSN 0x7aa19c RKey 0x1fff00 VAddr 0x001502ec3ad000
remote address: LID 0x0e QPN 0x5f59 PSN 0x260c1 RKey 0x00b1ea VAddr 0x0014649fc55000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 1887500 0.00 47074.11 0.753186
65536 1882996 0.00 47076.18 0.753219
65536 1883009 0.00 47077.02 0.753232
65536 1883026 0.00 47077.14 0.753234
65536 1883015 0.00 47076.85 0.753230
and
WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
RDMA_Write Bidirectional BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x0b QPN 0x00b3 PSN 0x15669 RKey 0x1fff00 VAddr 0x0014b7a76a7000
remote address: LID 0x11 QPN 0x5fc6 PSN 0xdbbfac RKey 0x0213f5 VAddr 0x0014708766d000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 1795328 0.00 44703.13 0.715250
65536 1791620 0.00 44791.44 0.716663
65536 1793567 0.00 44840.12 0.717442
65536 1793313 0.00 44834.22 0.717347
65536 1793457 0.00 44837.83 0.717405
Interestingly, the run where core 1 on socket 0 uses mlx5_0 (attached to socket 1) shows lower BW.
We really appreciate your help. Please let us know if there is any additional info that you need. Thank you very much
Hi @arstgr
Thanks for running the tests; indeed, the second set is the relevant one. Can you please also run the following commands:
- numactl -H
- cat /proc/cpuinfo
- lscpu
- lstopo (installed from package hwloc)
Also, can you share the original MPI command you used and the results? If this is a custom MPI application, what is the traffic pattern used (one to many, many to many, multiple pairs)? Please add the --display-map flag to the MPI command that you run, so the output is more verbose for debugging.
Thanks
@arstgr To reduce latency in CPU-involved point-to-point transfers, we have implemented the changes in https://github.com/openucx/ucx/pull/9408. Please give it a try.
Hi @shasson5 Thank you very much for your help.
Here is the NUMA config
numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 30755 MB
node 0 free: 30035 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 32247 MB
node 1 free: 31891 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 32247 MB
node 2 free: 31925 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 32247 MB
node 3 free: 31958 MB
node 4 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 4 size: 31957 MB
node 4 free: 31325 MB
node 5 cpus: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 5 size: 32247 MB
node 5 free: 31827 MB
node 6 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
node 6 size: 32247 MB
node 6 free: 31916 MB
node 7 cpus: 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 7 size: 32247 MB
node 7 free: 31950 MB
node 8 cpus: 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
node 8 size: 31957 MB
node 8 free: 31423 MB
node 9 cpus: 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 9 size: 32247 MB
node 9 free: 31894 MB
node 10 cpus: 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263
node 10 size: 32247 MB
node 10 free: 31907 MB
node 11 cpus: 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287
node 11 size: 32205 MB
node 11 free: 31897 MB
node 12 cpus: 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311
node 12 size: 31957 MB
node 12 free: 31279 MB
node 13 cpus: 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335
node 13 size: 32247 MB
node 13 free: 31891 MB
node 14 cpus: 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359
node 14 size: 32247 MB
node 14 free: 31907 MB
node 15 cpus: 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383
node 15 size: 32223 MB
node 15 free: 31863 MB
node distances:
node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0: 10 12 12 12 32 32 32 32 32 32 32 32 32 32 32 32
1: 12 10 12 12 32 32 32 32 32 32 32 32 32 32 32 32
2: 12 12 10 12 32 32 32 32 32 32 32 32 32 32 32 32
3: 12 12 12 10 32 32 32 32 32 32 32 32 32 32 32 32
4: 32 32 32 32 10 12 12 12 32 32 32 32 32 32 32 32
5: 32 32 32 32 12 10 12 12 32 32 32 32 32 32 32 32
6: 32 32 32 32 12 12 10 12 32 32 32 32 32 32 32 32
7: 32 32 32 32 12 12 12 10 32 32 32 32 32 32 32 32
8: 32 32 32 32 32 32 32 32 10 12 12 12 32 32 32 32
9: 32 32 32 32 32 32 32 32 12 10 12 12 32 32 32 32
10: 32 32 32 32 32 32 32 32 12 12 10 12 32 32 32 32
11: 32 32 32 32 32 32 32 32 12 12 12 10 32 32 32 32
12: 32 32 32 32 32 32 32 32 32 32 32 32 10 12 12 12
13: 32 32 32 32 32 32 32 32 32 32 32 32 12 10 12 12
14: 32 32 32 32 32 32 32 32 32 32 32 32 12 12 10 12
15: 32 32 32 32 32 32 32 32 32 32 32 32 12 12 12 10
and lstopo
lstopo-no-graphics
Machine (502GB total)
Package L#0
Group0 L#0
NUMANode L#0 (P#0 30GB)
L3 L#0 (32MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L3 L#1 (32MB)
L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
L3 L#2 (32MB)
L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
HostBridge
PCIBridge
PCI 01:00.0 (InfiniBand)
Net "ibp1s0"
OpenFabrics "mlx5_3"
Group0 L#1
NUMANode L#1 (P#1 31GB)
L3 L#3 (32MB)
L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
L3 L#4 (32MB)
L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
L3 L#5 (32MB)
L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
HostBridge
PCIBridge
PCIBridge
PCI 12:00.0 (VGA)
PCIBridge
PCI 13:00.0 (NVMExp)
Block(Disk) "nvme0n1"
PCIBridge
PCI 15:00.0 (NVMExp)
Block(Disk) "nvme5n1"
Group0 L#2
NUMANode L#2 (P#2 31GB)
L3 L#6 (32MB)
L2 L#48 (1024KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48 + PU L#48 (P#48)
L2 L#49 (1024KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49 + PU L#49 (P#49)
L2 L#50 (1024KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50 + PU L#50 (P#50)
L2 L#51 (1024KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51 + PU L#51 (P#51)
L2 L#52 (1024KB) + L1d L#52 (32KB) + L1i L#52 (32KB) + Core L#52 + PU L#52 (P#52)
L2 L#53 (1024KB) + L1d L#53 (32KB) + L1i L#53 (32KB) + Core L#53 + PU L#53 (P#53)
L2 L#54 (1024KB) + L1d L#54 (32KB) + L1i L#54 (32KB) + Core L#54 + PU L#54 (P#54)
L2 L#55 (1024KB) + L1d L#55 (32KB) + L1i L#55 (32KB) + Core L#55 + PU L#55 (P#55)
L3 L#7 (32MB)
L2 L#56 (1024KB) + L1d L#56 (32KB) + L1i L#56 (32KB) + Core L#56 + PU L#56 (P#56)
L2 L#57 (1024KB) + L1d L#57 (32KB) + L1i L#57 (32KB) + Core L#57 + PU L#57 (P#57)
L2 L#58 (1024KB) + L1d L#58 (32KB) + L1i L#58 (32KB) + Core L#58 + PU L#58 (P#58)
L2 L#59 (1024KB) + L1d L#59 (32KB) + L1i L#59 (32KB) + Core L#59 + PU L#59 (P#59)
L2 L#60 (1024KB) + L1d L#60 (32KB) + L1i L#60 (32KB) + Core L#60 + PU L#60 (P#60)
L2 L#61 (1024KB) + L1d L#61 (32KB) + L1i L#61 (32KB) + Core L#61 + PU L#61 (P#61)
L2 L#62 (1024KB) + L1d L#62 (32KB) + L1i L#62 (32KB) + Core L#62 + PU L#62 (P#62)
L2 L#63 (1024KB) + L1d L#63 (32KB) + L1i L#63 (32KB) + Core L#63 + PU L#63 (P#63)
L3 L#8 (32MB)
L2 L#64 (1024KB) + L1d L#64 (32KB) + L1i L#64 (32KB) + Core L#64 + PU L#64 (P#64)
L2 L#65 (1024KB) + L1d L#65 (32KB) + L1i L#65 (32KB) + Core L#65 + PU L#65 (P#65)
L2 L#66 (1024KB) + L1d L#66 (32KB) + L1i L#66 (32KB) + Core L#66 + PU L#66 (P#66)
L2 L#67 (1024KB) + L1d L#67 (32KB) + L1i L#67 (32KB) + Core L#67 + PU L#67 (P#67)
L2 L#68 (1024KB) + L1d L#68 (32KB) + L1i L#68 (32KB) + Core L#68 + PU L#68 (P#68)
L2 L#69 (1024KB) + L1d L#69 (32KB) + L1i L#69 (32KB) + Core L#69 + PU L#69 (P#69)
L2 L#70 (1024KB) + L1d L#70 (32KB) + L1i L#70 (32KB) + Core L#70 + PU L#70 (P#70)
L2 L#71 (1024KB) + L1d L#71 (32KB) + L1i L#71 (32KB) + Core L#71 + PU L#71 (P#71)
HostBridge
PCIBridge
PCI 21:00.0 (NVMExp)
Block(Disk) "nvme8n1"
Group0 L#3
NUMANode L#3 (P#3 31GB)
L3 L#9 (32MB)
L2 L#72 (1024KB) + L1d L#72 (32KB) + L1i L#72 (32KB) + Core L#72 + PU L#72 (P#72)
L2 L#73 (1024KB) + L1d L#73 (32KB) + L1i L#73 (32KB) + Core L#73 + PU L#73 (P#73)
L2 L#74 (1024KB) + L1d L#74 (32KB) + L1i L#74 (32KB) + Core L#74 + PU L#74 (P#74)
L2 L#75 (1024KB) + L1d L#75 (32KB) + L1i L#75 (32KB) + Core L#75 + PU L#75 (P#75)
L2 L#76 (1024KB) + L1d L#76 (32KB) + L1i L#76 (32KB) + Core L#76 + PU L#76 (P#76)
L2 L#77 (1024KB) + L1d L#77 (32KB) + L1i L#77 (32KB) + Core L#77 + PU L#77 (P#77)
L2 L#78 (1024KB) + L1d L#78 (32KB) + L1i L#78 (32KB) + Core L#78 + PU L#78 (P#78)
L2 L#79 (1024KB) + L1d L#79 (32KB) + L1i L#79 (32KB) + Core L#79 + PU L#79 (P#79)
L3 L#10 (32MB)
L2 L#80 (1024KB) + L1d L#80 (32KB) + L1i L#80 (32KB) + Core L#80 + PU L#80 (P#80)
L2 L#81 (1024KB) + L1d L#81 (32KB) + L1i L#81 (32KB) + Core L#81 + PU L#81 (P#81)
L2 L#82 (1024KB) + L1d L#82 (32KB) + L1i L#82 (32KB) + Core L#82 + PU L#82 (P#82)
L2 L#83 (1024KB) + L1d L#83 (32KB) + L1i L#83 (32KB) + Core L#83 + PU L#83 (P#83)
L2 L#84 (1024KB) + L1d L#84 (32KB) + L1i L#84 (32KB) + Core L#84 + PU L#84 (P#84)
L2 L#85 (1024KB) + L1d L#85 (32KB) + L1i L#85 (32KB) + Core L#85 + PU L#85 (P#85)
L2 L#86 (1024KB) + L1d L#86 (32KB) + L1i L#86 (32KB) + Core L#86 + PU L#86 (P#86)
L2 L#87 (1024KB) + L1d L#87 (32KB) + L1i L#87 (32KB) + Core L#87 + PU L#87 (P#87)
L3 L#11 (32MB)
L2 L#88 (1024KB) + L1d L#88 (32KB) + L1i L#88 (32KB) + Core L#88 + PU L#88 (P#88)
L2 L#89 (1024KB) + L1d L#89 (32KB) + L1i L#89 (32KB) + Core L#89 + PU L#89 (P#89)
L2 L#90 (1024KB) + L1d L#90 (32KB) + L1i L#90 (32KB) + Core L#90 + PU L#90 (P#90)
L2 L#91 (1024KB) + L1d L#91 (32KB) + L1i L#91 (32KB) + Core L#91 + PU L#91 (P#91)
L2 L#92 (1024KB) + L1d L#92 (32KB) + L1i L#92 (32KB) + Core L#92 + PU L#92 (P#92)
L2 L#93 (1024KB) + L1d L#93 (32KB) + L1i L#93 (32KB) + Core L#93 + PU L#93 (P#93)
L2 L#94 (1024KB) + L1d L#94 (32KB) + L1i L#94 (32KB) + Core L#94 + PU L#94 (P#94)
L2 L#95 (1024KB) + L1d L#95 (32KB) + L1i L#95 (32KB) + Core L#95 + PU L#95 (P#95)
HostBridge
PCIBridge
PCI 31:00.0 (NVMExp)
Block(Disk) "nvme2n1"
Package L#1
Group0 L#4
NUMANode L#4 (P#4 31GB)
L3 L#12 (32MB)
L2 L#96 (1024KB) + L1d L#96 (32KB) + L1i L#96 (32KB) + Core L#96 + PU L#96 (P#96)
L2 L#97 (1024KB) + L1d L#97 (32KB) + L1i L#97 (32KB) + Core L#97 + PU L#97 (P#97)
L2 L#98 (1024KB) + L1d L#98 (32KB) + L1i L#98 (32KB) + Core L#98 + PU L#98 (P#98)
L2 L#99 (1024KB) + L1d L#99 (32KB) + L1i L#99 (32KB) + Core L#99 + PU L#99 (P#99)
L2 L#100 (1024KB) + L1d L#100 (32KB) + L1i L#100 (32KB) + Core L#100 + PU L#100 (P#100)
L2 L#101 (1024KB) + L1d L#101 (32KB) + L1i L#101 (32KB) + Core L#101 + PU L#101 (P#101)
L2 L#102 (1024KB) + L1d L#102 (32KB) + L1i L#102 (32KB) + Core L#102 + PU L#102 (P#102)
L2 L#103 (1024KB) + L1d L#103 (32KB) + L1i L#103 (32KB) + Core L#103 + PU L#103 (P#103)
L3 L#13 (32MB)
L2 L#104 (1024KB) + L1d L#104 (32KB) + L1i L#104 (32KB) + Core L#104 + PU L#104 (P#104)
L2 L#105 (1024KB) + L1d L#105 (32KB) + L1i L#105 (32KB) + Core L#105 + PU L#105 (P#105)
L2 L#106 (1024KB) + L1d L#106 (32KB) + L1i L#106 (32KB) + Core L#106 + PU L#106 (P#106)
L2 L#107 (1024KB) + L1d L#107 (32KB) + L1i L#107 (32KB) + Core L#107 + PU L#107 (P#107)
L2 L#108 (1024KB) + L1d L#108 (32KB) + L1i L#108 (32KB) + Core L#108 + PU L#108 (P#108)
L2 L#109 (1024KB) + L1d L#109 (32KB) + L1i L#109 (32KB) + Core L#109 + PU L#109 (P#109)
L2 L#110 (1024KB) + L1d L#110 (32KB) + L1i L#110 (32KB) + Core L#110 + PU L#110 (P#110)
L2 L#111 (1024KB) + L1d L#111 (32KB) + L1i L#111 (32KB) + Core L#111 + PU L#111 (P#111)
L3 L#14 (32MB)
L2 L#112 (1024KB) + L1d L#112 (32KB) + L1i L#112 (32KB) + Core L#112 + PU L#112 (P#112)
L2 L#113 (1024KB) + L1d L#113 (32KB) + L1i L#113 (32KB) + Core L#113 + PU L#113 (P#113)
L2 L#114 (1024KB) + L1d L#114 (32KB) + L1i L#114 (32KB) + Core L#114 + PU L#114 (P#114)
L2 L#115 (1024KB) + L1d L#115 (32KB) + L1i L#115 (32KB) + Core L#115 + PU L#115 (P#115)
L2 L#116 (1024KB) + L1d L#116 (32KB) + L1i L#116 (32KB) + Core L#116 + PU L#116 (P#116)
L2 L#117 (1024KB) + L1d L#117 (32KB) + L1i L#117 (32KB) + Core L#117 + PU L#117 (P#117)
L2 L#118 (1024KB) + L1d L#118 (32KB) + L1i L#118 (32KB) + Core L#118 + PU L#118 (P#118)
L2 L#119 (1024KB) + L1d L#119 (32KB) + L1i L#119 (32KB) + Core L#119 + PU L#119 (P#119)
HostBridge
PCIBridge
PCI 41:00.0 (InfiniBand)
Net "ibp65s0"
OpenFabrics "mlx5_0"
Group0 L#5
NUMANode L#5 (P#5 31GB)
L3 L#15 (32MB)
L2 L#120 (1024KB) + L1d L#120 (32KB) + L1i L#120 (32KB) + Core L#120 + PU L#120 (P#120)
L2 L#121 (1024KB) + L1d L#121 (32KB) + L1i L#121 (32KB) + Core L#121 + PU L#121 (P#121)
L2 L#122 (1024KB) + L1d L#122 (32KB) + L1i L#122 (32KB) + Core L#122 + PU L#122 (P#122)
L2 L#123 (1024KB) + L1d L#123 (32KB) + L1i L#123 (32KB) + Core L#123 + PU L#123 (P#123)
L2 L#124 (1024KB) + L1d L#124 (32KB) + L1i L#124 (32KB) + Core L#124 + PU L#124 (P#124)
L2 L#125 (1024KB) + L1d L#125 (32KB) + L1i L#125 (32KB) + Core L#125 + PU L#125 (P#125)
L2 L#126 (1024KB) + L1d L#126 (32KB) + L1i L#126 (32KB) + Core L#126 + PU L#126 (P#126)
L2 L#127 (1024KB) + L1d L#127 (32KB) + L1i L#127 (32KB) + Core L#127 + PU L#127 (P#127)
L3 L#16 (32MB)
L2 L#128 (1024KB) + L1d L#128 (32KB) + L1i L#128 (32KB) + Core L#128 + PU L#128 (P#128)
L2 L#129 (1024KB) + L1d L#129 (32KB) + L1i L#129 (32KB) + Core L#129 + PU L#129 (P#129)
L2 L#130 (1024KB) + L1d L#130 (32KB) + L1i L#130 (32KB) + Core L#130 + PU L#130 (P#130)
L2 L#131 (1024KB) + L1d L#131 (32KB) + L1i L#131 (32KB) + Core L#131 + PU L#131 (P#131)
L2 L#132 (1024KB) + L1d L#132 (32KB) + L1i L#132 (32KB) + Core L#132 + PU L#132 (P#132)
L2 L#133 (1024KB) + L1d L#133 (32KB) + L1i L#133 (32KB) + Core L#133 + PU L#133 (P#133)
L2 L#134 (1024KB) + L1d L#134 (32KB) + L1i L#134 (32KB) + Core L#134 + PU L#134 (P#134)
L2 L#135 (1024KB) + L1d L#135 (32KB) + L1i L#135 (32KB) + Core L#135 + PU L#135 (P#135)
L3 L#17 (32MB)
L2 L#136 (1024KB) + L1d L#136 (32KB) + L1i L#136 (32KB) + Core L#136 + PU L#136 (P#136)
L2 L#137 (1024KB) + L1d L#137 (32KB) + L1i L#137 (32KB) + Core L#137 + PU L#137 (P#137)
L2 L#138 (1024KB) + L1d L#138 (32KB) + L1i L#138 (32KB) + Core L#138 + PU L#138 (P#138)
L2 L#139 (1024KB) + L1d L#139 (32KB) + L1i L#139 (32KB) + Core L#139 + PU L#139 (P#139)
L2 L#140 (1024KB) + L1d L#140 (32KB) + L1i L#140 (32KB) + Core L#140 + PU L#140 (P#140)
L2 L#141 (1024KB) + L1d L#141 (32KB) + L1i L#141 (32KB) + Core L#141 + PU L#141 (P#141)
L2 L#142 (1024KB) + L1d L#142 (32KB) + L1i L#142 (32KB) + Core L#142 + PU L#142 (P#142)
L2 L#143 (1024KB) + L1d L#143 (32KB) + L1i L#143 (32KB) + Core L#143 + PU L#143 (P#143)
HostBridge
PCIBridge
PCI 51:00.0 (SCSI)
PCI 51:00.1 (Ethernet)
Net "ens1f1"
PCI 51:00.2 (SCSI)
Group0 L#6
NUMANode L#6 (P#6 31GB)
L3 L#18 (32MB)
L2 L#144 (1024KB) + L1d L#144 (32KB) + L1i L#144 (32KB) + Core L#144 + PU L#144 (P#144)
L2 L#145 (1024KB) + L1d L#145 (32KB) + L1i L#145 (32KB) + Core L#145 + PU L#145 (P#145)
L2 L#146 (1024KB) + L1d L#146 (32KB) + L1i L#146 (32KB) + Core L#146 + PU L#146 (P#146)
L2 L#147 (1024KB) + L1d L#147 (32KB) + L1i L#147 (32KB) + Core L#147 + PU L#147 (P#147)
L2 L#148 (1024KB) + L1d L#148 (32KB) + L1i L#148 (32KB) + Core L#148 + PU L#148 (P#148)
L2 L#149 (1024KB) + L1d L#149 (32KB) + L1i L#149 (32KB) + Core L#149 + PU L#149 (P#149)
L2 L#150 (1024KB) + L1d L#150 (32KB) + L1i L#150 (32KB) + Core L#150 + PU L#150 (P#150)
L2 L#151 (1024KB) + L1d L#151 (32KB) + L1i L#151 (32KB) + Core L#151 + PU L#151 (P#151)
L3 L#19 (32MB)
L2 L#152 (1024KB) + L1d L#152 (32KB) + L1i L#152 (32KB) + Core L#152 + PU L#152 (P#152)
L2 L#153 (1024KB) + L1d L#153 (32KB) + L1i L#153 (32KB) + Core L#153 + PU L#153 (P#153)
L2 L#154 (1024KB) + L1d L#154 (32KB) + L1i L#154 (32KB) + Core L#154 + PU L#154 (P#154)
L2 L#155 (1024KB) + L1d L#155 (32KB) + L1i L#155 (32KB) + Core L#155 + PU L#155 (P#155)
L2 L#156 (1024KB) + L1d L#156 (32KB) + L1i L#156 (32KB) + Core L#156 + PU L#156 (P#156)
L2 L#157 (1024KB) + L1d L#157 (32KB) + L1i L#157 (32KB) + Core L#157 + PU L#157 (P#157)
L2 L#158 (1024KB) + L1d L#158 (32KB) + L1i L#158 (32KB) + Core L#158 + PU L#158 (P#158)
L2 L#159 (1024KB) + L1d L#159 (32KB) + L1i L#159 (32KB) + Core L#159 + PU L#159 (P#159)
L3 L#20 (32MB)
L2 L#160 (1024KB) + L1d L#160 (32KB) + L1i L#160 (32KB) + Core L#160 + PU L#160 (P#160)
L2 L#161 (1024KB) + L1d L#161 (32KB) + L1i L#161 (32KB) + Core L#161 + PU L#161 (P#161)
L2 L#162 (1024KB) + L1d L#162 (32KB) + L1i L#162 (32KB) + Core L#162 + PU L#162 (P#162)
L2 L#163 (1024KB) + L1d L#163 (32KB) + L1i L#163 (32KB) + Core L#163 + PU L#163 (P#163)
L2 L#164 (1024KB) + L1d L#164 (32KB) + L1i L#164 (32KB) + Core L#164 + PU L#164 (P#164)
L2 L#165 (1024KB) + L1d L#165 (32KB) + L1i L#165 (32KB) + Core L#165 + PU L#165 (P#165)
L2 L#166 (1024KB) + L1d L#166 (32KB) + L1i L#166 (32KB) + Core L#166 + PU L#166 (P#166)
L2 L#167 (1024KB) + L1d L#167 (32KB) + L1i L#167 (32KB) + Core L#167 + PU L#167 (P#167)
HostBridge
PCIBridge
PCI 61:00.0 (NVMExp)
Block(Disk) "nvme1n1"
Group0 L#7
NUMANode L#7 (P#7 31GB)
L3 L#21 (32MB)
L2 L#168 (1024KB) + L1d L#168 (32KB) + L1i L#168 (32KB) + Core L#168 + PU L#168 (P#168)
L2 L#169 (1024KB) + L1d L#169 (32KB) + L1i L#169 (32KB) + Core L#169 + PU L#169 (P#169)
L2 L#170 (1024KB) + L1d L#170 (32KB) + L1i L#170 (32KB) + Core L#170 + PU L#170 (P#170)
L2 L#171 (1024KB) + L1d L#171 (32KB) + L1i L#171 (32KB) + Core L#171 + PU L#171 (P#171)
L2 L#172 (1024KB) + L1d L#172 (32KB) + L1i L#172 (32KB) + Core L#172 + PU L#172 (P#172)
L2 L#173 (1024KB) + L1d L#173 (32KB) + L1i L#173 (32KB) + Core L#173 + PU L#173 (P#173)
L2 L#174 (1024KB) + L1d L#174 (32KB) + L1i L#174 (32KB) + Core L#174 + PU L#174 (P#174)
L2 L#175 (1024KB) + L1d L#175 (32KB) + L1i L#175 (32KB) + Core L#175 + PU L#175 (P#175)
L3 L#22 (32MB)
L2 L#176 (1024KB) + L1d L#176 (32KB) + L1i L#176 (32KB) + Core L#176 + PU L#176 (P#176)
L2 L#177 (1024KB) + L1d L#177 (32KB) + L1i L#177 (32KB) + Core L#177 + PU L#177 (P#177)
L2 L#178 (1024KB) + L1d L#178 (32KB) + L1i L#178 (32KB) + Core L#178 + PU L#178 (P#178)
L2 L#179 (1024KB) + L1d L#179 (32KB) + L1i L#179 (32KB) + Core L#179 + PU L#179 (P#179)
L2 L#180 (1024KB) + L1d L#180 (32KB) + L1i L#180 (32KB) + Core L#180 + PU L#180 (P#180)
L2 L#181 (1024KB) + L1d L#181 (32KB) + L1i L#181 (32KB) + Core L#181 + PU L#181 (P#181)
L2 L#182 (1024KB) + L1d L#182 (32KB) + L1i L#182 (32KB) + Core L#182 + PU L#182 (P#182)
L2 L#183 (1024KB) + L1d L#183 (32KB) + L1i L#183 (32KB) + Core L#183 + PU L#183 (P#183)
L3 L#23 (32MB)
L2 L#184 (1024KB) + L1d L#184 (32KB) + L1i L#184 (32KB) + Core L#184 + PU L#184 (P#184)
L2 L#185 (1024KB) + L1d L#185 (32KB) + L1i L#185 (32KB) + Core L#185 + PU L#185 (P#185)
L2 L#186 (1024KB) + L1d L#186 (32KB) + L1i L#186 (32KB) + Core L#186 + PU L#186 (P#186)
L2 L#187 (1024KB) + L1d L#187 (32KB) + L1i L#187 (32KB) + Core L#187 + PU L#187 (P#187)
L2 L#188 (1024KB) + L1d L#188 (32KB) + L1i L#188 (32KB) + Core L#188 + PU L#188 (P#188)
L2 L#189 (1024KB) + L1d L#189 (32KB) + L1i L#189 (32KB) + Core L#189 + PU L#189 (P#189)
L2 L#190 (1024KB) + L1d L#190 (32KB) + L1i L#190 (32KB) + Core L#190 + PU L#190 (P#190)
L2 L#191 (1024KB) + L1d L#191 (32KB) + L1i L#191 (32KB) + Core L#191 + PU L#191 (P#191)
HostBridge
PCIBridge
PCI 71:00.0 (NVMExp)
Block(Disk) "nvme4n1"
Package L#2
Group0 L#8
NUMANode L#8 (P#8 31GB)
L3 L#24 (32MB)
L2 L#192 (1024KB) + L1d L#192 (32KB) + L1i L#192 (32KB) + Core L#192 + PU L#192 (P#192)
L2 L#193 (1024KB) + L1d L#193 (32KB) + L1i L#193 (32KB) + Core L#193 + PU L#193 (P#193)
L2 L#194 (1024KB) + L1d L#194 (32KB) + L1i L#194 (32KB) + Core L#194 + PU L#194 (P#194)
L2 L#195 (1024KB) + L1d L#195 (32KB) + L1i L#195 (32KB) + Core L#195 + PU L#195 (P#195)
L2 L#196 (1024KB) + L1d L#196 (32KB) + L1i L#196 (32KB) + Core L#196 + PU L#196 (P#196)
L2 L#197 (1024KB) + L1d L#197 (32KB) + L1i L#197 (32KB) + Core L#197 + PU L#197 (P#197)
L2 L#198 (1024KB) + L1d L#198 (32KB) + L1i L#198 (32KB) + Core L#198 + PU L#198 (P#198)
L2 L#199 (1024KB) + L1d L#199 (32KB) + L1i L#199 (32KB) + Core L#199 + PU L#199 (P#199)
L3 L#25 (32MB)
L2 L#200 (1024KB) + L1d L#200 (32KB) + L1i L#200 (32KB) + Core L#200 + PU L#200 (P#200)
L2 L#201 (1024KB) + L1d L#201 (32KB) + L1i L#201 (32KB) + Core L#201 + PU L#201 (P#201)
L2 L#202 (1024KB) + L1d L#202 (32KB) + L1i L#202 (32KB) + Core L#202 + PU L#202 (P#202)
L2 L#203 (1024KB) + L1d L#203 (32KB) + L1i L#203 (32KB) + Core L#203 + PU L#203 (P#203)
L2 L#204 (1024KB) + L1d L#204 (32KB) + L1i L#204 (32KB) + Core L#204 + PU L#204 (P#204)
L2 L#205 (1024KB) + L1d L#205 (32KB) + L1i L#205 (32KB) + Core L#205 + PU L#205 (P#205)
L2 L#206 (1024KB) + L1d L#206 (32KB) + L1i L#206 (32KB) + Core L#206 + PU L#206 (P#206)
L2 L#207 (1024KB) + L1d L#207 (32KB) + L1i L#207 (32KB) + Core L#207 + PU L#207 (P#207)
L3 L#26 (32MB)
L2 L#208 (1024KB) + L1d L#208 (32KB) + L1i L#208 (32KB) + Core L#208 + PU L#208 (P#208)
L2 L#209 (1024KB) + L1d L#209 (32KB) + L1i L#209 (32KB) + Core L#209 + PU L#209 (P#209)
L2 L#210 (1024KB) + L1d L#210 (32KB) + L1i L#210 (32KB) + Core L#210 + PU L#210 (P#210)
L2 L#211 (1024KB) + L1d L#211 (32KB) + L1i L#211 (32KB) + Core L#211 + PU L#211 (P#211)
L2 L#212 (1024KB) + L1d L#212 (32KB) + L1i L#212 (32KB) + Core L#212 + PU L#212 (P#212)
L2 L#213 (1024KB) + L1d L#213 (32KB) + L1i L#213 (32KB) + Core L#213 + PU L#213 (P#213)
L2 L#214 (1024KB) + L1d L#214 (32KB) + L1i L#214 (32KB) + Core L#214 + PU L#214 (P#214)
L2 L#215 (1024KB) + L1d L#215 (32KB) + L1i L#215 (32KB) + Core L#215 + PU L#215 (P#215)
HostBridge
PCIBridge
PCI 81:00.0 (InfiniBand)
Net "ibp129s0"
OpenFabrics "mlx5_1"
Group0 L#9
NUMANode L#9 (P#9 31GB)
L3 L#27 (32MB)
L2 L#216 (1024KB) + L1d L#216 (32KB) + L1i L#216 (32KB) + Core L#216 + PU L#216 (P#216)
L2 L#217 (1024KB) + L1d L#217 (32KB) + L1i L#217 (32KB) + Core L#217 + PU L#217 (P#217)
L2 L#218 (1024KB) + L1d L#218 (32KB) + L1i L#218 (32KB) + Core L#218 + PU L#218 (P#218)
L2 L#219 (1024KB) + L1d L#219 (32KB) + L1i L#219 (32KB) + Core L#219 + PU L#219 (P#219)
L2 L#220 (1024KB) + L1d L#220 (32KB) + L1i L#220 (32KB) + Core L#220 + PU L#220 (P#220)
L2 L#221 (1024KB) + L1d L#221 (32KB) + L1i L#221 (32KB) + Core L#221 + PU L#221 (P#221)
L2 L#222 (1024KB) + L1d L#222 (32KB) + L1i L#222 (32KB) + Core L#222 + PU L#222 (P#222)
L2 L#223 (1024KB) + L1d L#223 (32KB) + L1i L#223 (32KB) + Core L#223 + PU L#223 (P#223)
L3 L#28 (32MB)
L2 L#224 (1024KB) + L1d L#224 (32KB) + L1i L#224 (32KB) + Core L#224 + PU L#224 (P#224)
L2 L#225 (1024KB) + L1d L#225 (32KB) + L1i L#225 (32KB) + Core L#225 + PU L#225 (P#225)
L2 L#226 (1024KB) + L1d L#226 (32KB) + L1i L#226 (32KB) + Core L#226 + PU L#226 (P#226)
L2 L#227 (1024KB) + L1d L#227 (32KB) + L1i L#227 (32KB) + Core L#227 + PU L#227 (P#227)
L2 L#228 (1024KB) + L1d L#228 (32KB) + L1i L#228 (32KB) + Core L#228 + PU L#228 (P#228)
L2 L#229 (1024KB) + L1d L#229 (32KB) + L1i L#229 (32KB) + Core L#229 + PU L#229 (P#229)
L2 L#230 (1024KB) + L1d L#230 (32KB) + L1i L#230 (32KB) + Core L#230 + PU L#230 (P#230)
L2 L#231 (1024KB) + L1d L#231 (32KB) + L1i L#231 (32KB) + Core L#231 + PU L#231 (P#231)
L3 L#29 (32MB)
L2 L#232 (1024KB) + L1d L#232 (32KB) + L1i L#232 (32KB) + Core L#232 + PU L#232 (P#232)
L2 L#233 (1024KB) + L1d L#233 (32KB) + L1i L#233 (32KB) + Core L#233 + PU L#233 (P#233)
L2 L#234 (1024KB) + L1d L#234 (32KB) + L1i L#234 (32KB) + Core L#234 + PU L#234 (P#234)
L2 L#235 (1024KB) + L1d L#235 (32KB) + L1i L#235 (32KB) + Core L#235 + PU L#235 (P#235)
L2 L#236 (1024KB) + L1d L#236 (32KB) + L1i L#236 (32KB) + Core L#236 + PU L#236 (P#236)
L2 L#237 (1024KB) + L1d L#237 (32KB) + L1i L#237 (32KB) + Core L#237 + PU L#237 (P#237)
L2 L#238 (1024KB) + L1d L#238 (32KB) + L1i L#238 (32KB) + Core L#238 + PU L#238 (P#238)
L2 L#239 (1024KB) + L1d L#239 (32KB) + L1i L#239 (32KB) + Core L#239 + PU L#239 (P#239)
Group0 L#10
NUMANode L#10 (P#10 31GB)
L3 L#30 (32MB)
L2 L#240 (1024KB) + L1d L#240 (32KB) + L1i L#240 (32KB) + Core L#240 + PU L#240 (P#240)
L2 L#241 (1024KB) + L1d L#241 (32KB) + L1i L#241 (32KB) + Core L#241 + PU L#241 (P#241)
L2 L#242 (1024KB) + L1d L#242 (32KB) + L1i L#242 (32KB) + Core L#242 + PU L#242 (P#242)
L2 L#243 (1024KB) + L1d L#243 (32KB) + L1i L#243 (32KB) + Core L#243 + PU L#243 (P#243)
L2 L#244 (1024KB) + L1d L#244 (32KB) + L1i L#244 (32KB) + Core L#244 + PU L#244 (P#244)
L2 L#245 (1024KB) + L1d L#245 (32KB) + L1i L#245 (32KB) + Core L#245 + PU L#245 (P#245)
L2 L#246 (1024KB) + L1d L#246 (32KB) + L1i L#246 (32KB) + Core L#246 + PU L#246 (P#246)
L2 L#247 (1024KB) + L1d L#247 (32KB) + L1i L#247 (32KB) + Core L#247 + PU L#247 (P#247)
L3 L#31 (32MB)
L2 L#248 (1024KB) + L1d L#248 (32KB) + L1i L#248 (32KB) + Core L#248 + PU L#248 (P#248)
L2 L#249 (1024KB) + L1d L#249 (32KB) + L1i L#249 (32KB) + Core L#249 + PU L#249 (P#249)
L2 L#250 (1024KB) + L1d L#250 (32KB) + L1i L#250 (32KB) + Core L#250 + PU L#250 (P#250)
L2 L#251 (1024KB) + L1d L#251 (32KB) + L1i L#251 (32KB) + Core L#251 + PU L#251 (P#251)
L2 L#252 (1024KB) + L1d L#252 (32KB) + L1i L#252 (32KB) + Core L#252 + PU L#252 (P#252)
L2 L#253 (1024KB) + L1d L#253 (32KB) + L1i L#253 (32KB) + Core L#253 + PU L#253 (P#253)
L2 L#254 (1024KB) + L1d L#254 (32KB) + L1i L#254 (32KB) + Core L#254 + PU L#254 (P#254)
L2 L#255 (1024KB) + L1d L#255 (32KB) + L1i L#255 (32KB) + Core L#255 + PU L#255 (P#255)
L3 L#32 (32MB)
L2 L#256 (1024KB) + L1d L#256 (32KB) + L1i L#256 (32KB) + Core L#256 + PU L#256 (P#256)
L2 L#257 (1024KB) + L1d L#257 (32KB) + L1i L#257 (32KB) + Core L#257 + PU L#257 (P#257)
L2 L#258 (1024KB) + L1d L#258 (32KB) + L1i L#258 (32KB) + Core L#258 + PU L#258 (P#258)
L2 L#259 (1024KB) + L1d L#259 (32KB) + L1i L#259 (32KB) + Core L#259 + PU L#259 (P#259)
L2 L#260 (1024KB) + L1d L#260 (32KB) + L1i L#260 (32KB) + Core L#260 + PU L#260 (P#260)
L2 L#261 (1024KB) + L1d L#261 (32KB) + L1i L#261 (32KB) + Core L#261 + PU L#261 (P#261)
L2 L#262 (1024KB) + L1d L#262 (32KB) + L1i L#262 (32KB) + Core L#262 + PU L#262 (P#262)
L2 L#263 (1024KB) + L1d L#263 (32KB) + L1i L#263 (32KB) + Core L#263 + PU L#263 (P#263)
HostBridge
PCIBridge
PCI a1:00.0 (NVMExp)
Block(Disk) "nvme3n1"
Group0 L#11
NUMANode L#11 (P#11 31GB)
L3 L#33 (32MB)
L2 L#264 (1024KB) + L1d L#264 (32KB) + L1i L#264 (32KB) + Core L#264 + PU L#264 (P#264)
L2 L#265 (1024KB) + L1d L#265 (32KB) + L1i L#265 (32KB) + Core L#265 + PU L#265 (P#265)
L2 L#266 (1024KB) + L1d L#266 (32KB) + L1i L#266 (32KB) + Core L#266 + PU L#266 (P#266)
L2 L#267 (1024KB) + L1d L#267 (32KB) + L1i L#267 (32KB) + Core L#267 + PU L#267 (P#267)
L2 L#268 (1024KB) + L1d L#268 (32KB) + L1i L#268 (32KB) + Core L#268 + PU L#268 (P#268)
L2 L#269 (1024KB) + L1d L#269 (32KB) + L1i L#269 (32KB) + Core L#269 + PU L#269 (P#269)
L2 L#270 (1024KB) + L1d L#270 (32KB) + L1i L#270 (32KB) + Core L#270 + PU L#270 (P#270)
L2 L#271 (1024KB) + L1d L#271 (32KB) + L1i L#271 (32KB) + Core L#271 + PU L#271 (P#271)
L3 L#34 (32MB)
L2 L#272 (1024KB) + L1d L#272 (32KB) + L1i L#272 (32KB) + Core L#272 + PU L#272 (P#272)
L2 L#273 (1024KB) + L1d L#273 (32KB) + L1i L#273 (32KB) + Core L#273 + PU L#273 (P#273)
L2 L#274 (1024KB) + L1d L#274 (32KB) + L1i L#274 (32KB) + Core L#274 + PU L#274 (P#274)
L2 L#275 (1024KB) + L1d L#275 (32KB) + L1i L#275 (32KB) + Core L#275 + PU L#275 (P#275)
L2 L#276 (1024KB) + L1d L#276 (32KB) + L1i L#276 (32KB) + Core L#276 + PU L#276 (P#276)
L2 L#277 (1024KB) + L1d L#277 (32KB) + L1i L#277 (32KB) + Core L#277 + PU L#277 (P#277)
L2 L#278 (1024KB) + L1d L#278 (32KB) + L1i L#278 (32KB) + Core L#278 + PU L#278 (P#278)
L2 L#279 (1024KB) + L1d L#279 (32KB) + L1i L#279 (32KB) + Core L#279 + PU L#279 (P#279)
L3 L#35 (32MB)
L2 L#280 (1024KB) + L1d L#280 (32KB) + L1i L#280 (32KB) + Core L#280 + PU L#280 (P#280)
L2 L#281 (1024KB) + L1d L#281 (32KB) + L1i L#281 (32KB) + Core L#281 + PU L#281 (P#281)
L2 L#282 (1024KB) + L1d L#282 (32KB) + L1i L#282 (32KB) + Core L#282 + PU L#282 (P#282)
L2 L#283 (1024KB) + L1d L#283 (32KB) + L1i L#283 (32KB) + Core L#283 + PU L#283 (P#283)
L2 L#284 (1024KB) + L1d L#284 (32KB) + L1i L#284 (32KB) + Core L#284 + PU L#284 (P#284)
L2 L#285 (1024KB) + L1d L#285 (32KB) + L1i L#285 (32KB) + Core L#285 + PU L#285 (P#285)
L2 L#286 (1024KB) + L1d L#286 (32KB) + L1i L#286 (32KB) + Core L#286 + PU L#286 (P#286)
L2 L#287 (1024KB) + L1d L#287 (32KB) + L1i L#287 (32KB) + Core L#287 + PU L#287 (P#287)
HostBridge
PCIBridge
PCI b1:00.0 (NVMExp)
Block(Disk) "nvme6n1"
Package L#3
Group0 L#12
NUMANode L#12 (P#12 31GB)
L3 L#36 (32MB)
L2 L#288 (1024KB) + L1d L#288 (32KB) + L1i L#288 (32KB) + Core L#288 + PU L#288 (P#288)
L2 L#289 (1024KB) + L1d L#289 (32KB) + L1i L#289 (32KB) + Core L#289 + PU L#289 (P#289)
L2 L#290 (1024KB) + L1d L#290 (32KB) + L1i L#290 (32KB) + Core L#290 + PU L#290 (P#290)
L2 L#291 (1024KB) + L1d L#291 (32KB) + L1i L#291 (32KB) + Core L#291 + PU L#291 (P#291)
L2 L#292 (1024KB) + L1d L#292 (32KB) + L1i L#292 (32KB) + Core L#292 + PU L#292 (P#292)
L2 L#293 (1024KB) + L1d L#293 (32KB) + L1i L#293 (32KB) + Core L#293 + PU L#293 (P#293)
L2 L#294 (1024KB) + L1d L#294 (32KB) + L1i L#294 (32KB) + Core L#294 + PU L#294 (P#294)
L2 L#295 (1024KB) + L1d L#295 (32KB) + L1i L#295 (32KB) + Core L#295 + PU L#295 (P#295)
L3 L#37 (32MB)
L2 L#296 (1024KB) + L1d L#296 (32KB) + L1i L#296 (32KB) + Core L#296 + PU L#296 (P#296)
L2 L#297 (1024KB) + L1d L#297 (32KB) + L1i L#297 (32KB) + Core L#297 + PU L#297 (P#297)
L2 L#298 (1024KB) + L1d L#298 (32KB) + L1i L#298 (32KB) + Core L#298 + PU L#298 (P#298)
L2 L#299 (1024KB) + L1d L#299 (32KB) + L1i L#299 (32KB) + Core L#299 + PU L#299 (P#299)
L2 L#300 (1024KB) + L1d L#300 (32KB) + L1i L#300 (32KB) + Core L#300 + PU L#300 (P#300)
L2 L#301 (1024KB) + L1d L#301 (32KB) + L1i L#301 (32KB) + Core L#301 + PU L#301 (P#301)
L2 L#302 (1024KB) + L1d L#302 (32KB) + L1i L#302 (32KB) + Core L#302 + PU L#302 (P#302)
L2 L#303 (1024KB) + L1d L#303 (32KB) + L1i L#303 (32KB) + Core L#303 + PU L#303 (P#303)
L3 L#38 (32MB)
L2 L#304 (1024KB) + L1d L#304 (32KB) + L1i L#304 (32KB) + Core L#304 + PU L#304 (P#304)
L2 L#305 (1024KB) + L1d L#305 (32KB) + L1i L#305 (32KB) + Core L#305 + PU L#305 (P#305)
L2 L#306 (1024KB) + L1d L#306 (32KB) + L1i L#306 (32KB) + Core L#306 + PU L#306 (P#306)
L2 L#307 (1024KB) + L1d L#307 (32KB) + L1i L#307 (32KB) + Core L#307 + PU L#307 (P#307)
L2 L#308 (1024KB) + L1d L#308 (32KB) + L1i L#308 (32KB) + Core L#308 + PU L#308 (P#308)
L2 L#309 (1024KB) + L1d L#309 (32KB) + L1i L#309 (32KB) + Core L#309 + PU L#309 (P#309)
L2 L#310 (1024KB) + L1d L#310 (32KB) + L1i L#310 (32KB) + Core L#310 + PU L#310 (P#310)
L2 L#311 (1024KB) + L1d L#311 (32KB) + L1i L#311 (32KB) + Core L#311 + PU L#311 (P#311)
HostBridge
PCIBridge
PCI c1:00.0 (InfiniBand)
Net "ibp193s0"
OpenFabrics "mlx5_2"
Group0 L#13
NUMANode L#13 (P#13 31GB)
L3 L#39 (32MB)
L2 L#312 (1024KB) + L1d L#312 (32KB) + L1i L#312 (32KB) + Core L#312 + PU L#312 (P#312)
L2 L#313 (1024KB) + L1d L#313 (32KB) + L1i L#313 (32KB) + Core L#313 + PU L#313 (P#313)
L2 L#314 (1024KB) + L1d L#314 (32KB) + L1i L#314 (32KB) + Core L#314 + PU L#314 (P#314)
L2 L#315 (1024KB) + L1d L#315 (32KB) + L1i L#315 (32KB) + Core L#315 + PU L#315 (P#315)
L2 L#316 (1024KB) + L1d L#316 (32KB) + L1i L#316 (32KB) + Core L#316 + PU L#316 (P#316)
L2 L#317 (1024KB) + L1d L#317 (32KB) + L1i L#317 (32KB) + Core L#317 + PU L#317 (P#317)
L2 L#318 (1024KB) + L1d L#318 (32KB) + L1i L#318 (32KB) + Core L#318 + PU L#318 (P#318)
L2 L#319 (1024KB) + L1d L#319 (32KB) + L1i L#319 (32KB) + Core L#319 + PU L#319 (P#319)
L3 L#40 (32MB)
L2 L#320 (1024KB) + L1d L#320 (32KB) + L1i L#320 (32KB) + Core L#320 + PU L#320 (P#320)
L2 L#321 (1024KB) + L1d L#321 (32KB) + L1i L#321 (32KB) + Core L#321 + PU L#321 (P#321)
L2 L#322 (1024KB) + L1d L#322 (32KB) + L1i L#322 (32KB) + Core L#322 + PU L#322 (P#322)
L2 L#323 (1024KB) + L1d L#323 (32KB) + L1i L#323 (32KB) + Core L#323 + PU L#323 (P#323)
L2 L#324 (1024KB) + L1d L#324 (32KB) + L1i L#324 (32KB) + Core L#324 + PU L#324 (P#324)
L2 L#325 (1024KB) + L1d L#325 (32KB) + L1i L#325 (32KB) + Core L#325 + PU L#325 (P#325)
L2 L#326 (1024KB) + L1d L#326 (32KB) + L1i L#326 (32KB) + Core L#326 + PU L#326 (P#326)
L2 L#327 (1024KB) + L1d L#327 (32KB) + L1i L#327 (32KB) + Core L#327 + PU L#327 (P#327)
L3 L#41 (32MB)
L2 L#328 (1024KB) + L1d L#328 (32KB) + L1i L#328 (32KB) + Core L#328 + PU L#328 (P#328)
L2 L#329 (1024KB) + L1d L#329 (32KB) + L1i L#329 (32KB) + Core L#329 + PU L#329 (P#329)
L2 L#330 (1024KB) + L1d L#330 (32KB) + L1i L#330 (32KB) + Core L#330 + PU L#330 (P#330)
L2 L#331 (1024KB) + L1d L#331 (32KB) + L1i L#331 (32KB) + Core L#331 + PU L#331 (P#331)
L2 L#332 (1024KB) + L1d L#332 (32KB) + L1i L#332 (32KB) + Core L#332 + PU L#332 (P#332)
L2 L#333 (1024KB) + L1d L#333 (32KB) + L1i L#333 (32KB) + Core L#333 + PU L#333 (P#333)
L2 L#334 (1024KB) + L1d L#334 (32KB) + L1i L#334 (32KB) + Core L#334 + PU L#334 (P#334)
L2 L#335 (1024KB) + L1d L#335 (32KB) + L1i L#335 (32KB) + Core L#335 + PU L#335 (P#335)
Group0 L#14
NUMANode L#14 (P#14 31GB)
L3 L#42 (32MB)
L2 L#336 (1024KB) + L1d L#336 (32KB) + L1i L#336 (32KB) + Core L#336 + PU L#336 (P#336)
L2 L#337 (1024KB) + L1d L#337 (32KB) + L1i L#337 (32KB) + Core L#337 + PU L#337 (P#337)
L2 L#338 (1024KB) + L1d L#338 (32KB) + L1i L#338 (32KB) + Core L#338 + PU L#338 (P#338)
L2 L#339 (1024KB) + L1d L#339 (32KB) + L1i L#339 (32KB) + Core L#339 + PU L#339 (P#339)
L2 L#340 (1024KB) + L1d L#340 (32KB) + L1i L#340 (32KB) + Core L#340 + PU L#340 (P#340)
L2 L#341 (1024KB) + L1d L#341 (32KB) + L1i L#341 (32KB) + Core L#341 + PU L#341 (P#341)
L2 L#342 (1024KB) + L1d L#342 (32KB) + L1i L#342 (32KB) + Core L#342 + PU L#342 (P#342)
L2 L#343 (1024KB) + L1d L#343 (32KB) + L1i L#343 (32KB) + Core L#343 + PU L#343 (P#343)
L3 L#43 (32MB)
L2 L#344 (1024KB) + L1d L#344 (32KB) + L1i L#344 (32KB) + Core L#344 + PU L#344 (P#344)
L2 L#345 (1024KB) + L1d L#345 (32KB) + L1i L#345 (32KB) + Core L#345 + PU L#345 (P#345)
L2 L#346 (1024KB) + L1d L#346 (32KB) + L1i L#346 (32KB) + Core L#346 + PU L#346 (P#346)
L2 L#347 (1024KB) + L1d L#347 (32KB) + L1i L#347 (32KB) + Core L#347 + PU L#347 (P#347)
L2 L#348 (1024KB) + L1d L#348 (32KB) + L1i L#348 (32KB) + Core L#348 + PU L#348 (P#348)
L2 L#349 (1024KB) + L1d L#349 (32KB) + L1i L#349 (32KB) + Core L#349 + PU L#349 (P#349)
L2 L#350 (1024KB) + L1d L#350 (32KB) + L1i L#350 (32KB) + Core L#350 + PU L#350 (P#350)
L2 L#351 (1024KB) + L1d L#351 (32KB) + L1i L#351 (32KB) + Core L#351 + PU L#351 (P#351)
L3 L#44 (32MB)
L2 L#352 (1024KB) + L1d L#352 (32KB) + L1i L#352 (32KB) + Core L#352 + PU L#352 (P#352)
L2 L#353 (1024KB) + L1d L#353 (32KB) + L1i L#353 (32KB) + Core L#353 + PU L#353 (P#353)
L2 L#354 (1024KB) + L1d L#354 (32KB) + L1i L#354 (32KB) + Core L#354 + PU L#354 (P#354)
L2 L#355 (1024KB) + L1d L#355 (32KB) + L1i L#355 (32KB) + Core L#355 + PU L#355 (P#355)
L2 L#356 (1024KB) + L1d L#356 (32KB) + L1i L#356 (32KB) + Core L#356 + PU L#356 (P#356)
L2 L#357 (1024KB) + L1d L#357 (32KB) + L1i L#357 (32KB) + Core L#357 + PU L#357 (P#357)
L2 L#358 (1024KB) + L1d L#358 (32KB) + L1i L#358 (32KB) + Core L#358 + PU L#358 (P#358)
L2 L#359 (1024KB) + L1d L#359 (32KB) + L1i L#359 (32KB) + Core L#359 + PU L#359 (P#359)
HostBridge
PCIBridge
PCI e1:00.0 (NVMExp)
Block(Disk) "nvme9n1"
Group0 L#15
NUMANode L#15 (P#15 31GB)
L3 L#45 (32MB)
L2 L#360 (1024KB) + L1d L#360 (32KB) + L1i L#360 (32KB) + Core L#360 + PU L#360 (P#360)
L2 L#361 (1024KB) + L1d L#361 (32KB) + L1i L#361 (32KB) + Core L#361 + PU L#361 (P#361)
L2 L#362 (1024KB) + L1d L#362 (32KB) + L1i L#362 (32KB) + Core L#362 + PU L#362 (P#362)
L2 L#363 (1024KB) + L1d L#363 (32KB) + L1i L#363 (32KB) + Core L#363 + PU L#363 (P#363)
L2 L#364 (1024KB) + L1d L#364 (32KB) + L1i L#364 (32KB) + Core L#364 + PU L#364 (P#364)
L2 L#365 (1024KB) + L1d L#365 (32KB) + L1i L#365 (32KB) + Core L#365 + PU L#365 (P#365)
L2 L#366 (1024KB) + L1d L#366 (32KB) + L1i L#366 (32KB) + Core L#366 + PU L#366 (P#366)
L2 L#367 (1024KB) + L1d L#367 (32KB) + L1i L#367 (32KB) + Core L#367 + PU L#367 (P#367)
L3 L#46 (32MB)
L2 L#368 (1024KB) + L1d L#368 (32KB) + L1i L#368 (32KB) + Core L#368 + PU L#368 (P#368)
L2 L#369 (1024KB) + L1d L#369 (32KB) + L1i L#369 (32KB) + Core L#369 + PU L#369 (P#369)
L2 L#370 (1024KB) + L1d L#370 (32KB) + L1i L#370 (32KB) + Core L#370 + PU L#370 (P#370)
L2 L#371 (1024KB) + L1d L#371 (32KB) + L1i L#371 (32KB) + Core L#371 + PU L#371 (P#371)
L2 L#372 (1024KB) + L1d L#372 (32KB) + L1i L#372 (32KB) + Core L#372 + PU L#372 (P#372)
L2 L#373 (1024KB) + L1d L#373 (32KB) + L1i L#373 (32KB) + Core L#373 + PU L#373 (P#373)
L2 L#374 (1024KB) + L1d L#374 (32KB) + L1i L#374 (32KB) + Core L#374 + PU L#374 (P#374)
L2 L#375 (1024KB) + L1d L#375 (32KB) + L1i L#375 (32KB) + Core L#375 + PU L#375 (P#375)
L3 L#47 (32MB)
L2 L#376 (1024KB) + L1d L#376 (32KB) + L1i L#376 (32KB) + Core L#376 + PU L#376 (P#376)
L2 L#377 (1024KB) + L1d L#377 (32KB) + L1i L#377 (32KB) + Core L#377 + PU L#377 (P#377)
L2 L#378 (1024KB) + L1d L#378 (32KB) + L1i L#378 (32KB) + Core L#378 + PU L#378 (P#378)
L2 L#379 (1024KB) + L1d L#379 (32KB) + L1i L#379 (32KB) + Core L#379 + PU L#379 (P#379)
L2 L#380 (1024KB) + L1d L#380 (32KB) + L1i L#380 (32KB) + Core L#380 + PU L#380 (P#380)
L2 L#381 (1024KB) + L1d L#381 (32KB) + L1i L#381 (32KB) + Core L#381 + PU L#381 (P#381)
L2 L#382 (1024KB) + L1d L#382 (32KB) + L1i L#382 (32KB) + Core L#382 + PU L#382 (P#382)
L2 L#383 (1024KB) + L1d L#383 (32KB) + L1i L#383 (32KB) + Core L#383 + PU L#383 (P#383)
HostBridge
PCIBridge
PCI f1:00.0 (NVMExp)
Block(Disk) "nvme7n1"
Unfortunately I cannot share the CPU specs publicly. However, this is a CPU with regular x86-based cores.
We were wondering if there is a way to limit UCX to using only the nearest NIC, without spanning several NICs and without extra environment variables that may not be usable in general (the kind of per-rank pinning we currently rely on is sketched below).
Can you please take a look at the UCX logs (shared at the top of this report) and let us know if our suspicion, that more than one NIC is being used for transfers, is correct?
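For reference, our current per-rank pinning looks roughly like the wrapper below. This is a sketch, not a general solution: the device names, the packed-per-socket rank layout, and the use of Open MPI's OMPI_COMM_WORLD_LOCAL_RANK variable are all assumptions specific to our setup.
#!/bin/bash
# pick_nic.sh -- sketch: bind each rank to the NIC on its own socket.
# Assumes Open MPI (OMPI_COMM_WORLD_LOCAL_RANK), 4 NICs, ranks packed per socket.
NICS=(mlx5_0 mlx5_1 mlx5_2 mlx5_3)    # illustrative device names
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
RANKS_PER_NODE=384                    # fully populated node, as noted below
RANKS_PER_NIC=$(( RANKS_PER_NODE / ${#NICS[@]} ))
export UCX_NET_DEVICES=${NICS[$(( LOCAL_RANK / RANKS_PER_NIC ))]}:1
exec "$@"                             # usage: mpirun ... ./pick_nic.sh ./app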
Hi @arun-chandran-edarath
Thank you very much for the suggestion. Our workloads mostly fully populate the entire node (384 ranks per node in this case), so this didn't help us (i.e., using a new build of UCX and setting UCX_NT_BUFFER_TRANSFER_MIN=0); nevertheless, it is excellent work.
@arstgr Thank you so much for trying it out. Yes, in its current form, NT_BUFFER_TRANSFER is set up to help hybrid MPI workloads (1 rank per L3).
I missed one important point: if the buffer size being transferred is more than three-fourths of the L3 cache size, NT_BUFFER_TRANSFER should also help full-rank MPI workloads.
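As a concrete illustration with the 32 MB L3 caches shown in the topology above: three-fourths of 32 MB is 24 MB, so transfers of roughly 24 MB and above should benefit even on fully populated nodes. A hedged sketch follows; note that UCX_NT_BUFFER_TRANSFER_MIN is the threshold variable from the experimental build discussed above, not a stock UCX option.
# 3/4 * 32 MB L3 = 24 MB = 25165824 bytes; experimental build only
mpirun -np 2 -x UCX_NT_BUFFER_TRANSFER_MIN=25165824 ./osu_bibw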
Hi @arstgr,
We were wondering if there is a way to limit UCX to using only the nearest NIC, without spanning several NICs and without extra environment variables that may not be usable in general.
The main method of applying a workaround (WA) in UCX is setting environment variables to change the default behaviour (see the examples below). If that's not an option, then we'll need to investigate this issue further and, if needed, provide a fix in the next release.
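For example, the two workarounds discussed in this thread are both plain environment variables (device name illustrative; the single-device form needs matching CPU pinning):
# limit the rendezvous protocol to a single rail per rank:
mpirun -x UCX_MAX_RNDV_RAILS=1 ./app
# or restrict UCX to a single device entirely:
mpirun -x UCX_NET_DEVICES=mlx5_1:1 ./app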
Can you please take a look at the UCX logs (shared at the top of this report) and let us know if our suspicion, that more than one NIC is being used for transfers, is correct?
That's correct: the number of NICs used for rendezvous transfers is determined by UCX_MAX_RNDV_RAILS, which defaults to 2 for NDR.
If you wish to investigate further, please send the MPI command line plus the output results, so we can better understand the root issue.
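To verify the effective default on a given installation, the UCX configuration dump can be inspected; a minimal sketch (the grep pattern may need adjusting):
ucx_info -c | grep -i RNDV_RAILS
# expected on an NDR system: UCX_MAX_RNDV_RAILS=2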
Hi @shasson5
Thanks for looking into this issue. I think the default behavior for UCX_MAX_RNDV_RAILS should be to use either multiple virtual lanes within the same adapter, or multiple physical adapters when they sit on the same package. When this is not the case (i.e., multiple physical adapters on different packages are used), there is always a performance hit, as in our current test environment.
The MPI command line to reproduce this, along with the output results and the ucx log files are listed at the top of this bug report.
Hi @arstgr
There shouldn't be any performance hit as a result of using NICs on a remote package (as long as local NICs are prioritized).
The results you listed above are for osu_bibw. According to the output, you get roughly 90 GB/s (90,000 MB/s) when running with default params (multiple NICs), and only about 50 GB/s when running with 1 NIC.
So I cannot understand where exactly you see a degradation. Am I missing something?
Hi @arstgr, we need your response if the issue is still relevant.
No customer response, closing.