
UCX using multiple interfaces causing performance to drop due to higher latency

Open arstgr opened this issue 8 months ago • 13 comments

Describe the bug

On a system whose nodes have quad-socket CPUs, where each socket has its own dedicated NIC, UCX seems to be using more than one NIC for a given MPI rank. Because the latency for crossing sockets is high, this leads to performance degradation. The NICs on this system are NDR IB NICs operating at 200 Gbps, so the peak bidirectional bandwidth per NIC should be about 50 GB/s (200 Gbps ≈ 25 GB/s per direction). When we run OSU's Bi-BW test, however, we see up to 90 GB/s of bandwidth. Investigation of the UCX logs (obtained with UCX_LOG_LEVEL=data) for this test suggests all 4 NICs are being used for the transfers.

When we enforce using only 1 NIC, e.g. via UCX_NET_DEVICES=mlx5_3:1 with proper pinning, the Bi-BW obtained from the OSU test peaks at 50 GB/s, and investigation of the logs confirms that only the specified NIC is used for the transfers. However, using only 1 NIC adversely affects the scaling efficiency of workloads.

To mitigate further, we used UCX_MAX_RNDV_RAILS=1 (with no other UCX environment variable); this also brings the Bi-BW obtained with the OSU test back down to 50 GB/s. However, investigation of the logs suggests that all 4 NICs are still being used.
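For reference, a run combining both settings (not executed here; it assumes mlx5_3 is the NIC local to the pinned rank, as in the single-NIC test below) would look like:

mpirun -np 2 --map-by ppr:1:node -x PATH -x LD_LIBRARY_PATH -H hpc-04:1,hpc-02:1 -x UCX_MAX_RNDV_RAILS=1 -x UCX_NET_DEVICES=mlx5_3:1 ./osu_bibw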

UCX logs are attached here

  • with -x UCX_MAX_RNDV_RAILS=1
  • with -x UCX_NET_DEVICES=mlx5_3:1
  • Default with no extra UCX env variables

Steps to Reproduce

Default case

  • Command line
mpirun -np 2 --map-by ppr:1:node -x PATH -x LD_LIBRARY_PATH -H hpc-04:1,hpc-02:1 ./osu_bibw

# OSU MPI Bi-Directional Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       8.46
2                      17.32
4                      34.04
8                      68.96
16                    139.40
32                    284.47
64                    473.55
128                   856.83
256                  1480.08
512                  2892.64
1024                 5554.62
2048                 9243.91
4096                16360.18
8192                25914.67
16384               35251.82
32768               61789.94
65536               68835.51
131072              83308.48
262144              90082.03
524288              91742.73
1048576             91501.88
2097152             91421.49
4194304             90710.37

Using UCX_NET_DEVICES=mlx5_3:1

mpirun -np 2 --map-by ppr:1:node -x PATH -x LD_LIBRARY_PATH -H hpc-04:1,hpc-02:1 -x UCX_NET_DEVICES=mlx5_3:1 ./osu_bibw

# OSU MPI Bi-Directional Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       8.82
2                      18.07
4                      36.88
8                      71.19
16                    144.49
32                    284.73
64                    464.56
128                   846.72
256                  1530.21
512                  2991.00
1024                 5977.21
2048                10069.56
4096                17430.70
8192                26506.61
16384               35462.24
32768               41263.31
65536               44711.44
131072              47028.48
262144              48088.30
524288              48895.34
1048576             49181.93
2097152             49328.65
4194304             49400.91

Using UCX_MAX_RNDV_RAILS=1

mpirun -np 2 --map-by ppr:1:node -x PATH -x LD_LIBRARY_PATH -H hpc-04:1,hpc-02:1 -x UCX_MAX_RNDV_RAILS=1 ./osu_bibw

# OSU MPI Bi-Directional Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       8.92
2                      18.40
4                      36.52
8                      70.81
16                    149.20
32                    291.59
64                    464.90
128                   855.56
256                  1497.61
512                  2886.56
1024                 6107.93
2048                 9917.07
4096                17107.22
8192                26048.81
16384               35111.41
32768               41246.49
65536               43126.46
131072              47215.79
262144              48303.64
524288              48877.88
1048576             48830.47
2097152             49324.69
4194304             49400.70
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
ucx_info -v
# Library version: 1.18.0
# Library path: /opt/hpcx-v2.21.2-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx/mt/lib/libucs.so.0
# API headers version: 1.18.0
# Git branch '', revision 152bf42
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --with-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.6.1/redhat8 --with-gdrcopy --prefix=/build-result/hpcx-v2.21.2-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37/redhat7
  • Any UCX environment variables used

Separately tested with default parameters (i.e. no ucx environment variables), with UCX_NET_DEVICES specified, and with UCX_MAX_RNDV_RAILS=1

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
AlmaLinux release 9.5 (Teal Serval)
NAME="AlmaLinux"
VERSION="9.5 (Teal Serval)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.5"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.5 (Teal Serval)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.5"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.5"
SUPPORT_END=2032-06-01
  • For RDMA/IB/RoCE related issues:
    • Driver version:
ofed_info -s
OFED-internal-25.01-0.6.0:
  • HW information from ibstat or ibv_devinfo -vv command
ibstat
CA 'mlx5_0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.44.1204
        Hardware version: 0
        Node GUID: 0xb83fd2030075786a
        System image GUID: 0xb83fd2030075786a
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 13
                LMC: 0
                SM lid: 1
                Capability mask: 0xa751e848
                Port GUID: 0xb83fd2030075786a
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.44.1204
        Hardware version: 0
        Node GUID: 0xb83fd2030085e31e
        System image GUID: 0xb83fd2030085e31e
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 3
                LMC: 0
                SM lid: 1
                Capability mask: 0xa751e848
                Port GUID: 0xb83fd2030085e31e
                Link layer: InfiniBand
CA 'mlx5_2'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.44.1204
        Hardware version: 0
        Node GUID: 0xb83fd203008b8c5a
        System image GUID: 0xb83fd203008b8c5a
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 11
                LMC: 0
                SM lid: 1
                Capability mask: 0xa751e848
                Port GUID: 0xb83fd203008b8c5a
                Link layer: InfiniBand
CA 'mlx5_3'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.44.1204
        Hardware version: 0
        Node GUID: 0xb83fd203008b8c58
        System image GUID: 0xb83fd203008b8c58
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 10
                LMC: 0
                SM lid: 1
                Capability mask: 0xa751e848
                Port GUID: 0xb83fd203008b8c58
                Link layer: InfiniBand

Additional information (depending on the issue)

  • OpenMPI version
ompi_info
                 Package: Open MPI root@hpc-kernel-03 Distribution
                Open MPI: 4.1.7rc1
  • Output of ucx_info -d to show transports and devices recognized by UCX
ucx_info -d
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#           rkey_ptr is supported
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: self
#         Device: memory
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 19360.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: tcp
#         Device: ens1f1
#           Type: network
#  System device: ens1f1 (0)
#
#      capabilities:
#            bandwidth: 11.32/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 262919740K
#           remote key: 24 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: mlx5_0
#     Component: ib
#             allocate: <= 256K
#             register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#         memory types: host (access,reg,cache), rdma (alloc,cache)
#
#      Transport: rc_verbs
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (1)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 70
#     device num paths: 2
#              max eps: 256
#       device address: 3 bytes
#           ep address: 7 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (1)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3992
#           connection: to ep, to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: dc_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (1)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 660 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#      Transport: rc_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (1)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 70
#     device num paths: 2
#              max eps: 256
#       device address: 3 bytes
#           ep address: 10 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (1)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_1
#     Component: ib
#             allocate: <= 256K
#             register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#         memory types: host (access,reg,cache), rdma (alloc,cache)
#
#      Transport: rc_verbs
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (2)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 70
#     device num paths: 2
#              max eps: 256
#       device address: 3 bytes
#           ep address: 7 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (2)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3992
#           connection: to ep, to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: dc_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (2)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 660 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#      Transport: rc_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (2)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 70
#     device num paths: 2
#              max eps: 256
#       device address: 3 bytes
#           ep address: 10 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (2)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_2
#     Component: ib
#             allocate: <= 256K
#             register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#         memory types: host (access,reg,cache), rdma (alloc,cache)
#
#      Transport: rc_verbs
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (3)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 70
#     device num paths: 2
#              max eps: 256
#       device address: 3 bytes
#           ep address: 7 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (3)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3992
#           connection: to ep, to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: dc_mlx5
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (3)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 660 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#      Transport: rc_mlx5
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (3)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 70
#     device num paths: 2
#              max eps: 256
#       device address: 3 bytes
#           ep address: 10 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_2:1
#           Type: network
#  System device: mlx5_2 (3)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_3
#     Component: ib
#             allocate: <= 256K
#             register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#         memory types: host (access,reg,cache), rdma (alloc,cache)
#
#      Transport: rc_verbs
#         Device: mlx5_3:1
#           Type: network
#  System device: mlx5_3 (4)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 70
#     device num paths: 2
#              max eps: 256
#       device address: 3 bytes
#           ep address: 7 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_3:1
#           Type: network
#  System device: mlx5_3 (4)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3992
#           connection: to ep, to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: dc_mlx5
#         Device: mlx5_3:1
#           Type: network
#  System device: mlx5_3 (4)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 660 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#      Transport: rc_mlx5
#         Device: mlx5_3:1
#           Type: network
#  System device: mlx5_3 (4)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 600 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 70
#     device num paths: 2
#              max eps: 256
#       device address: 3 bytes
#           ep address: 10 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_3:1
#           Type: network
#  System device: mlx5_3 (4)
#
#      capabilities:
#            bandwidth: 22873.66/ppn + 0.00 MB/sec
#              latency: 630 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 70
#     device num paths: 2
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_0
#     Component: gga
#             allocate: <= 256K
#             register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#         memory types: host (access,reg,cache), rdma (alloc,cache)
#   < no supported devices found >
#
# Memory domain: mlx5_1
#     Component: gga
#             allocate: <= 256K
#             register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#         memory types: host (access,reg,cache), rdma (alloc,cache)
#   < no supported devices found >
#
# Memory domain: mlx5_2
#     Component: gga
#             allocate: <= 256K
#             register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#         memory types: host (access,reg,cache), rdma (alloc,cache)
#   < no supported devices found >
#
# Memory domain: mlx5_3
#     Component: gga
#             allocate: <= 256K
#             register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#           memory invalidation is supported
#         memory types: host (access,reg,cache), rdma (alloc,cache)
#   < no supported devices found >
#
# Connection manager: rdmacm
#      max_conn_priv: 54 bytes
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
  • Configure result - config.log N/A
  • Log file - configure UCX with "--enable-logging" and run with "UCX_LOG_LEVEL=data": attached

arstgr · May 05 '25 16:05

@yosefe for review and consideration. Many thanks in advance

arstgr · May 05 '25 16:05

Hi @arstgr

I'm not sure I understand which degradation you refer to. As you described, running osu_bibw with multiple NICs resulted in better performance than with 1 NIC (50 GB/s => 90 GB/s). Crossing sockets may be the reason for not reaching full wire speed, but this is expected when running a simple point-to-point benchmark.

Can you please explain what is your expected performance in this scenario?

shasson5 · May 14 '25 18:05

Hi @shasson5, thank you for looking into this; we highly appreciate it.

We see a performance degradation (i.e. low scaling efficiency) when running MPI jobs. Each node of the system has 4 sockets and 4 NICs, with each NIC assigned to a single socket. When we troubleshoot, we find that limiting the number of rndv rails to 1 resolves the issue. We suspect each MPI rank is using more than 1 NIC, and since the cost of crossing a socket is high on this system, that leads to high latency.

We tried to investigate this issue further using the OSU benchmarks. When running a p2p test like osu_bibw with only 1 rank per node, we see that:

a) the bandwidth is higher than the theoretical limit for 1 NIC (we suspect this points to multiple NICs being used);
b) the UCX logs suggest QPs are attached to multiple NICs (see the log check sketched below);
c) limiting the MPI rank to only 1 NIC resolves the issue, reducing the BW to the limit for 1 NIC;
d) since the latency for crossing sockets is high, performance suffers when multiple NICs are used per MPI rank.
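One quick way to run check (b) on a UCX_LOG_LEVEL=data log is to count the HCA names that appear in it (a sketch; ucx.log stands in for the attached log file):

grep -oE 'mlx5_[0-9]' ucx.log | sort | uniq -c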

We need help understanding the source of this issue further, and we need help finding a mitigation that works even for applications that ship their own MPI wrappers.

Our current solution is to limit the number of rndv rails to 1 and assign each MPI rank its own NIC at launch time; a sketch of such a wrapper follows.
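A minimal sketch of that launch-time assignment (a hypothetical wrapper, not the exact script used here; it assumes Open MPI's OMPI_COMM_WORLD_LOCAL_RANK, one rank per socket, and the socket-to-NIC order reported later in this thread):

#!/bin/bash
# pick_nic.sh - hypothetical per-rank wrapper (sketch).
# Assumes local ranks 0..3 land on sockets 0..3 (e.g. --map-by ppr:1:socket)
# and the socket->NIC order: 0->mlx5_3, 1->mlx5_0, 2->mlx5_1, 3->mlx5_2.
NICS=(mlx5_3 mlx5_0 mlx5_1 mlx5_2)
export UCX_NET_DEVICES="${NICS[$OMPI_COMM_WORLD_LOCAL_RANK]}:1"
export UCX_MAX_RNDV_RAILS=1
exec "$@"

launched e.g. as: mpirun --map-by ppr:1:socket --bind-to socket ./pick_nic.sh ./app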

arstgr · May 14 '25 19:05

Hi @arstgr, thanks for the detailed information.

Can you please try to run the following commands simultaneously on one node:

taskset -c 0 ib_write_bw -d mlx5_0 -b --run_infinitely -F -p 18514
taskset -c 1 ib_write_bw -d mlx5_1 -b --run_infinitely -F -p 18515

and the following commands on another node:

taskset -c 0 ib_write_bw -d mlx5_0 -b --run_infinitely -F -p 18514 <node1>
taskset -c 1 ib_write_bw -d mlx5_1 -b --run_infinitely -F -p 18515 <node1>

where <node1> should be replaced with the IP/hostname of the first node. You can download ib_write_bw from https://github.com/linux-rdma/perftest if needed.

Also, can you share CPU arch information?

shasson5 · May 15 '25 16:05

Hi @shasson5

Thank you so much for looking into this issue. This is a quad-socket system, with 96 x86 cores per socket and 1 IB NIC assigned to each socket. The ordering is a bit skewed:

socket 0, cores 0-95    -> mlx5_3
socket 1, cores 96-191  -> mlx5_0
socket 2, cores 192-287 -> mlx5_1
socket 3, cores 288-383 -> mlx5_2
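One way to verify such a mapping is to read each HCA's NUMA node from sysfs (a sketch using standard sysfs attributes; on this 16-NUMA-node layout, node N would belong to socket N/4, which can be cross-checked with numactl -H or lstopo):

for d in /sys/class/infiniband/mlx5_*; do
    echo "$(basename "$d") -> NUMA node $(cat "$d"/device/numa_node)"
done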

I ran 2 sets of tests (I wasn't sure what the intention was, so I tried to cover it as broadly as possible). In the first set, ib_write_bw was pinned to core 0 (on socket 0, using mlx5_3) and core 96 (on socket 1, using mlx5_0); here are the results:

 WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
                    RDMA_Write Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_3
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x17 QPN 0x00b2 PSN 0x5dcd59 RKey 0x1fff0c VAddr 0x0014f8d6631000
 remote address: LID 0x0e QPN 0x5f58 PSN 0x6c8156 RKey 0x00b1e9 VAddr 0x0014e31a47a000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      1889032          0.00               47074.70                     0.753195
 65536      1883135          0.00               47080.23                     0.753284
 65536      1883139          0.00               47080.30                     0.753285
 65536      1883111          0.00               47079.79                     0.753277
 65536      1883108          0.00               47080.02                     0.753280

and

 WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
                    RDMA_Write Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0b QPN 0x00b2 PSN 0xd76860 RKey 0x1fff0c VAddr 0x00149901ed9000
 remote address: LID 0x11 QPN 0x5fc5 PSN 0xa55ff8 RKey 0x0213f4 VAddr 0x00153e55c24000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      1887749          0.00               47080.65                     0.753290
 65536      1883341          0.00               47085.11                     0.753362
 65536      1883338          0.00               47084.90                     0.753358
 65536      1883334          0.00               47084.75                     0.753356
 65536      1883314          0.00               47084.31                     0.753349

In the second set, ib_write_bw was pinned to core 0 (on socket 0, using mlx5_3) and core 1 (also on socket 0, but using mlx5_0); here are the results:

 WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
                    RDMA_Write Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_3
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x17 QPN 0x00b3 PSN 0x7aa19c RKey 0x1fff00 VAddr 0x001502ec3ad000
 remote address: LID 0x0e QPN 0x5f59 PSN 0x260c1 RKey 0x00b1ea VAddr 0x0014649fc55000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      1887500          0.00               47074.11                     0.753186
 65536      1882996          0.00               47076.18                     0.753219
 65536      1883009          0.00               47077.02                     0.753232
 65536      1883026          0.00               47077.14                     0.753234
 65536      1883015          0.00               47076.85                     0.753230

and

 WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
                    RDMA_Write Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0b QPN 0x00b3 PSN 0x15669 RKey 0x1fff00 VAddr 0x0014b7a76a7000
 remote address: LID 0x11 QPN 0x5fc6 PSN 0xdbbfac RKey 0x0213f5 VAddr 0x0014708766d000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      1795328          0.00               44703.13                     0.715250
 65536      1791620          0.00               44791.44                     0.716663
 65536      1793567          0.00               44840.12                     0.717442
 65536      1793313          0.00               44834.22                     0.717347
 65536      1793457          0.00               44837.83                     0.717405

Interestingly, the run where core 1 (on socket 0) uses mlx5_0 (attached to socket 1) shows lower BW.

We really appreciate your help. Please let us know if there is any additional info you need. Thank you very much.

arstgr · May 15 '25 22:05

Hi @arstgr

Thanks for running the tests; indeed, the second set is the relevant one. Can you please also run the following commands:

  1. numactl -H
  2. cat /proc/cpuinfo
  3. lscpu
  4. lstopo (installed from package hwloc)

Also, can you share the original MPI command you used and its results? If this is a custom MPI application, what traffic pattern does it use (one-to-many, many-to-many, multiple pairs)? Please add the --display-map flag to the MPI command you run, so the output is more verbose for debugging.

Thanks

shasson5 · May 18 '25 10:05

@arstgr To reduce latency in CPU-involved point-to-point transfers, we have implemented the changes in https://github.com/openucx/ucx/pull/9408. Please give it a try.

arun-chandran-edarath · May 19 '25 07:05

Hi @shasson5, thank you very much for your help.

Here is the numa config

numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 30755 MB
node 0 free: 30035 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 32247 MB
node 1 free: 31891 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 32247 MB
node 2 free: 31925 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 32247 MB
node 3 free: 31958 MB
node 4 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 4 size: 31957 MB
node 4 free: 31325 MB
node 5 cpus: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 5 size: 32247 MB
node 5 free: 31827 MB
node 6 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
node 6 size: 32247 MB
node 6 free: 31916 MB
node 7 cpus: 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 7 size: 32247 MB
node 7 free: 31950 MB
node 8 cpus: 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
node 8 size: 31957 MB
node 8 free: 31423 MB
node 9 cpus: 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 9 size: 32247 MB
node 9 free: 31894 MB
node 10 cpus: 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263
node 10 size: 32247 MB
node 10 free: 31907 MB
node 11 cpus: 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287
node 11 size: 32205 MB
node 11 free: 31897 MB
node 12 cpus: 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311
node 12 size: 31957 MB
node 12 free: 31279 MB
node 13 cpus: 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335
node 13 size: 32247 MB
node 13 free: 31891 MB
node 14 cpus: 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359
node 14 size: 32247 MB
node 14 free: 31907 MB
node 15 cpus: 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383
node 15 size: 32223 MB
node 15 free: 31863 MB
node distances:
node     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
   0:   10   12   12   12   32   32   32   32   32   32   32   32   32   32   32   32
   1:   12   10   12   12   32   32   32   32   32   32   32   32   32   32   32   32
   2:   12   12   10   12   32   32   32   32   32   32   32   32   32   32   32   32
   3:   12   12   12   10   32   32   32   32   32   32   32   32   32   32   32   32
   4:   32   32   32   32   10   12   12   12   32   32   32   32   32   32   32   32
   5:   32   32   32   32   12   10   12   12   32   32   32   32   32   32   32   32
   6:   32   32   32   32   12   12   10   12   32   32   32   32   32   32   32   32
   7:   32   32   32   32   12   12   12   10   32   32   32   32   32   32   32   32
   8:   32   32   32   32   32   32   32   32   10   12   12   12   32   32   32   32
   9:   32   32   32   32   32   32   32   32   12   10   12   12   32   32   32   32
  10:   32   32   32   32   32   32   32   32   12   12   10   12   32   32   32   32
  11:   32   32   32   32   32   32   32   32   12   12   12   10   32   32   32   32
  12:   32   32   32   32   32   32   32   32   32   32   32   32   10   12   12   12
  13:   32   32   32   32   32   32   32   32   32   32   32   32   12   10   12   12
  14:   32   32   32   32   32   32   32   32   32   32   32   32   12   12   10   12
  15:   32   32   32   32   32   32   32   32   32   32   32   32   12   12   12   10

and lstopo

lstopo-no-graphics
Machine (502GB total)
  Package L#0
    Group0 L#0
      NUMANode L#0 (P#0 30GB)
      L3 L#0 (32MB)
        L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
        L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
        L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
        L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
        L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
        L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
        L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
        L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L3 L#1 (32MB)
        L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
        L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
        L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
        L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
        L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
        L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
        L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
        L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L3 L#2 (32MB)
        L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
        L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
        L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
        L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
        L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
        L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
        L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
        L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
      HostBridge
        PCIBridge
          PCI 01:00.0 (InfiniBand)
            Net "ibp1s0"
            OpenFabrics "mlx5_3"
    Group0 L#1
      NUMANode L#1 (P#1 31GB)
      L3 L#3 (32MB)
        L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
        L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
        L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
        L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
        L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
        L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
        L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
        L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
      L3 L#4 (32MB)
        L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
        L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
        L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
        L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
        L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
        L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
        L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
        L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
      L3 L#5 (32MB)
        L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
        L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
        L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
        L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
        L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
        L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
        L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
        L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
      HostBridge
        PCIBridge
          PCIBridge
            PCI 12:00.0 (VGA)
        PCIBridge
          PCI 13:00.0 (NVMExp)
            Block(Disk) "nvme0n1"
        PCIBridge
          PCI 15:00.0 (NVMExp)
            Block(Disk) "nvme5n1"
    Group0 L#2
      NUMANode L#2 (P#2 31GB)
      L3 L#6 (32MB)
        L2 L#48 (1024KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48 + PU L#48 (P#48)
        L2 L#49 (1024KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49 + PU L#49 (P#49)
        L2 L#50 (1024KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50 + PU L#50 (P#50)
        L2 L#51 (1024KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51 + PU L#51 (P#51)
        L2 L#52 (1024KB) + L1d L#52 (32KB) + L1i L#52 (32KB) + Core L#52 + PU L#52 (P#52)
        L2 L#53 (1024KB) + L1d L#53 (32KB) + L1i L#53 (32KB) + Core L#53 + PU L#53 (P#53)
        L2 L#54 (1024KB) + L1d L#54 (32KB) + L1i L#54 (32KB) + Core L#54 + PU L#54 (P#54)
        L2 L#55 (1024KB) + L1d L#55 (32KB) + L1i L#55 (32KB) + Core L#55 + PU L#55 (P#55)
      L3 L#7 (32MB)
        L2 L#56 (1024KB) + L1d L#56 (32KB) + L1i L#56 (32KB) + Core L#56 + PU L#56 (P#56)
        L2 L#57 (1024KB) + L1d L#57 (32KB) + L1i L#57 (32KB) + Core L#57 + PU L#57 (P#57)
        L2 L#58 (1024KB) + L1d L#58 (32KB) + L1i L#58 (32KB) + Core L#58 + PU L#58 (P#58)
        L2 L#59 (1024KB) + L1d L#59 (32KB) + L1i L#59 (32KB) + Core L#59 + PU L#59 (P#59)
        L2 L#60 (1024KB) + L1d L#60 (32KB) + L1i L#60 (32KB) + Core L#60 + PU L#60 (P#60)
        L2 L#61 (1024KB) + L1d L#61 (32KB) + L1i L#61 (32KB) + Core L#61 + PU L#61 (P#61)
        L2 L#62 (1024KB) + L1d L#62 (32KB) + L1i L#62 (32KB) + Core L#62 + PU L#62 (P#62)
        L2 L#63 (1024KB) + L1d L#63 (32KB) + L1i L#63 (32KB) + Core L#63 + PU L#63 (P#63)
      L3 L#8 (32MB)
        L2 L#64 (1024KB) + L1d L#64 (32KB) + L1i L#64 (32KB) + Core L#64 + PU L#64 (P#64)
        L2 L#65 (1024KB) + L1d L#65 (32KB) + L1i L#65 (32KB) + Core L#65 + PU L#65 (P#65)
        L2 L#66 (1024KB) + L1d L#66 (32KB) + L1i L#66 (32KB) + Core L#66 + PU L#66 (P#66)
        L2 L#67 (1024KB) + L1d L#67 (32KB) + L1i L#67 (32KB) + Core L#67 + PU L#67 (P#67)
        L2 L#68 (1024KB) + L1d L#68 (32KB) + L1i L#68 (32KB) + Core L#68 + PU L#68 (P#68)
        L2 L#69 (1024KB) + L1d L#69 (32KB) + L1i L#69 (32KB) + Core L#69 + PU L#69 (P#69)
        L2 L#70 (1024KB) + L1d L#70 (32KB) + L1i L#70 (32KB) + Core L#70 + PU L#70 (P#70)
        L2 L#71 (1024KB) + L1d L#71 (32KB) + L1i L#71 (32KB) + Core L#71 + PU L#71 (P#71)
      HostBridge
        PCIBridge
          PCI 21:00.0 (NVMExp)
            Block(Disk) "nvme8n1"
    Group0 L#3
      NUMANode L#3 (P#3 31GB)
      L3 L#9 (32MB)
        L2 L#72 (1024KB) + L1d L#72 (32KB) + L1i L#72 (32KB) + Core L#72 + PU L#72 (P#72)
        L2 L#73 (1024KB) + L1d L#73 (32KB) + L1i L#73 (32KB) + Core L#73 + PU L#73 (P#73)
        L2 L#74 (1024KB) + L1d L#74 (32KB) + L1i L#74 (32KB) + Core L#74 + PU L#74 (P#74)
        L2 L#75 (1024KB) + L1d L#75 (32KB) + L1i L#75 (32KB) + Core L#75 + PU L#75 (P#75)
        L2 L#76 (1024KB) + L1d L#76 (32KB) + L1i L#76 (32KB) + Core L#76 + PU L#76 (P#76)
        L2 L#77 (1024KB) + L1d L#77 (32KB) + L1i L#77 (32KB) + Core L#77 + PU L#77 (P#77)
        L2 L#78 (1024KB) + L1d L#78 (32KB) + L1i L#78 (32KB) + Core L#78 + PU L#78 (P#78)
        L2 L#79 (1024KB) + L1d L#79 (32KB) + L1i L#79 (32KB) + Core L#79 + PU L#79 (P#79)
      L3 L#10 (32MB)
        L2 L#80 (1024KB) + L1d L#80 (32KB) + L1i L#80 (32KB) + Core L#80 + PU L#80 (P#80)
        L2 L#81 (1024KB) + L1d L#81 (32KB) + L1i L#81 (32KB) + Core L#81 + PU L#81 (P#81)
        L2 L#82 (1024KB) + L1d L#82 (32KB) + L1i L#82 (32KB) + Core L#82 + PU L#82 (P#82)
        L2 L#83 (1024KB) + L1d L#83 (32KB) + L1i L#83 (32KB) + Core L#83 + PU L#83 (P#83)
        L2 L#84 (1024KB) + L1d L#84 (32KB) + L1i L#84 (32KB) + Core L#84 + PU L#84 (P#84)
        L2 L#85 (1024KB) + L1d L#85 (32KB) + L1i L#85 (32KB) + Core L#85 + PU L#85 (P#85)
        L2 L#86 (1024KB) + L1d L#86 (32KB) + L1i L#86 (32KB) + Core L#86 + PU L#86 (P#86)
        L2 L#87 (1024KB) + L1d L#87 (32KB) + L1i L#87 (32KB) + Core L#87 + PU L#87 (P#87)
      L3 L#11 (32MB)
        L2 L#88 (1024KB) + L1d L#88 (32KB) + L1i L#88 (32KB) + Core L#88 + PU L#88 (P#88)
        L2 L#89 (1024KB) + L1d L#89 (32KB) + L1i L#89 (32KB) + Core L#89 + PU L#89 (P#89)
        L2 L#90 (1024KB) + L1d L#90 (32KB) + L1i L#90 (32KB) + Core L#90 + PU L#90 (P#90)
        L2 L#91 (1024KB) + L1d L#91 (32KB) + L1i L#91 (32KB) + Core L#91 + PU L#91 (P#91)
        L2 L#92 (1024KB) + L1d L#92 (32KB) + L1i L#92 (32KB) + Core L#92 + PU L#92 (P#92)
        L2 L#93 (1024KB) + L1d L#93 (32KB) + L1i L#93 (32KB) + Core L#93 + PU L#93 (P#93)
        L2 L#94 (1024KB) + L1d L#94 (32KB) + L1i L#94 (32KB) + Core L#94 + PU L#94 (P#94)
        L2 L#95 (1024KB) + L1d L#95 (32KB) + L1i L#95 (32KB) + Core L#95 + PU L#95 (P#95)
      HostBridge
        PCIBridge
          PCI 31:00.0 (NVMExp)
            Block(Disk) "nvme2n1"
  Package L#1
    Group0 L#4
      NUMANode L#4 (P#4 31GB)
      L3 L#12 (32MB)
        L2 L#96 (1024KB) + L1d L#96 (32KB) + L1i L#96 (32KB) + Core L#96 + PU L#96 (P#96)
        L2 L#97 (1024KB) + L1d L#97 (32KB) + L1i L#97 (32KB) + Core L#97 + PU L#97 (P#97)
        L2 L#98 (1024KB) + L1d L#98 (32KB) + L1i L#98 (32KB) + Core L#98 + PU L#98 (P#98)
        L2 L#99 (1024KB) + L1d L#99 (32KB) + L1i L#99 (32KB) + Core L#99 + PU L#99 (P#99)
        L2 L#100 (1024KB) + L1d L#100 (32KB) + L1i L#100 (32KB) + Core L#100 + PU L#100 (P#100)
        L2 L#101 (1024KB) + L1d L#101 (32KB) + L1i L#101 (32KB) + Core L#101 + PU L#101 (P#101)
        L2 L#102 (1024KB) + L1d L#102 (32KB) + L1i L#102 (32KB) + Core L#102 + PU L#102 (P#102)
        L2 L#103 (1024KB) + L1d L#103 (32KB) + L1i L#103 (32KB) + Core L#103 + PU L#103 (P#103)
      L3 L#13 (32MB)
        L2 L#104 (1024KB) + L1d L#104 (32KB) + L1i L#104 (32KB) + Core L#104 + PU L#104 (P#104)
        L2 L#105 (1024KB) + L1d L#105 (32KB) + L1i L#105 (32KB) + Core L#105 + PU L#105 (P#105)
        L2 L#106 (1024KB) + L1d L#106 (32KB) + L1i L#106 (32KB) + Core L#106 + PU L#106 (P#106)
        L2 L#107 (1024KB) + L1d L#107 (32KB) + L1i L#107 (32KB) + Core L#107 + PU L#107 (P#107)
        L2 L#108 (1024KB) + L1d L#108 (32KB) + L1i L#108 (32KB) + Core L#108 + PU L#108 (P#108)
        L2 L#109 (1024KB) + L1d L#109 (32KB) + L1i L#109 (32KB) + Core L#109 + PU L#109 (P#109)
        L2 L#110 (1024KB) + L1d L#110 (32KB) + L1i L#110 (32KB) + Core L#110 + PU L#110 (P#110)
        L2 L#111 (1024KB) + L1d L#111 (32KB) + L1i L#111 (32KB) + Core L#111 + PU L#111 (P#111)
      L3 L#14 (32MB)
        L2 L#112 (1024KB) + L1d L#112 (32KB) + L1i L#112 (32KB) + Core L#112 + PU L#112 (P#112)
        L2 L#113 (1024KB) + L1d L#113 (32KB) + L1i L#113 (32KB) + Core L#113 + PU L#113 (P#113)
        L2 L#114 (1024KB) + L1d L#114 (32KB) + L1i L#114 (32KB) + Core L#114 + PU L#114 (P#114)
        L2 L#115 (1024KB) + L1d L#115 (32KB) + L1i L#115 (32KB) + Core L#115 + PU L#115 (P#115)
        L2 L#116 (1024KB) + L1d L#116 (32KB) + L1i L#116 (32KB) + Core L#116 + PU L#116 (P#116)
        L2 L#117 (1024KB) + L1d L#117 (32KB) + L1i L#117 (32KB) + Core L#117 + PU L#117 (P#117)
        L2 L#118 (1024KB) + L1d L#118 (32KB) + L1i L#118 (32KB) + Core L#118 + PU L#118 (P#118)
        L2 L#119 (1024KB) + L1d L#119 (32KB) + L1i L#119 (32KB) + Core L#119 + PU L#119 (P#119)
      HostBridge
        PCIBridge
          PCI 41:00.0 (InfiniBand)
            Net "ibp65s0"
            OpenFabrics "mlx5_0"
    Group0 L#5
      NUMANode L#5 (P#5 31GB)
      L3 L#15 (32MB)
        L2 L#120 (1024KB) + L1d L#120 (32KB) + L1i L#120 (32KB) + Core L#120 + PU L#120 (P#120)
        L2 L#121 (1024KB) + L1d L#121 (32KB) + L1i L#121 (32KB) + Core L#121 + PU L#121 (P#121)
        L2 L#122 (1024KB) + L1d L#122 (32KB) + L1i L#122 (32KB) + Core L#122 + PU L#122 (P#122)
        L2 L#123 (1024KB) + L1d L#123 (32KB) + L1i L#123 (32KB) + Core L#123 + PU L#123 (P#123)
        L2 L#124 (1024KB) + L1d L#124 (32KB) + L1i L#124 (32KB) + Core L#124 + PU L#124 (P#124)
        L2 L#125 (1024KB) + L1d L#125 (32KB) + L1i L#125 (32KB) + Core L#125 + PU L#125 (P#125)
        L2 L#126 (1024KB) + L1d L#126 (32KB) + L1i L#126 (32KB) + Core L#126 + PU L#126 (P#126)
        L2 L#127 (1024KB) + L1d L#127 (32KB) + L1i L#127 (32KB) + Core L#127 + PU L#127 (P#127)
      L3 L#16 (32MB)
        L2 L#128 (1024KB) + L1d L#128 (32KB) + L1i L#128 (32KB) + Core L#128 + PU L#128 (P#128)
        L2 L#129 (1024KB) + L1d L#129 (32KB) + L1i L#129 (32KB) + Core L#129 + PU L#129 (P#129)
        L2 L#130 (1024KB) + L1d L#130 (32KB) + L1i L#130 (32KB) + Core L#130 + PU L#130 (P#130)
        L2 L#131 (1024KB) + L1d L#131 (32KB) + L1i L#131 (32KB) + Core L#131 + PU L#131 (P#131)
        L2 L#132 (1024KB) + L1d L#132 (32KB) + L1i L#132 (32KB) + Core L#132 + PU L#132 (P#132)
        L2 L#133 (1024KB) + L1d L#133 (32KB) + L1i L#133 (32KB) + Core L#133 + PU L#133 (P#133)
        L2 L#134 (1024KB) + L1d L#134 (32KB) + L1i L#134 (32KB) + Core L#134 + PU L#134 (P#134)
        L2 L#135 (1024KB) + L1d L#135 (32KB) + L1i L#135 (32KB) + Core L#135 + PU L#135 (P#135)
      L3 L#17 (32MB)
        L2 L#136 (1024KB) + L1d L#136 (32KB) + L1i L#136 (32KB) + Core L#136 + PU L#136 (P#136)
        L2 L#137 (1024KB) + L1d L#137 (32KB) + L1i L#137 (32KB) + Core L#137 + PU L#137 (P#137)
        L2 L#138 (1024KB) + L1d L#138 (32KB) + L1i L#138 (32KB) + Core L#138 + PU L#138 (P#138)
        L2 L#139 (1024KB) + L1d L#139 (32KB) + L1i L#139 (32KB) + Core L#139 + PU L#139 (P#139)
        L2 L#140 (1024KB) + L1d L#140 (32KB) + L1i L#140 (32KB) + Core L#140 + PU L#140 (P#140)
        L2 L#141 (1024KB) + L1d L#141 (32KB) + L1i L#141 (32KB) + Core L#141 + PU L#141 (P#141)
        L2 L#142 (1024KB) + L1d L#142 (32KB) + L1i L#142 (32KB) + Core L#142 + PU L#142 (P#142)
        L2 L#143 (1024KB) + L1d L#143 (32KB) + L1i L#143 (32KB) + Core L#143 + PU L#143 (P#143)
      HostBridge
        PCIBridge
          PCI 51:00.0 (SCSI)
          PCI 51:00.1 (Ethernet)
            Net "ens1f1"
          PCI 51:00.2 (SCSI)
    Group0 L#6
      NUMANode L#6 (P#6 31GB)
      L3 L#18 (32MB)
        L2 L#144 (1024KB) + L1d L#144 (32KB) + L1i L#144 (32KB) + Core L#144 + PU L#144 (P#144)
        L2 L#145 (1024KB) + L1d L#145 (32KB) + L1i L#145 (32KB) + Core L#145 + PU L#145 (P#145)
        L2 L#146 (1024KB) + L1d L#146 (32KB) + L1i L#146 (32KB) + Core L#146 + PU L#146 (P#146)
        L2 L#147 (1024KB) + L1d L#147 (32KB) + L1i L#147 (32KB) + Core L#147 + PU L#147 (P#147)
        L2 L#148 (1024KB) + L1d L#148 (32KB) + L1i L#148 (32KB) + Core L#148 + PU L#148 (P#148)
        L2 L#149 (1024KB) + L1d L#149 (32KB) + L1i L#149 (32KB) + Core L#149 + PU L#149 (P#149)
        L2 L#150 (1024KB) + L1d L#150 (32KB) + L1i L#150 (32KB) + Core L#150 + PU L#150 (P#150)
        L2 L#151 (1024KB) + L1d L#151 (32KB) + L1i L#151 (32KB) + Core L#151 + PU L#151 (P#151)
      L3 L#19 (32MB)
        L2 L#152 (1024KB) + L1d L#152 (32KB) + L1i L#152 (32KB) + Core L#152 + PU L#152 (P#152)
        L2 L#153 (1024KB) + L1d L#153 (32KB) + L1i L#153 (32KB) + Core L#153 + PU L#153 (P#153)
        L2 L#154 (1024KB) + L1d L#154 (32KB) + L1i L#154 (32KB) + Core L#154 + PU L#154 (P#154)
        L2 L#155 (1024KB) + L1d L#155 (32KB) + L1i L#155 (32KB) + Core L#155 + PU L#155 (P#155)
        L2 L#156 (1024KB) + L1d L#156 (32KB) + L1i L#156 (32KB) + Core L#156 + PU L#156 (P#156)
        L2 L#157 (1024KB) + L1d L#157 (32KB) + L1i L#157 (32KB) + Core L#157 + PU L#157 (P#157)
        L2 L#158 (1024KB) + L1d L#158 (32KB) + L1i L#158 (32KB) + Core L#158 + PU L#158 (P#158)
        L2 L#159 (1024KB) + L1d L#159 (32KB) + L1i L#159 (32KB) + Core L#159 + PU L#159 (P#159)
      L3 L#20 (32MB)
        L2 L#160 (1024KB) + L1d L#160 (32KB) + L1i L#160 (32KB) + Core L#160 + PU L#160 (P#160)
        L2 L#161 (1024KB) + L1d L#161 (32KB) + L1i L#161 (32KB) + Core L#161 + PU L#161 (P#161)
        L2 L#162 (1024KB) + L1d L#162 (32KB) + L1i L#162 (32KB) + Core L#162 + PU L#162 (P#162)
        L2 L#163 (1024KB) + L1d L#163 (32KB) + L1i L#163 (32KB) + Core L#163 + PU L#163 (P#163)
        L2 L#164 (1024KB) + L1d L#164 (32KB) + L1i L#164 (32KB) + Core L#164 + PU L#164 (P#164)
        L2 L#165 (1024KB) + L1d L#165 (32KB) + L1i L#165 (32KB) + Core L#165 + PU L#165 (P#165)
        L2 L#166 (1024KB) + L1d L#166 (32KB) + L1i L#166 (32KB) + Core L#166 + PU L#166 (P#166)
        L2 L#167 (1024KB) + L1d L#167 (32KB) + L1i L#167 (32KB) + Core L#167 + PU L#167 (P#167)
      HostBridge
        PCIBridge
          PCI 61:00.0 (NVMExp)
            Block(Disk) "nvme1n1"
    Group0 L#7
      NUMANode L#7 (P#7 31GB)
      L3 L#21 (32MB)
        L2 L#168 (1024KB) + L1d L#168 (32KB) + L1i L#168 (32KB) + Core L#168 + PU L#168 (P#168)
        L2 L#169 (1024KB) + L1d L#169 (32KB) + L1i L#169 (32KB) + Core L#169 + PU L#169 (P#169)
        L2 L#170 (1024KB) + L1d L#170 (32KB) + L1i L#170 (32KB) + Core L#170 + PU L#170 (P#170)
        L2 L#171 (1024KB) + L1d L#171 (32KB) + L1i L#171 (32KB) + Core L#171 + PU L#171 (P#171)
        L2 L#172 (1024KB) + L1d L#172 (32KB) + L1i L#172 (32KB) + Core L#172 + PU L#172 (P#172)
        L2 L#173 (1024KB) + L1d L#173 (32KB) + L1i L#173 (32KB) + Core L#173 + PU L#173 (P#173)
        L2 L#174 (1024KB) + L1d L#174 (32KB) + L1i L#174 (32KB) + Core L#174 + PU L#174 (P#174)
        L2 L#175 (1024KB) + L1d L#175 (32KB) + L1i L#175 (32KB) + Core L#175 + PU L#175 (P#175)
      L3 L#22 (32MB)
        L2 L#176 (1024KB) + L1d L#176 (32KB) + L1i L#176 (32KB) + Core L#176 + PU L#176 (P#176)
        L2 L#177 (1024KB) + L1d L#177 (32KB) + L1i L#177 (32KB) + Core L#177 + PU L#177 (P#177)
        L2 L#178 (1024KB) + L1d L#178 (32KB) + L1i L#178 (32KB) + Core L#178 + PU L#178 (P#178)
        L2 L#179 (1024KB) + L1d L#179 (32KB) + L1i L#179 (32KB) + Core L#179 + PU L#179 (P#179)
        L2 L#180 (1024KB) + L1d L#180 (32KB) + L1i L#180 (32KB) + Core L#180 + PU L#180 (P#180)
        L2 L#181 (1024KB) + L1d L#181 (32KB) + L1i L#181 (32KB) + Core L#181 + PU L#181 (P#181)
        L2 L#182 (1024KB) + L1d L#182 (32KB) + L1i L#182 (32KB) + Core L#182 + PU L#182 (P#182)
        L2 L#183 (1024KB) + L1d L#183 (32KB) + L1i L#183 (32KB) + Core L#183 + PU L#183 (P#183)
      L3 L#23 (32MB)
        L2 L#184 (1024KB) + L1d L#184 (32KB) + L1i L#184 (32KB) + Core L#184 + PU L#184 (P#184)
        L2 L#185 (1024KB) + L1d L#185 (32KB) + L1i L#185 (32KB) + Core L#185 + PU L#185 (P#185)
        L2 L#186 (1024KB) + L1d L#186 (32KB) + L1i L#186 (32KB) + Core L#186 + PU L#186 (P#186)
        L2 L#187 (1024KB) + L1d L#187 (32KB) + L1i L#187 (32KB) + Core L#187 + PU L#187 (P#187)
        L2 L#188 (1024KB) + L1d L#188 (32KB) + L1i L#188 (32KB) + Core L#188 + PU L#188 (P#188)
        L2 L#189 (1024KB) + L1d L#189 (32KB) + L1i L#189 (32KB) + Core L#189 + PU L#189 (P#189)
        L2 L#190 (1024KB) + L1d L#190 (32KB) + L1i L#190 (32KB) + Core L#190 + PU L#190 (P#190)
        L2 L#191 (1024KB) + L1d L#191 (32KB) + L1i L#191 (32KB) + Core L#191 + PU L#191 (P#191)
      HostBridge
        PCIBridge
          PCI 71:00.0 (NVMExp)
            Block(Disk) "nvme4n1"
  Package L#2
    Group0 L#8
      NUMANode L#8 (P#8 31GB)
      L3 L#24 (32MB)
        L2 L#192 (1024KB) + L1d L#192 (32KB) + L1i L#192 (32KB) + Core L#192 + PU L#192 (P#192)
        L2 L#193 (1024KB) + L1d L#193 (32KB) + L1i L#193 (32KB) + Core L#193 + PU L#193 (P#193)
        L2 L#194 (1024KB) + L1d L#194 (32KB) + L1i L#194 (32KB) + Core L#194 + PU L#194 (P#194)
        L2 L#195 (1024KB) + L1d L#195 (32KB) + L1i L#195 (32KB) + Core L#195 + PU L#195 (P#195)
        L2 L#196 (1024KB) + L1d L#196 (32KB) + L1i L#196 (32KB) + Core L#196 + PU L#196 (P#196)
        L2 L#197 (1024KB) + L1d L#197 (32KB) + L1i L#197 (32KB) + Core L#197 + PU L#197 (P#197)
        L2 L#198 (1024KB) + L1d L#198 (32KB) + L1i L#198 (32KB) + Core L#198 + PU L#198 (P#198)
        L2 L#199 (1024KB) + L1d L#199 (32KB) + L1i L#199 (32KB) + Core L#199 + PU L#199 (P#199)
      L3 L#25 (32MB)
        L2 L#200 (1024KB) + L1d L#200 (32KB) + L1i L#200 (32KB) + Core L#200 + PU L#200 (P#200)
        L2 L#201 (1024KB) + L1d L#201 (32KB) + L1i L#201 (32KB) + Core L#201 + PU L#201 (P#201)
        L2 L#202 (1024KB) + L1d L#202 (32KB) + L1i L#202 (32KB) + Core L#202 + PU L#202 (P#202)
        L2 L#203 (1024KB) + L1d L#203 (32KB) + L1i L#203 (32KB) + Core L#203 + PU L#203 (P#203)
        L2 L#204 (1024KB) + L1d L#204 (32KB) + L1i L#204 (32KB) + Core L#204 + PU L#204 (P#204)
        L2 L#205 (1024KB) + L1d L#205 (32KB) + L1i L#205 (32KB) + Core L#205 + PU L#205 (P#205)
        L2 L#206 (1024KB) + L1d L#206 (32KB) + L1i L#206 (32KB) + Core L#206 + PU L#206 (P#206)
        L2 L#207 (1024KB) + L1d L#207 (32KB) + L1i L#207 (32KB) + Core L#207 + PU L#207 (P#207)
      L3 L#26 (32MB)
        L2 L#208 (1024KB) + L1d L#208 (32KB) + L1i L#208 (32KB) + Core L#208 + PU L#208 (P#208)
        L2 L#209 (1024KB) + L1d L#209 (32KB) + L1i L#209 (32KB) + Core L#209 + PU L#209 (P#209)
        L2 L#210 (1024KB) + L1d L#210 (32KB) + L1i L#210 (32KB) + Core L#210 + PU L#210 (P#210)
        L2 L#211 (1024KB) + L1d L#211 (32KB) + L1i L#211 (32KB) + Core L#211 + PU L#211 (P#211)
        L2 L#212 (1024KB) + L1d L#212 (32KB) + L1i L#212 (32KB) + Core L#212 + PU L#212 (P#212)
        L2 L#213 (1024KB) + L1d L#213 (32KB) + L1i L#213 (32KB) + Core L#213 + PU L#213 (P#213)
        L2 L#214 (1024KB) + L1d L#214 (32KB) + L1i L#214 (32KB) + Core L#214 + PU L#214 (P#214)
        L2 L#215 (1024KB) + L1d L#215 (32KB) + L1i L#215 (32KB) + Core L#215 + PU L#215 (P#215)
      HostBridge
        PCIBridge
          PCI 81:00.0 (InfiniBand)
            Net "ibp129s0"
            OpenFabrics "mlx5_1"
    Group0 L#9
      NUMANode L#9 (P#9 31GB)
      L3 L#27 (32MB)
        L2 L#216 (1024KB) + L1d L#216 (32KB) + L1i L#216 (32KB) + Core L#216 + PU L#216 (P#216)
        L2 L#217 (1024KB) + L1d L#217 (32KB) + L1i L#217 (32KB) + Core L#217 + PU L#217 (P#217)
        L2 L#218 (1024KB) + L1d L#218 (32KB) + L1i L#218 (32KB) + Core L#218 + PU L#218 (P#218)
        L2 L#219 (1024KB) + L1d L#219 (32KB) + L1i L#219 (32KB) + Core L#219 + PU L#219 (P#219)
        L2 L#220 (1024KB) + L1d L#220 (32KB) + L1i L#220 (32KB) + Core L#220 + PU L#220 (P#220)
        L2 L#221 (1024KB) + L1d L#221 (32KB) + L1i L#221 (32KB) + Core L#221 + PU L#221 (P#221)
        L2 L#222 (1024KB) + L1d L#222 (32KB) + L1i L#222 (32KB) + Core L#222 + PU L#222 (P#222)
        L2 L#223 (1024KB) + L1d L#223 (32KB) + L1i L#223 (32KB) + Core L#223 + PU L#223 (P#223)
      L3 L#28 (32MB)
        L2 L#224 (1024KB) + L1d L#224 (32KB) + L1i L#224 (32KB) + Core L#224 + PU L#224 (P#224)
        L2 L#225 (1024KB) + L1d L#225 (32KB) + L1i L#225 (32KB) + Core L#225 + PU L#225 (P#225)
        L2 L#226 (1024KB) + L1d L#226 (32KB) + L1i L#226 (32KB) + Core L#226 + PU L#226 (P#226)
        L2 L#227 (1024KB) + L1d L#227 (32KB) + L1i L#227 (32KB) + Core L#227 + PU L#227 (P#227)
        L2 L#228 (1024KB) + L1d L#228 (32KB) + L1i L#228 (32KB) + Core L#228 + PU L#228 (P#228)
        L2 L#229 (1024KB) + L1d L#229 (32KB) + L1i L#229 (32KB) + Core L#229 + PU L#229 (P#229)
        L2 L#230 (1024KB) + L1d L#230 (32KB) + L1i L#230 (32KB) + Core L#230 + PU L#230 (P#230)
        L2 L#231 (1024KB) + L1d L#231 (32KB) + L1i L#231 (32KB) + Core L#231 + PU L#231 (P#231)
      L3 L#29 (32MB)
        L2 L#232 (1024KB) + L1d L#232 (32KB) + L1i L#232 (32KB) + Core L#232 + PU L#232 (P#232)
        L2 L#233 (1024KB) + L1d L#233 (32KB) + L1i L#233 (32KB) + Core L#233 + PU L#233 (P#233)
        L2 L#234 (1024KB) + L1d L#234 (32KB) + L1i L#234 (32KB) + Core L#234 + PU L#234 (P#234)
        L2 L#235 (1024KB) + L1d L#235 (32KB) + L1i L#235 (32KB) + Core L#235 + PU L#235 (P#235)
        L2 L#236 (1024KB) + L1d L#236 (32KB) + L1i L#236 (32KB) + Core L#236 + PU L#236 (P#236)
        L2 L#237 (1024KB) + L1d L#237 (32KB) + L1i L#237 (32KB) + Core L#237 + PU L#237 (P#237)
        L2 L#238 (1024KB) + L1d L#238 (32KB) + L1i L#238 (32KB) + Core L#238 + PU L#238 (P#238)
        L2 L#239 (1024KB) + L1d L#239 (32KB) + L1i L#239 (32KB) + Core L#239 + PU L#239 (P#239)
    Group0 L#10
      NUMANode L#10 (P#10 31GB)
      L3 L#30 (32MB)
        L2 L#240 (1024KB) + L1d L#240 (32KB) + L1i L#240 (32KB) + Core L#240 + PU L#240 (P#240)
        L2 L#241 (1024KB) + L1d L#241 (32KB) + L1i L#241 (32KB) + Core L#241 + PU L#241 (P#241)
        L2 L#242 (1024KB) + L1d L#242 (32KB) + L1i L#242 (32KB) + Core L#242 + PU L#242 (P#242)
        L2 L#243 (1024KB) + L1d L#243 (32KB) + L1i L#243 (32KB) + Core L#243 + PU L#243 (P#243)
        L2 L#244 (1024KB) + L1d L#244 (32KB) + L1i L#244 (32KB) + Core L#244 + PU L#244 (P#244)
        L2 L#245 (1024KB) + L1d L#245 (32KB) + L1i L#245 (32KB) + Core L#245 + PU L#245 (P#245)
        L2 L#246 (1024KB) + L1d L#246 (32KB) + L1i L#246 (32KB) + Core L#246 + PU L#246 (P#246)
        L2 L#247 (1024KB) + L1d L#247 (32KB) + L1i L#247 (32KB) + Core L#247 + PU L#247 (P#247)
      L3 L#31 (32MB)
        L2 L#248 (1024KB) + L1d L#248 (32KB) + L1i L#248 (32KB) + Core L#248 + PU L#248 (P#248)
        L2 L#249 (1024KB) + L1d L#249 (32KB) + L1i L#249 (32KB) + Core L#249 + PU L#249 (P#249)
        L2 L#250 (1024KB) + L1d L#250 (32KB) + L1i L#250 (32KB) + Core L#250 + PU L#250 (P#250)
        L2 L#251 (1024KB) + L1d L#251 (32KB) + L1i L#251 (32KB) + Core L#251 + PU L#251 (P#251)
        L2 L#252 (1024KB) + L1d L#252 (32KB) + L1i L#252 (32KB) + Core L#252 + PU L#252 (P#252)
        L2 L#253 (1024KB) + L1d L#253 (32KB) + L1i L#253 (32KB) + Core L#253 + PU L#253 (P#253)
        L2 L#254 (1024KB) + L1d L#254 (32KB) + L1i L#254 (32KB) + Core L#254 + PU L#254 (P#254)
        L2 L#255 (1024KB) + L1d L#255 (32KB) + L1i L#255 (32KB) + Core L#255 + PU L#255 (P#255)
      L3 L#32 (32MB)
        L2 L#256 (1024KB) + L1d L#256 (32KB) + L1i L#256 (32KB) + Core L#256 + PU L#256 (P#256)
        L2 L#257 (1024KB) + L1d L#257 (32KB) + L1i L#257 (32KB) + Core L#257 + PU L#257 (P#257)
        L2 L#258 (1024KB) + L1d L#258 (32KB) + L1i L#258 (32KB) + Core L#258 + PU L#258 (P#258)
        L2 L#259 (1024KB) + L1d L#259 (32KB) + L1i L#259 (32KB) + Core L#259 + PU L#259 (P#259)
        L2 L#260 (1024KB) + L1d L#260 (32KB) + L1i L#260 (32KB) + Core L#260 + PU L#260 (P#260)
        L2 L#261 (1024KB) + L1d L#261 (32KB) + L1i L#261 (32KB) + Core L#261 + PU L#261 (P#261)
        L2 L#262 (1024KB) + L1d L#262 (32KB) + L1i L#262 (32KB) + Core L#262 + PU L#262 (P#262)
        L2 L#263 (1024KB) + L1d L#263 (32KB) + L1i L#263 (32KB) + Core L#263 + PU L#263 (P#263)
      HostBridge
        PCIBridge
          PCI a1:00.0 (NVMExp)
            Block(Disk) "nvme3n1"
    Group0 L#11
      NUMANode L#11 (P#11 31GB)
      L3 L#33 (32MB)
        L2 L#264 (1024KB) + L1d L#264 (32KB) + L1i L#264 (32KB) + Core L#264 + PU L#264 (P#264)
        L2 L#265 (1024KB) + L1d L#265 (32KB) + L1i L#265 (32KB) + Core L#265 + PU L#265 (P#265)
        L2 L#266 (1024KB) + L1d L#266 (32KB) + L1i L#266 (32KB) + Core L#266 + PU L#266 (P#266)
        L2 L#267 (1024KB) + L1d L#267 (32KB) + L1i L#267 (32KB) + Core L#267 + PU L#267 (P#267)
        L2 L#268 (1024KB) + L1d L#268 (32KB) + L1i L#268 (32KB) + Core L#268 + PU L#268 (P#268)
        L2 L#269 (1024KB) + L1d L#269 (32KB) + L1i L#269 (32KB) + Core L#269 + PU L#269 (P#269)
        L2 L#270 (1024KB) + L1d L#270 (32KB) + L1i L#270 (32KB) + Core L#270 + PU L#270 (P#270)
        L2 L#271 (1024KB) + L1d L#271 (32KB) + L1i L#271 (32KB) + Core L#271 + PU L#271 (P#271)
      L3 L#34 (32MB)
        L2 L#272 (1024KB) + L1d L#272 (32KB) + L1i L#272 (32KB) + Core L#272 + PU L#272 (P#272)
        L2 L#273 (1024KB) + L1d L#273 (32KB) + L1i L#273 (32KB) + Core L#273 + PU L#273 (P#273)
        L2 L#274 (1024KB) + L1d L#274 (32KB) + L1i L#274 (32KB) + Core L#274 + PU L#274 (P#274)
        L2 L#275 (1024KB) + L1d L#275 (32KB) + L1i L#275 (32KB) + Core L#275 + PU L#275 (P#275)
        L2 L#276 (1024KB) + L1d L#276 (32KB) + L1i L#276 (32KB) + Core L#276 + PU L#276 (P#276)
        L2 L#277 (1024KB) + L1d L#277 (32KB) + L1i L#277 (32KB) + Core L#277 + PU L#277 (P#277)
        L2 L#278 (1024KB) + L1d L#278 (32KB) + L1i L#278 (32KB) + Core L#278 + PU L#278 (P#278)
        L2 L#279 (1024KB) + L1d L#279 (32KB) + L1i L#279 (32KB) + Core L#279 + PU L#279 (P#279)
      L3 L#35 (32MB)
        L2 L#280 (1024KB) + L1d L#280 (32KB) + L1i L#280 (32KB) + Core L#280 + PU L#280 (P#280)
        L2 L#281 (1024KB) + L1d L#281 (32KB) + L1i L#281 (32KB) + Core L#281 + PU L#281 (P#281)
        L2 L#282 (1024KB) + L1d L#282 (32KB) + L1i L#282 (32KB) + Core L#282 + PU L#282 (P#282)
        L2 L#283 (1024KB) + L1d L#283 (32KB) + L1i L#283 (32KB) + Core L#283 + PU L#283 (P#283)
        L2 L#284 (1024KB) + L1d L#284 (32KB) + L1i L#284 (32KB) + Core L#284 + PU L#284 (P#284)
        L2 L#285 (1024KB) + L1d L#285 (32KB) + L1i L#285 (32KB) + Core L#285 + PU L#285 (P#285)
        L2 L#286 (1024KB) + L1d L#286 (32KB) + L1i L#286 (32KB) + Core L#286 + PU L#286 (P#286)
        L2 L#287 (1024KB) + L1d L#287 (32KB) + L1i L#287 (32KB) + Core L#287 + PU L#287 (P#287)
      HostBridge
        PCIBridge
          PCI b1:00.0 (NVMExp)
            Block(Disk) "nvme6n1"
  Package L#3
    Group0 L#12
      NUMANode L#12 (P#12 31GB)
      L3 L#36 (32MB)
        L2 L#288 (1024KB) + L1d L#288 (32KB) + L1i L#288 (32KB) + Core L#288 + PU L#288 (P#288)
        L2 L#289 (1024KB) + L1d L#289 (32KB) + L1i L#289 (32KB) + Core L#289 + PU L#289 (P#289)
        L2 L#290 (1024KB) + L1d L#290 (32KB) + L1i L#290 (32KB) + Core L#290 + PU L#290 (P#290)
        L2 L#291 (1024KB) + L1d L#291 (32KB) + L1i L#291 (32KB) + Core L#291 + PU L#291 (P#291)
        L2 L#292 (1024KB) + L1d L#292 (32KB) + L1i L#292 (32KB) + Core L#292 + PU L#292 (P#292)
        L2 L#293 (1024KB) + L1d L#293 (32KB) + L1i L#293 (32KB) + Core L#293 + PU L#293 (P#293)
        L2 L#294 (1024KB) + L1d L#294 (32KB) + L1i L#294 (32KB) + Core L#294 + PU L#294 (P#294)
        L2 L#295 (1024KB) + L1d L#295 (32KB) + L1i L#295 (32KB) + Core L#295 + PU L#295 (P#295)
      L3 L#37 (32MB)
        L2 L#296 (1024KB) + L1d L#296 (32KB) + L1i L#296 (32KB) + Core L#296 + PU L#296 (P#296)
        L2 L#297 (1024KB) + L1d L#297 (32KB) + L1i L#297 (32KB) + Core L#297 + PU L#297 (P#297)
        L2 L#298 (1024KB) + L1d L#298 (32KB) + L1i L#298 (32KB) + Core L#298 + PU L#298 (P#298)
        L2 L#299 (1024KB) + L1d L#299 (32KB) + L1i L#299 (32KB) + Core L#299 + PU L#299 (P#299)
        L2 L#300 (1024KB) + L1d L#300 (32KB) + L1i L#300 (32KB) + Core L#300 + PU L#300 (P#300)
        L2 L#301 (1024KB) + L1d L#301 (32KB) + L1i L#301 (32KB) + Core L#301 + PU L#301 (P#301)
        L2 L#302 (1024KB) + L1d L#302 (32KB) + L1i L#302 (32KB) + Core L#302 + PU L#302 (P#302)
        L2 L#303 (1024KB) + L1d L#303 (32KB) + L1i L#303 (32KB) + Core L#303 + PU L#303 (P#303)
      L3 L#38 (32MB)
        L2 L#304 (1024KB) + L1d L#304 (32KB) + L1i L#304 (32KB) + Core L#304 + PU L#304 (P#304)
        L2 L#305 (1024KB) + L1d L#305 (32KB) + L1i L#305 (32KB) + Core L#305 + PU L#305 (P#305)
        L2 L#306 (1024KB) + L1d L#306 (32KB) + L1i L#306 (32KB) + Core L#306 + PU L#306 (P#306)
        L2 L#307 (1024KB) + L1d L#307 (32KB) + L1i L#307 (32KB) + Core L#307 + PU L#307 (P#307)
        L2 L#308 (1024KB) + L1d L#308 (32KB) + L1i L#308 (32KB) + Core L#308 + PU L#308 (P#308)
        L2 L#309 (1024KB) + L1d L#309 (32KB) + L1i L#309 (32KB) + Core L#309 + PU L#309 (P#309)
        L2 L#310 (1024KB) + L1d L#310 (32KB) + L1i L#310 (32KB) + Core L#310 + PU L#310 (P#310)
        L2 L#311 (1024KB) + L1d L#311 (32KB) + L1i L#311 (32KB) + Core L#311 + PU L#311 (P#311)
      HostBridge
        PCIBridge
          PCI c1:00.0 (InfiniBand)
            Net "ibp193s0"
            OpenFabrics "mlx5_2"
    Group0 L#13
      NUMANode L#13 (P#13 31GB)
      L3 L#39 (32MB)
        L2 L#312 (1024KB) + L1d L#312 (32KB) + L1i L#312 (32KB) + Core L#312 + PU L#312 (P#312)
        L2 L#313 (1024KB) + L1d L#313 (32KB) + L1i L#313 (32KB) + Core L#313 + PU L#313 (P#313)
        L2 L#314 (1024KB) + L1d L#314 (32KB) + L1i L#314 (32KB) + Core L#314 + PU L#314 (P#314)
        L2 L#315 (1024KB) + L1d L#315 (32KB) + L1i L#315 (32KB) + Core L#315 + PU L#315 (P#315)
        L2 L#316 (1024KB) + L1d L#316 (32KB) + L1i L#316 (32KB) + Core L#316 + PU L#316 (P#316)
        L2 L#317 (1024KB) + L1d L#317 (32KB) + L1i L#317 (32KB) + Core L#317 + PU L#317 (P#317)
        L2 L#318 (1024KB) + L1d L#318 (32KB) + L1i L#318 (32KB) + Core L#318 + PU L#318 (P#318)
        L2 L#319 (1024KB) + L1d L#319 (32KB) + L1i L#319 (32KB) + Core L#319 + PU L#319 (P#319)
      L3 L#40 (32MB)
        L2 L#320 (1024KB) + L1d L#320 (32KB) + L1i L#320 (32KB) + Core L#320 + PU L#320 (P#320)
        L2 L#321 (1024KB) + L1d L#321 (32KB) + L1i L#321 (32KB) + Core L#321 + PU L#321 (P#321)
        L2 L#322 (1024KB) + L1d L#322 (32KB) + L1i L#322 (32KB) + Core L#322 + PU L#322 (P#322)
        L2 L#323 (1024KB) + L1d L#323 (32KB) + L1i L#323 (32KB) + Core L#323 + PU L#323 (P#323)
        L2 L#324 (1024KB) + L1d L#324 (32KB) + L1i L#324 (32KB) + Core L#324 + PU L#324 (P#324)
        L2 L#325 (1024KB) + L1d L#325 (32KB) + L1i L#325 (32KB) + Core L#325 + PU L#325 (P#325)
        L2 L#326 (1024KB) + L1d L#326 (32KB) + L1i L#326 (32KB) + Core L#326 + PU L#326 (P#326)
        L2 L#327 (1024KB) + L1d L#327 (32KB) + L1i L#327 (32KB) + Core L#327 + PU L#327 (P#327)
      L3 L#41 (32MB)
        L2 L#328 (1024KB) + L1d L#328 (32KB) + L1i L#328 (32KB) + Core L#328 + PU L#328 (P#328)
        L2 L#329 (1024KB) + L1d L#329 (32KB) + L1i L#329 (32KB) + Core L#329 + PU L#329 (P#329)
        L2 L#330 (1024KB) + L1d L#330 (32KB) + L1i L#330 (32KB) + Core L#330 + PU L#330 (P#330)
        L2 L#331 (1024KB) + L1d L#331 (32KB) + L1i L#331 (32KB) + Core L#331 + PU L#331 (P#331)
        L2 L#332 (1024KB) + L1d L#332 (32KB) + L1i L#332 (32KB) + Core L#332 + PU L#332 (P#332)
        L2 L#333 (1024KB) + L1d L#333 (32KB) + L1i L#333 (32KB) + Core L#333 + PU L#333 (P#333)
        L2 L#334 (1024KB) + L1d L#334 (32KB) + L1i L#334 (32KB) + Core L#334 + PU L#334 (P#334)
        L2 L#335 (1024KB) + L1d L#335 (32KB) + L1i L#335 (32KB) + Core L#335 + PU L#335 (P#335)
    Group0 L#14
      NUMANode L#14 (P#14 31GB)
      L3 L#42 (32MB)
        L2 L#336 (1024KB) + L1d L#336 (32KB) + L1i L#336 (32KB) + Core L#336 + PU L#336 (P#336)
        L2 L#337 (1024KB) + L1d L#337 (32KB) + L1i L#337 (32KB) + Core L#337 + PU L#337 (P#337)
        L2 L#338 (1024KB) + L1d L#338 (32KB) + L1i L#338 (32KB) + Core L#338 + PU L#338 (P#338)
        L2 L#339 (1024KB) + L1d L#339 (32KB) + L1i L#339 (32KB) + Core L#339 + PU L#339 (P#339)
        L2 L#340 (1024KB) + L1d L#340 (32KB) + L1i L#340 (32KB) + Core L#340 + PU L#340 (P#340)
        L2 L#341 (1024KB) + L1d L#341 (32KB) + L1i L#341 (32KB) + Core L#341 + PU L#341 (P#341)
        L2 L#342 (1024KB) + L1d L#342 (32KB) + L1i L#342 (32KB) + Core L#342 + PU L#342 (P#342)
        L2 L#343 (1024KB) + L1d L#343 (32KB) + L1i L#343 (32KB) + Core L#343 + PU L#343 (P#343)
      L3 L#43 (32MB)
        L2 L#344 (1024KB) + L1d L#344 (32KB) + L1i L#344 (32KB) + Core L#344 + PU L#344 (P#344)
        L2 L#345 (1024KB) + L1d L#345 (32KB) + L1i L#345 (32KB) + Core L#345 + PU L#345 (P#345)
        L2 L#346 (1024KB) + L1d L#346 (32KB) + L1i L#346 (32KB) + Core L#346 + PU L#346 (P#346)
        L2 L#347 (1024KB) + L1d L#347 (32KB) + L1i L#347 (32KB) + Core L#347 + PU L#347 (P#347)
        L2 L#348 (1024KB) + L1d L#348 (32KB) + L1i L#348 (32KB) + Core L#348 + PU L#348 (P#348)
        L2 L#349 (1024KB) + L1d L#349 (32KB) + L1i L#349 (32KB) + Core L#349 + PU L#349 (P#349)
        L2 L#350 (1024KB) + L1d L#350 (32KB) + L1i L#350 (32KB) + Core L#350 + PU L#350 (P#350)
        L2 L#351 (1024KB) + L1d L#351 (32KB) + L1i L#351 (32KB) + Core L#351 + PU L#351 (P#351)
      L3 L#44 (32MB)
        L2 L#352 (1024KB) + L1d L#352 (32KB) + L1i L#352 (32KB) + Core L#352 + PU L#352 (P#352)
        L2 L#353 (1024KB) + L1d L#353 (32KB) + L1i L#353 (32KB) + Core L#353 + PU L#353 (P#353)
        L2 L#354 (1024KB) + L1d L#354 (32KB) + L1i L#354 (32KB) + Core L#354 + PU L#354 (P#354)
        L2 L#355 (1024KB) + L1d L#355 (32KB) + L1i L#355 (32KB) + Core L#355 + PU L#355 (P#355)
        L2 L#356 (1024KB) + L1d L#356 (32KB) + L1i L#356 (32KB) + Core L#356 + PU L#356 (P#356)
        L2 L#357 (1024KB) + L1d L#357 (32KB) + L1i L#357 (32KB) + Core L#357 + PU L#357 (P#357)
        L2 L#358 (1024KB) + L1d L#358 (32KB) + L1i L#358 (32KB) + Core L#358 + PU L#358 (P#358)
        L2 L#359 (1024KB) + L1d L#359 (32KB) + L1i L#359 (32KB) + Core L#359 + PU L#359 (P#359)
      HostBridge
        PCIBridge
          PCI e1:00.0 (NVMExp)
            Block(Disk) "nvme9n1"
    Group0 L#15
      NUMANode L#15 (P#15 31GB)
      L3 L#45 (32MB)
        L2 L#360 (1024KB) + L1d L#360 (32KB) + L1i L#360 (32KB) + Core L#360 + PU L#360 (P#360)
        L2 L#361 (1024KB) + L1d L#361 (32KB) + L1i L#361 (32KB) + Core L#361 + PU L#361 (P#361)
        L2 L#362 (1024KB) + L1d L#362 (32KB) + L1i L#362 (32KB) + Core L#362 + PU L#362 (P#362)
        L2 L#363 (1024KB) + L1d L#363 (32KB) + L1i L#363 (32KB) + Core L#363 + PU L#363 (P#363)
        L2 L#364 (1024KB) + L1d L#364 (32KB) + L1i L#364 (32KB) + Core L#364 + PU L#364 (P#364)
        L2 L#365 (1024KB) + L1d L#365 (32KB) + L1i L#365 (32KB) + Core L#365 + PU L#365 (P#365)
        L2 L#366 (1024KB) + L1d L#366 (32KB) + L1i L#366 (32KB) + Core L#366 + PU L#366 (P#366)
        L2 L#367 (1024KB) + L1d L#367 (32KB) + L1i L#367 (32KB) + Core L#367 + PU L#367 (P#367)
      L3 L#46 (32MB)
        L2 L#368 (1024KB) + L1d L#368 (32KB) + L1i L#368 (32KB) + Core L#368 + PU L#368 (P#368)
        L2 L#369 (1024KB) + L1d L#369 (32KB) + L1i L#369 (32KB) + Core L#369 + PU L#369 (P#369)
        L2 L#370 (1024KB) + L1d L#370 (32KB) + L1i L#370 (32KB) + Core L#370 + PU L#370 (P#370)
        L2 L#371 (1024KB) + L1d L#371 (32KB) + L1i L#371 (32KB) + Core L#371 + PU L#371 (P#371)
        L2 L#372 (1024KB) + L1d L#372 (32KB) + L1i L#372 (32KB) + Core L#372 + PU L#372 (P#372)
        L2 L#373 (1024KB) + L1d L#373 (32KB) + L1i L#373 (32KB) + Core L#373 + PU L#373 (P#373)
        L2 L#374 (1024KB) + L1d L#374 (32KB) + L1i L#374 (32KB) + Core L#374 + PU L#374 (P#374)
        L2 L#375 (1024KB) + L1d L#375 (32KB) + L1i L#375 (32KB) + Core L#375 + PU L#375 (P#375)
      L3 L#47 (32MB)
        L2 L#376 (1024KB) + L1d L#376 (32KB) + L1i L#376 (32KB) + Core L#376 + PU L#376 (P#376)
        L2 L#377 (1024KB) + L1d L#377 (32KB) + L1i L#377 (32KB) + Core L#377 + PU L#377 (P#377)
        L2 L#378 (1024KB) + L1d L#378 (32KB) + L1i L#378 (32KB) + Core L#378 + PU L#378 (P#378)
        L2 L#379 (1024KB) + L1d L#379 (32KB) + L1i L#379 (32KB) + Core L#379 + PU L#379 (P#379)
        L2 L#380 (1024KB) + L1d L#380 (32KB) + L1i L#380 (32KB) + Core L#380 + PU L#380 (P#380)
        L2 L#381 (1024KB) + L1d L#381 (32KB) + L1i L#381 (32KB) + Core L#381 + PU L#381 (P#381)
        L2 L#382 (1024KB) + L1d L#382 (32KB) + L1i L#382 (32KB) + Core L#382 + PU L#382 (P#382)
        L2 L#383 (1024KB) + L1d L#383 (32KB) + L1i L#383 (32KB) + Core L#383 + PU L#383 (P#383)
      HostBridge
        PCIBridge
          PCI f1:00.0 (NVMExp)
            Block(Disk) "nvme7n1"

Unfortunately I cannot share the CPU specs publicly. However, this is a CPU with regular x86-based cores.

We were wondering if there is a way to limit UCX to using only the nearest NIC, without it spanning several NICs and without extra environment variables that may not be practical to set in general.
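
For reference, the kind of per-rank pinning we would like to avoid is a small wrapper script along these lines (a minimal sketch only: the NIC-to-package mapping is taken from the lstopo output above, and it assumes Open MPI's OMPI_COMM_WORLD_LOCAL_RANK with ranks mapped sequentially across the 4 packages of 96 cores each):

```
#!/bin/bash
# select_nic.sh -- hypothetical wrapper: pin each rank to the NIC attached to
# its own package (NIC names taken from the lstopo output above)
case $((OMPI_COMM_WORLD_LOCAL_RANK / 96)) in
  0) export UCX_NET_DEVICES=mlx5_3:1 ;;  # Package L#0
  1) export UCX_NET_DEVICES=mlx5_0:1 ;;  # Package L#1
  2) export UCX_NET_DEVICES=mlx5_1:1 ;;  # Package L#2
  3) export UCX_NET_DEVICES=mlx5_2:1 ;;  # Package L#3
esac
exec "$@"
```

Launched as, e.g., `mpirun -np 384 ... ./select_nic.sh ./app`, this works, but the mapping has to be maintained per machine type, which is exactly what we are hoping to avoid.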

Can you please take a look at the UCX logs (shared at the top of this report) and let us know whether our suspicion that more than one NIC is being used for transfers is correct?
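
This is how we check: a simple tally of the device names appearing in a UCX_LOG_LEVEL=data log (assuming the log was saved to ucx.log):

```
# count how often each mlx5 device name appears in the UCX data-level log
grep -oE 'mlx5_[0-9]+' ucx.log | sort | uniq -c
```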

arstgr avatar May 22 '25 16:05 arstgr

Hi @arun-chandran-edarath

Thank you very much for the suggestion. Our workloads mostly fully populate the entire node (in this case, 384 ranks per node), so this didn't help us (i.e. using a new build of UCX and setting UCX_NT_BUFFER_TRANSFER_MIN=0); it is excellent work nonetheless.

arstgr avatar May 22 '25 23:05 arstgr

@arstgr Thank you so much for trying it out. Yes, in its current form, NT_BUFFER_TRANSFER is set up to help hybrid MPI workloads (1 rank per L3 domain).

arun-chandran-edarath avatar May 26 '25 04:05 arun-chandran-edarath

> @arstgr Thank you so much for trying it out. Yes, in its current form, NT_BUFFER_TRANSFER is set up to help hybrid MPI workloads (1 rank per L3 domain).

I missed one important point: if the buffer size being transferred is more than three-fourths of the L3 cache size, NT_BUFFER_TRANSFER should also help full-rank MPI workloads.
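
(For the 32MB L3 domains shown in the lstopo output above, three-fourths works out to transfers of roughly 24MB and larger.)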

arun-chandran-edarath avatar May 27 '25 03:05 arun-chandran-edarath

Hi @arstgr,

> We were wondering if there is a way to limit UCX to using only the nearest NIC, without it spanning several NICs and without extra environment variables that may not be practical to set in general.

The main method of applying a workaround in UCX is setting environment variables to change the default behaviour. If that's not an option, we'll need to investigate this issue further and, if needed, provide a fix in the next release.

> Can you please take a look at the UCX logs (shared at the top of this report) and let us know whether our suspicion that more than one NIC is being used for transfers is correct?

That's correct: the number of NICs used for the transfer is determined by UCX_MAX_RNDV_RAILS, which has a default value of 2 for NDR.
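
One way to confirm the effective value on a given build (assuming ucx_info from the same UCX install is in PATH):

```
# print the UCX configuration and filter for the rendezvous multi-rail setting
ucx_info -c | grep RNDV_RAILS
```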

If you wish to investigate further, please send the MPI command line and the output results, so we can better understand the root issue.

shasson5 avatar May 28 '25 09:05 shasson5

Hi @shasson5

Thanks for looking into this issue. I think the default behavior for UCX_MAX_RNDV_RAILS should be to use either multiple virtual lanes within the same adapter, or multiple physical adapters only if they are on the same package. When that is not the case (i.e. multiple physical adapters on different packages are used), there is always a performance hit, as in our current test environment.

The MPI command line to reproduce this, along with the output results and the UCX log files, is listed at the top of this bug report.

arstgr avatar Jun 03 '25 19:06 arstgr

Hi @arstgr

There shouldn't be any performance hit as a result of using NICs on a remote package (as long as local NICs are prioritized).

The results you listed above are for osu_bibw. According to that output you get about 90,000 MB/s (~90 GB/s) when running with the default parameters (multiple NICs), and only about 50,000 MB/s (~50 GB/s) when running with 1 NIC.

So I cannot understand where exactly you see a degradation. Am I missing something?

shasson5 avatar Jun 26 '25 16:06 shasson5

Hi @arstgr, we need your response if the issue is still relevant.

gleon99 avatar Jul 06 '25 08:07 gleon99

No customer response, closing.

gleon99 avatar Jul 16 '25 09:07 gleon99