Under UCS_THREAD_MODE_SINGLE, pthread_spin_lock experiences severe contention.
Describe the bug
We created 8 workers, each configured with UCS_THREAD_MODE_SINGLE and each polled on its own pinned CPU core (8 cores in total), but performance fails to scale.
All workers share the same UCP context.
Profiling with perf produced the following flame graph, which shows severe contention on pthread_spin_lock.
We do not pass a memh and rely on rcache instead. We suspect the contention comes from the locking that rcache requires.
For the detailed SVG files, please refer to the attachment.
Steps to Reproduce
- Command line
We developed the program based on the UCP API.
- Compile Config
"--disable-doxygen-doc",
"--without-go",
"--without-java",
"--without-rte",
"--without-fuse3",
"--without-gdrcopy",
"--without-knem",
"--without-xpmem",
"--without-ugni",
"--enable-frame-pointer",
"--enable-mt",
- UCX version used: ucx 1.18.0
- Any UCX environment variables used
client:
export UCX_TLS=rc,tcp
export UCX_IB_ROCE_LOCAL_SUBNET=y
server:
export UCX_CM_REUSEADDR=y
export UCX_IB_ROCE_LOCAL_SUBNET=y
export UCX_TLS=rc,tcp
Setup and versions
- OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
Linux H04-100G-ASW017-M01 5.14.0-162.nos.4.el8.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Nov 24 07:51:00 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- For RDMA/IB/RoCE related issues:
rdma-core-58mlnx43-1.58415.x86_64
- HW information from ibstat or ibv_devinfo -vv command
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 28.39.1002
node_guid: 58a2:e103:00da:90b6
sys_image_guid: 58a2:e103:00da:90b6
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000834
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 131072
max_qp_wr: 32768
device_cap_flags: 0xed721c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
MEM_WINDOW
XRC
MEM_MGT_EXTENSIONS
MEM_WINDOW_TYPE_2B
RAW_IP_CSUM
MANAGED_FLOW_STEERING
Unknown flags: 0xC8400000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 8388608
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 2097152
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 240
max_total_mcast_qp_attach: 503316480
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
general_odp_caps:
ODP_SUPPORT
ODP_SUPPORT_IMPLICIT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_ATOMIC
SUPPORT_SRQ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
xrc_odp_caps:
SUPPORT_SEND
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_ATOMIC
SUPPORT_SRQ
completion timestamp_mask: 0x7fffffffffffffff
hca_core_clock: 1000000kHZ
raw packet caps:
C-VLAN stripping offload
Scatter FCS offload
IP csum offload
Delay drop
device_cap_flags_ex: 0x30000054ED721C36
RAW_SCATTER_FCS
PCI_WRITE_END_PADDING
Unknown flags: 0x3000004000000000
tso_caps:
max_tso: 262144
supported_qp:
SUPPORT_RAW_PACKET
rss_caps:
max_rwq_indirection_tables: 524288
max_rwq_indirection_table_size: 2048
rx_hash_function: 0x1
rx_hash_fields_mask: 0x800000FF
supported_qp:
SUPPORT_RAW_PACKET
max_wq_type_rq: 8388608
packet_pacing_caps:
qp_rate_limit_min: 1kbps
qp_rate_limit_max: 200000000kbps
supported_qp:
SUPPORT_RAW_PACKET
tag matching not supported
cq moderation caps:
max_cq_count: 65535
max_cq_period: 4095 us
maximum available device memory: 131072Bytes
num_comp_vectors: 63
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
max_msg_sz: 0x40000000
port_cap_flags: 0x04010000
port_cap_flags2: 0x0000
max_vl_num: invalid value (0)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 1
gid_tbl_len: 255
subnet_timeout: 0
init_type_reply: 0
active_width: 4X (2)
active_speed: 50.0 Gbps (64)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:5aa2:e1ff:feda:7ec6, RoCE v1
GID[ 1]: fe80::5aa2:e1ff:feda:7ec6, RoCE v2
GID[ 2]: 0000:0000:0000:0000:0000:ffff:0a0a:1a67, RoCE v1
GID[ 3]: ::ffff:10.10.26.103, RoCE v2
Additional information (depending on the issue)
- OpenMPI version
- Output of ucx_info -d to show transports and devices recognized by UCX
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# memory types: host (access,reg,cache)
#
# Transport: self
# Device: memory
# Type: loopback
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 6911.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# memory types: host (access,reg,cache)
#
# Transport: tcp
# Device: public
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 45265.43/ppn + 0.00 MB/sec
# latency: 5201 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: ens65f0
# Type: network
# System device: ens65f0 (0)
#
# capabilities:
# bandwidth: 113.16/ppn + 0.00 MB/sec
# latency: 5776 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: private
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 45265.43/ppn + 0.00 MB/sec
# latency: 5201 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: lo
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: sysv
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: posix
# Component: posix
# allocate: <= 263549516K
# remote key: 24 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: posix
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: mlx5_bond_0
# Component: ib
# register: unlimited, dmabuf, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache)
#
# Transport: dc_mlx5
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (1)
#
# capabilities:
# bandwidth: 43831.35/ppn + 0.00 MB/sec
# latency: 860 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: rc_verbs
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (1)
#
# capabilities:
# bandwidth: 43831.35/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 18 bytes
# ep address: 7 bytes
# error handling: peer failure, ep_check
#
#
# Transport: rc_mlx5
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (1)
#
# capabilities:
# bandwidth: 43831.35/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 18 bytes
# ep address: 10 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (1)
#
# capabilities:
# bandwidth: 43831.35/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 920
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (1)
#
# capabilities:
# bandwidth: 43831.35/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 132
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_bond_1
# Component: ib
# register: unlimited, dmabuf, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache)
#
# Transport: dc_mlx5
# Device: mlx5_bond_1:1
# Type: network
# System device: mlx5_bond_1 (2)
#
# capabilities:
# bandwidth: 43831.35/ppn + 0.00 MB/sec
# latency: 860 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: rc_verbs
# Device: mlx5_bond_1:1
# Type: network
# System device: mlx5_bond_1 (2)
#
# capabilities:
# bandwidth: 43831.35/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 18 bytes
# ep address: 7 bytes
# error handling: peer failure, ep_check
#
#
# Transport: rc_mlx5
# Device: mlx5_bond_1:1
# Type: network
# System device: mlx5_bond_1 (2)
#
# capabilities:
# bandwidth: 43831.35/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 70
# device num paths: 2
# max eps: 256
# device address: 18 bytes
# ep address: 10 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_bond_1:1
# Type: network
# System device: mlx5_bond_1 (2)
#
# capabilities:
# bandwidth: 43831.35/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 920
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_bond_1:1
# Type: network
# System device: mlx5_bond_1 (2)
#
# capabilities:
# bandwidth: 43831.35/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 132
# connection: to ep, to iface
# device priority: 70
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
# max_conn_priv: 54 bytes
#
# Memory domain: cma
# Component: cma
# register: unlimited, cost: 9 nsec
# memory types: host (access,reg,cache)
#
# Transport: cma
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: knem
# Component: knem
# register: unlimited, cost: 180 nsec
# remote key: 16 bytes
# memory types: host (access,reg,cache)
#
# Transport: knem
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 13862.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 0 bytes
# error handling: none
#
#
# Memory domain: xpmem
# Component: xpmem
# register: unlimited, cost: 60 nsec
# remote key: 24 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,reg,cache)
#
# Transport: xpmem
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 16 bytes
# error handling: none
#
Hi @ivanallen, Have you been able to try using v1.20.x in case this issue was already addressed in newer versions?
Thank you @roiedanino.
I have upgraded ucx to 1.12.0-rc1. The issue still exists.
Using an independent ucp context in each thread reduces spinlock contention to some extent: 4K IOPS increases from 1.3 million to 2.4 million.
I am not sure whether this usage introduces other issues. Even so, spinlock contention still accounts for a significant proportion.
@ivanallen, can you please share the program or provide a minimal reproducer so it will be easier for us to reproduce?
Hi @roiedanino
I used ucx_perftest to reproduce the issue.
[root@G02-100G-ASW016-M07 ucx-1.20.0]# ./install-release/bin/ucx_info -v
# Library version: 1.20.0
# Library path: /root/test/ucx-1.20.0/install-release/lib/libucs.so.0
# API headers version: 1.20.0
# Git branch '', revision 543c323
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/root/test/ucx-1.20.0/install-release
server
./install-release/bin/ucx_perftest -t ucp_am_lat -s 4096 -T 8
client
./install-release/bin/ucx_perftest -t ucp_am_lat 10.10.16.167 -s 4096 -T 8 -n 10000000
server's flame graph: