UCX use a large amount of SYSV HugePage memory
Describe the bug
While using UCX, we found that the system's huge pages are heavily utilized: more than 4000 huge pages (2 MB each) are occupied in total. In some scenarios, a large amount of memory is requested in a short period of time (around 1 minute), totaling about 2 GB of huge-page memory.
By examining /proc/[pid]/numa_maps, we found that SYSV huge pages are heavily utilized. After reviewing our code, we found that only UCX uses SYSV to allocate huge-page memory, so we suspect that UCX is abnormally occupying these pages.
What could be occupying this SYSV huge-page memory? What scenarios might trigger this problem? Is there any way to avoid it?
Our program uses the ucp_tag_send_nb and ucp_tag_recv_nb interfaces. Internally, there is a UCX server that receives external requests, with multiple clients establishing connections to it over either rc_x or tcp. Additionally, several UCX clients establish connections to other services, using rc_x only. A schematic diagram is shown below:
The SYSV huge pages in /proc/[pid]/numa_maps:
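One way to total the huge pages behind such mappings is to sum the per-node page counts (`N<node>=<pages>`) on lines marked `huge`. The sample line below is purely illustrative; in practice the input comes from `/proc/<pid>/numa_maps`:

```shell
# Illustrative numa_maps line for a SysV hugetlb segment (real input:
# /proc/<pid>/numa_maps; the address and counts here are made up)
sample='7fa000000000 default file=/SYSV00000000 (deleted) huge dirty=512 N0=512 kernelpagesize_kB=2048'

# Sum N<node>=<pages> fields on "huge" lines; 2 MB per page
echo "$sample" | awk '/huge/ {
  for (i = 1; i <= NF; i++)
    if ($i ~ /^N[0-9]+=/) { split($i, a, "="); pages += a[2] }
} END { printf "%d huge pages, %d MB\n", pages, pages * 2 }'
# prints: 512 huge pages, 1024 MB
```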
Steps to Reproduce
- Command line
- UCX version used: v1.12.0
- UCX configure flags
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni
- Any UCX environment variables used
setenv("UCX_MAX_EAGER_LANES", "2", 1);
setenv("UCX_IB_SEG_SIZE", "2k", 1);
setenv("UCX_RC_RX_QUEUE_LEN", "1024", 1);
setenv("UCX_RC_MAX_RD_ATOMIC", "16", 1);
setenv("UCX_RC_ROCE_PATH_FACTOR", "2", 1);
setenv("UCX_RNDV_THRESH", "32k", 1);
setenv("UCX_IB_TRAFFIC_CLASS", "166", 1);
setenv("UCX_SOCKADDR_CM_ENABLE", "y", 1);
setenv("UCX_RC_MAX_GET_ZCOPY", "32k", 1);
setenv("UCX_RC_TX_NUM_GET_OPS", "8", 1);
setenv("UCX_RC_TX_NUM_GET_BYTES", "256k", 1);
setenv("UCX_RC_TX_CQ_MODERATION", "1", 1);
setenv("UCX_HANDLE_ERRORS", "", 1);
setenv("UCX_IB_FORK_INIT", "n", 1);
Setup and versions
- OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...):
CentOS Linux release 7.2.1511 (Core)
3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
- For RDMA/IB/RoCE related issues:
- Driver version:
- rdma-core-52mlnx1-1.52104.x86_64
- MLNX_OFED_LINUX-5.2-1.0.4.0
- HW information from ibstat or ibv_devinfo -vv command:
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 16.27.1016
node_guid: b8ce:f603:00e9:4d5e
sys_image_guid: b8ce:f603:00e9:4d5e
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000080
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 262144
max_qp_wr: 32768
device_cap_flags: 0xed721c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
MEM_WINDOW
XRC
MEM_MGT_EXTENSIONS
MEM_WINDOW_TYPE_2B
RAW_IP_CSUM
MANAGED_FLOW_STEERING
Unknown flags: 0xC8400000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4194304
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 240
max_total_mcast_qp_attach: 503316480
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
general_odp_caps:
ODP_SUPPORT
ODP_SUPPORT_IMPLICIT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
xrc_odp_caps:
SUPPORT_SEND
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
completion timestamp_mask: 0x7fffffffffffffff
hca_core_clock: 156250kHZ
raw packet caps:
C-VLAN stripping offload
Scatter FCS offload
IP csum offload
Delay drop
device_cap_flags_ex: 0x30000055ED721C36
RAW_SCATTER_FCS
PCI_WRITE_END_PADDING
Unknown flags: 0x3000004100000000
tso_caps:
max_tso: 262144
supported_qp:
SUPPORT_RAW_PACKET
rss_caps:
max_rwq_indirection_tables: 65536
max_rwq_indirection_table_size: 2048
rx_hash_function: 0x1
rx_hash_fields_mask: 0x800000FF
supported_qp:
SUPPORT_RAW_PACKET
max_wq_type_rq: 8388608
packet_pacing_caps:
qp_rate_limit_min: 1kbps
qp_rate_limit_max: 25000000kbps
supported_qp:
SUPPORT_RAW_PACKET
tag matching not supported
cq moderation caps:
max_cq_count: 65535
max_cq_period: 4095 us
maximum available device memory: 131072Bytes
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
max_msg_sz: 0x40000000
port_cap_flags: 0x04010000
port_cap_flags2: 0x0000
max_vl_num: invalid value (0)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 1
gid_tbl_len: 256
subnet_timeout: 0
init_type_reply: 0
active_width: 1X (1)
active_speed: 25.0 Gbps (32)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:bace:f6ff:fee9:4d5e, RoCE v1
GID[ 1]: fe80::bace:f6ff:fee9:4d5e, RoCE v2
GID[ 2]: fe80:0000:0000:0000:bace:f6ff:fee9:4d5e, RoCE v1
GID[ 3]: fe80::bace:f6ff:fee9:4d5e, RoCE v2
GID[ 4]: 0000:0000:0000:0000:0000:ffff:0a4e:0588, RoCE v1
GID[ 5]: ::ffff:10.78.5.136, RoCE v2
Additional information (depending on the issue)
- Output of ucx_info -d to show transports and devices recognized by UCX:
ucx_info -d
#
# Memory domain: posix
# Component: posix
# allocate: <= 64967552K
# remote key: 24 bytes
# rkey_ptr is supported
#
# Transport: posix
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
#
# Transport: sysv
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: self
# Device: memory0
# Type: loopback
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 6911.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: tcp
# Device: lo
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: bond1
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 5658.18/ppn + 0.00 MB/sec
# latency: 5212 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: mlx5_bond_0
# Component: ib
# register: unlimited, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
#
# Transport: rc_verbs
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (0)
#
# capabilities:
# bandwidth: 2739.46/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 38
# device num paths: 2
# max eps: 256
# device address: 18 bytes
# ep address: 5 bytes
# error handling: peer failure, ep_check
#
#
# Transport: rc_mlx5
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (0)
#
# capabilities:
# bandwidth: 2739.46/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 38
# device num paths: 2
# max eps: 256
# device address: 18 bytes
# ep address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: dc_mlx5
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (0)
#
# capabilities:
# bandwidth: 2739.46/ppn + 0.00 MB/sec
# latency: 860 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 38
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 5 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (0)
#
# capabilities:
# bandwidth: 2739.46/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 880
# connection: to ep, to iface
# device priority: 38
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (0)
#
# capabilities:
# bandwidth: 2739.46/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 132
# connection: to ep, to iface
# device priority: 38
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
# max_conn_priv: 54 bytes
#
# Memory domain: cma
# Component: cma
# register: unlimited, cost: 9 nsec
#
# Transport: cma
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: knem
# Component: knem
# register: unlimited, cost: 18446744073709551616000000000 nsec
# remote key: 16 bytes
#
# Transport: knem
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 13862.00/ppn + 0.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 0 bytes
# error handling: none
#
Assuming UCX allocates huge pages with the sysv transport, you could try:
- disabling huge pages: UCX_SYSV_HUGETLB_MODE=no
- disabling the sysv transport, e.g.: UCX_TLS=rc_x,tcp,self
Otherwise, if it is related to internal buffers, there are also:
- UCX_SELF_ALLOC=huge,thp,md,mmap,heap
- UCX_TCP_ALLOC=huge,thp,md,mmap,heap
- UCX_RC_MLX5_ALLOC=thp,mmap,heap
- UCX_ALLOC_PRIO=md:sysv,md:posix,thp,md:*,mmap,heap
You could try removing md:sysv/thp/huge from these. Also, a ucp_mem_map(UCP_MEM_MAP_ALLOCATE) call from the app could lead to sysv allocation.
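A minimal way to apply the first two suggestions (an illustrative fragment; whether rc_x/tcp/self is the right transport list depends on the setup):

```shell
# Keep the sysv transport but stop it from using huge pages:
export UCX_SYSV_HUGETLB_MODE=no
# ...or drop the sysv transport altogether:
export UCX_TLS=rc_x,tcp,self
```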
I tried starting the program with the following configuration and found that SYSV huge pages are still being used. In the UCX log, I discovered that the huge pages are used by ucp_am_bufs.
UCX_ALLOC_PRIO=md:sysv,md:posix,md:*,mmap,heap
UCX_POSIX_ALLOC=md,mmap,heap
UCX_SYSV_ALLOC=md,mmap,heap
UCX_SELF_ALLOC=md,mmap,heap
UCX_TCP_ALLOC=md,mmap,heap
UCX_RC_VERBS_ALLOC=md,mmap,heap
UCX_RC_MLX5_ALLOC=md,mmap,heap
UCX_DC_MLX5_ALLOC=md,mmap,heap
UCX_UD_VERBS_ALLOC=md,mmap,heap
UCX_UD_MLX5_ALLOC=md,mmap,heap
UCX_CMA_ALLOC=mmap,heap
UCX_KNEM_ALLOC=md,mmap,heap
UCX_POSIX_HUGETLB_MODE=n
UCX_SYSV_HUGETLB_MODE=n
The alloc_chunk function of the ucp_am_bufs mpool is ucs_mpool_hugetlb_malloc, which always tries to allocate huge pages for its 64 KB elements and falls back to regular pages only if the huge-page allocation fails. There is no parameter to change this behavior.
I suspect that the TCP server in UCX is using this mpool.
@yosefe, shall we allow non-huge-page allocation for ucp_am_bufs?
I still have a question: why does using TCP transports require so many ucp_am_bufs? In our production environment, there have been instances where 1000 huge pages (2 GB) were requested in a short period of time.
Hi @littleneko, is it an option to upgrade the UCX version, or to check whether https://github.com/openucx/ucx/pull/7544/commits/7ca49c44817e1ca5f0531431d365d327601c7055 is part of the version in use? In older UCX versions, the receive buffers were allocated according to the maximum possible fragment size; in newer versions they are allocated according to the actual size, rounded up to a power of 2. ucp_am_bufs allocates memory for unexpected received messages - those that were received before ucp_tag_recv_nbx() with a matching tag/mask was called.
Thank you.
I have confirmed that the UCX version we are using (v1.12.x) includes commit 7ca49c4.
Reading the ucp_eager_tagged_handler function, I found that ucp_am_bufs are used to save data only when the tag of a received message does not match.
In our code, we first send a message containing a tag and then send the data. If ucp_am_bufs are used this heavily, does that imply that the first message containing the tag is lost?
Messages are not lost; it means that the receiver had not yet called ucp_tag_recv_nbx() when the message arrived.