Data inconsistency in AM protocol
Describe the bug
We built an RPC layer on top of the UCX AM protocol, but found that the server receives incorrect data after the client has been sending continuously for a while.
Thread mode: only one thread in both the client and the server, with UCS_THREAD_MODE_SINGLE set.
Environment:
UCX_LOG_LEVEL=data UCX_TLS=rc,tcp UCX_PROTO_INFO=data
When we use UCX_RCACHE_ENABLE=no or UCX_TLS=tcp, this problem no longer occurs.
More info and logs
I printed the first 4 bytes of each buffer inside UCX.
-- this is the UCX log on the client
[1745470055.427691] [node-57:180805:1] rc_mlx5.inl:464 UCX DATA QP 0x552c [891]
SEND s-e
[inl len 10]
[va 0x525002eb3100[08 64 10 c8] len 5 lkey 0x269fb5]
[va 0x521000039900[61 61 61 61] len 4096 lkey 0x6a7396]
[va 0x7fd47cffdf80[58 50 52 43] len 107 lkey 0x6a7295] -- am 24
-- this is the RPC log on the client. It was printed after receiving the error
-- returned by the server, to show that the send buffer had not been modified.
[2025-04-24 04:47:35.428] [error] [controller.cc:171] desc:Controller done
request_id:25022
states:SEND,READY_TO_RECV,RECV,SEND_DONE,RECV_DONE,IOV,CRC32_SET
error_code:22 error_text:RPC CRC32 check failed
iov:0x525002eb3100:5, 0x521000039900:4096
body:[08 64 10 C8 01 61 61 61 61 61 61 61 61 61 61 61]
As you can see, we sent [08 64 10 c8] (first 4 bytes of iov[0]) and [61 61 61 61] (first 4 bytes of iov[1]), but the server received [1C 15 00 00 0D 61 61 61 61 61 61 61 61 61 61 61] (first 16 bytes of iov[0] + iov[1]) instead of [08 64 10 C8 01 61 61 61 61 61 61 61 61 61 61 61].
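Roughly, the rpc_crc32 in the meta is computed on the client over the request body (the iov entries) and re-checked on the server. The following is only a sketch of that kind of check, with made-up names (IovEntry, crc32_of_iov), not the actual xrpc code:

```cpp
// Illustrative sketch of the CRC32 check that reports the failure above.
#include <zlib.h>
#include <cstddef>
#include <cstdint>

struct IovEntry {
    const unsigned char* base;
    size_t               len;
};

uint32_t crc32_of_iov(const IovEntry* iov, size_t iovcnt) {
    uLong crc = crc32(0L, Z_NULL, 0);  // initial CRC value
    for (size_t i = 0; i < iovcnt; ++i) {
        // Fold each iov segment in order; the server recomputes this over the
        // received bytes and compares it with rpc_crc32 from the meta.
        crc = crc32(crc, iov[i].base, static_cast<uInt>(iov[i].len));
    }
    return static_cast<uint32_t>(crc);
}
```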
-- this is the UCX log on the server
[1745470055.426669] [node-58:105142:1] rc_mlx5.inl:400 UCX DATA RECV qp 0x7ba0 [va 0x7fd64ca6bd80 len 4218] -- am 24
-- this is the RPC log on the server
[2025-04-24 04:47:35.426] [105156] [] [debug] worker.cc:334 desc:Received RPC message
data_length:4101
data_ptr:0x7fd64ca6bd8a
hex:[1C 15 00 00 0D 61 61 61 61 61 61 61 61 61 61 61]
is_rndv:false rpc_meta:request {
service_name: "xrpc.example.EchoService"
method_name: "xrpc.example.EchoService.Add"
}
correlation_id: 25022
attachment_size: 4096
rpc_crc32: 2629633783
By capturing the RDMA packets with tcpdump, we found that the data was already incorrect when it was sent.
[root@node-57 xrpc]# show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_bond_0 1 0 fe80:0000:0000:0000:a288:c2ff:fe33:70e4 v1 cluster
mlx5_bond_0 1 1 fe80:0000:0000:0000:a288:c2ff:fe33:70e4 v2 cluster
mlx5_bond_0 1 2 0000:0000:0000:0000:0000:ffff:0a10:2739 10.16.39.57 v1 cluster
mlx5_bond_0 1 3 0000:0000:0000:0000:0000:ffff:0a10:2739 10.16.39.57 v2 cluster
mlx5_bond_1 1 0 fe80:0000:0000:0000:eaeb:d3ff:fe3a:6940 v1 gateway
mlx5_bond_1 1 1 fe80:0000:0000:0000:eaeb:d3ff:fe3a:6940 v2 gateway
mlx5_bond_1 1 2 0000:0000:0000:0000:0000:ffff:0a10:1d39 10.16.29.57 v1 br-vsw-1
mlx5_bond_1 1 3 0000:0000:0000:0000:0000:ffff:0a10:1d39 10.16.29.57 v2 br-vsw-1
mlx5_bond_1 1 4 fe80:0000:0000:0000:584b:59ff:fef5:6bac v1 br-vsw-1
mlx5_bond_1 1 5 fe80:0000:0000:0000:584b:59ff:fef5:6bac v2 br-vsw-1
n_gids_found=10
ofed_info
OFED-internal-23.10-1.1.9:
clusterkit:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/clusterkit-1.11.442-1.2310055.src.rpm
dpcp:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/dpcp-1.1.43-1.2310055.src.rpm
hcoll:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/hcoll-4.8.3223-1.2310055.src.rpm
ibarr:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/ibarr-0.1.3-1.2310055.src.rpm
ibdump:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/ibdump-6.0.0-1.2310055.src.rpm
ibsim:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/ibsim-0.12-1.2310055.src.rpm
ibutils2:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/ibutils2-2.1.1-0.1.MLNX20231105.g79770a56.2310055.src.rpm
iser:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff
isert:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff
kernel-mft:
mlnx_ofed_mft/kernel-mft-4.26.1-3.src.rpm
knem:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/knem-1.1.4.90mlnx3-OFED.23.10.0.2.1.1.src.rpm
libvma:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/libvma-9.8.40-1.src.rpm
libxlio:
/sw/release/sw_acceleration/xlio/libxlio-3.20.8-1.src.rpm
mlnx-dpdk:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mlnx-dpdk-22.11.0-2310.1.0.2310055.src.rpm
mlnx-en:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff
mlnx-ethtool:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mlnx-ethtool-6.4-1.2310055.src.rpm
mlnx-iproute2:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mlnx-iproute2-6.4.0-1.2310055.src.rpm
mlnx-nfsrdma:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
mlnx-nvme:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff
mlnx-ofa_kernel:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff
mlnx-tools:
https://github.com/Mellanox/mlnx-tools mlnx_ofed
commit d3edfb102ad4c3103796192b3719a5f9cd24c7f9
mlx-steering-dump:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mlx-steering-dump-1.0.0-0.2310055.src.rpm
mpitests:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mpitests-3.2.21-8418f75.2310055.src.rpm
mstflint:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mstflint-4.16.1-2.2310055.src.rpm
multiperf:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/multiperf-3.0-3.0.2310055.src.rpm
ofed-docs:
docs.git mlnx_ofed-4.0
commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc
openmpi:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/openmpi-4.1.7a1-1.2310055.src.rpm
opensm:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/opensm-5.17.0.MLNX20231105.d437ae0a-0.1.2310055.src.rpm
openvswitch:
https://gitlab-master.nvidia.com/sdn/ovs mlnx_ofed_23_10
commit c5eb44be9b4d8ee0f6b367b439c582f3487cc741
perftest:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/perftest-23.10.0-0.29.g0705c22.2310055.src.rpm
rdma-core:
mlnx_ofed/rdma-core.git mlnx_ofed_23_10
commit 6c404a661569eb130be94df67f8a98e6794c0c35
rshim:
mlnx_ofed_soc/rshim-2.0.17-0.g0caa378.src.rpm
sharp:
mlnx_ofed_sharp/sharp-3.5.1.MLNX20231116.7fcef5af.tar.gz
sockperf:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/sockperf-3.10-0.git5ebd327da983.2310055.src.rpm
srp:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff
ucx:
mlnx_ofed_ucx/ucx-1.16.0-1.src.rpm
xpmem-lib:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/xpmem-lib-2.7-0.2310055.src.rpm
xpmem:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/xpmem-2.7.3-1.2310055.src.rpm
Installed Packages:
-------------------
libibverbs-utils
sharp
mlnx-iproute2
libxpmem
perftest
ucx
rdma-core
opensm-static
librdmacm-utils
libibumad
ucx-devel
mlnx-ofa_kernel
mlnx-ofa_kernel-modules
mlnx-ethtool
libibverbs
ucx-knem
dpcp
ucx-xpmem
srp_daemon
knem
iser
opensm
isert
hcoll
ibacm
srp
librdmacm
ibsim
openmpi
ucx-ib
xpmem-modules
rshim
mlnx-tools
ucx-rdmacm
mlnxofed-docs
ucx-cma
ibdump
xpmem
mlnx-nvme
rdma-core-devel
knem-modules
opensm-devel
mlnx-ofa_kernel-source
mlnx-ofa_kernel-devel
opensm-libs
infiniband-diags
mpitests_openmpi
kernel-mft
ibutils2
Are you using multiple threads? Can you pls check if setting UCX_IB_MMIO_MODE=db_lock fixes the issue? If it does you need to initialize UCP worker with threading support
Hi @brminich
We have only one thread polling the UCP worker in both the client and the server (one worker per thread; we use a run-to-completion threading model).
// worker
_params.thread_mode = UCS_THREAD_MODE_SINGLE;
_params.field_mask |= UCP_WORKER_PARAM_FIELD_THREAD_MODE;
_params.client_id = client_id;
_params.field_mask |= UCP_WORKER_PARAM_FIELD_CLIENT_ID;
// ep
_params = {}; // clear
_params.field_mask |= UCP_EP_PARAM_FIELD_SOCK_ADDR |
UCP_EP_PARAM_FIELD_FLAGS |
UCP_EP_PARAM_FIELD_ERR_HANDLER |
UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE;
_params.flags |= UCP_EP_PARAMS_FLAGS_CLIENT_SERVER | UCP_EP_PARAMS_FLAGS_SEND_CLIENT_ID;
_params.err_mode = UCP_ERR_HANDLING_MODE_PEER;
_params.err_handler.cb = ucx_complete_error_handler;
_params.err_handler.arg = _closure;
_params.sockaddr.addr = reinterpret_cast<struct sockaddr*>(&_address);
_params.sockaddr.addrlen = sizeof(_address);
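For completeness, a minimal sketch of how these parameter structs are consumed; names such as _context, _worker_params and _ep_params are placeholders, not our exact code:

```cpp
// Sketch only: creating the worker and client endpoint from the parameters
// shown above (error handling reduced to a null return).
#include <ucp/api/ucp.h>

ucp_worker_h create_worker(ucp_context_h _context,
                           const ucp_worker_params_t& _worker_params) {
    ucp_worker_h worker = nullptr;
    ucs_status_t status = ucp_worker_create(_context, &_worker_params, &worker);
    return (status == UCS_OK) ? worker : nullptr;
}

ucp_ep_h connect_endpoint(ucp_worker_h worker,
                          const ucp_ep_params_t& _ep_params) {
    ucp_ep_h ep = nullptr;
    ucs_status_t status = ucp_ep_create(worker, &_ep_params, &ep);
    return (status == UCS_OK) ? ep : nullptr;
}
```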
When I set UCX_IB_MMIO_MODE=db_lock UCX_TLS=rc,tcp UCX_PROTO_INFO=y, the program hung:
#0 0x00007ffff609cea2 in pthread_spin_lock@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007fffe522be30 in ucs_spin_lock (lock=<optimized out>) at /root/xrpc/ucx-1.18.0/src/ucs/type/spinlock.h:83
#2 uct_ib_mlx5_post_send (hw_ci_updated=1, wqe_size=<optimized out>, ctrl=<optimized out>, wq=0x512000024838) at /root/xrpc/ucx-1.18.0/src/uct/ib/mlx5/ib_mlx5.inl:588
#3 uct_rc_mlx5_common_post_send (log_sge=0x0, max_log_sge=<optimized out>, imm=0, wqe_size=<optimized out>, dci_channel=0, fm_ce_se=<optimized out>, opmod=0 '\000', opcode=10 '\n', txwq=0x512000024838, txqp=0x5120000247d0, qp_type=2,
iface=0x52d00041a400) at rc/rc_mlx5.inl:469
#4 uct_rc_mlx5_txqp_dptr_post_iov (max_log_sge=<optimized out>, dci_channel=<optimized out>, fm_ce_se=<optimized out>, av_size=<optimized out>, opaque_mr=<optimized out>, ib_imm_be=<optimized out>, app_ctx=<optimized out>,
tag=<optimized out>, rkey=<optimized out>, remote_addr=<optimized out>, am_hdr_len=<optimized out>, am_hdr=<optimized out>, am_id=<optimized out>, iovcnt=<optimized out>, iov=<optimized out>, opcode_flags=<optimized out>,
txwq=<optimized out>, txqp=<optimized out>, qp_type=<optimized out>, iface=<optimized out>) at rc/rc_mlx5.inl:928
#5 uct_rc_mlx5_base_ep_zcopy_post (comp=0x52e00000a398, op_flags=0, handler=<optimized out>, wqe_flags=2 '\002', opaque_mr=0x0, ib_imm_be=0, app_ctx=0, tag=0, rdma_rkey=0, rdma_raddr=0, am_hdr_len=<optimized out>,
am_hdr=<optimized out>, am_id=<optimized out>, iov_total_length=0, iovcnt=3, iov=0x7fffe34064d0, opcode=10, ep=0x5120000247c0) at rc/rc_mlx5.inl:1941
#6 uct_rc_mlx5_base_ep_am_zcopy (tl_ep=0x5120000247c0, id=<optimized out>, header=<optimized out>, header_length=<optimized out>, iov=0x7fffe34064d0, iovcnt=3, flags=0, comp=0x52e00000a398) at rc/rc_mlx5_ep.c:341
#7 0x00007ffff7759057 in uct_ep_am_zcopy (comp=0x52e00000a398, flags=0, iovcnt=<optimized out>, iov=0x7fffe34064d0, header_length=8, header=0x7fffe34062d8, id=<optimized out>, ep=<optimized out>)
at /root/xrpc/ucx-1.18.0/src/uct/api/uct.h:3112
#8 ucp_am_eager_multi_zcopy_send_func (lane_shift=<synthetic pointer>, next_iter=0x7fffe3406490, lpriv=<optimized out>, req=0x52e00000a300) at am/eager_multi.c:279
#9 ucp_proto_multi_progress (dt_mask=5, complete_func=<optimized out>, send_func=<optimized out>, mpriv=0x522000042160, req=0x52e00000a300) at /root/xrpc/ucx-1.18.0/src/ucp/proto/proto_multi.inl:177
#10 ucp_proto_multi_zcopy_progress (uct_comp_cb=0x7ffff7758010 <ucp_am_eager_zcopy_completion>, complete_func=<optimized out>, send_func=<optimized out>, dt_mask=5, uct_mem_flags=256, init_func=<optimized out>, mpriv=0x522000042160,
req=0x52e00000a300) at /root/xrpc/ucx-1.18.0/src/ucp/proto/proto_multi.inl:246
#11 ucp_am_eager_multi_zcopy_proto_progress (self=0x52e00000a3e0) at am/eager_multi.c:296
#12 0x00007ffff787ee95 in ucp_request_try_send (req=0x52e00000a300) at /root/xrpc/ucx-1.18.0/src/ucp/core/ucp_request.inl:307
#13 ucp_request_send (req=<optimized out>) at /root/xrpc/ucx-1.18.0/src/ucp/core/ucp_request.inl:330
#14 ucp_wireup_replay_pending_request (ucp_ep=0x7fffe03eb000, self=0x52e00000a3e0) at wireup/wireup.c:985
#15 ucp_wireup_replay_pending_requests (ucp_ep=ucp_ep@entry=0x7fffe03eb000, tmp_pending_queue=tmp_pending_queue@entry=0x7fffe34067c0) at wireup/wireup.c:997
#16 0x00007ffff787f412 in ucp_wireup_eps_progress (arg=0x7fffe03eb000) at wireup/wireup.c:503
#17 0x00007ffff76897c1 in ucs_callbackq_oneshot_elems_dispatch (cbq=0x50f000005860) at datastruct/callbackq.c:457
#18 ucs_callbackq_proxy_callback (arg=0x50f000005860) at datastruct/callbackq.c:477
#19 0x00007ffff77a2faa in ucs_callbackq_dispatch (cbq=<optimized out>) at /root/xrpc/ucx-1.18.0/src/ucs/datastruct/callbackq.h:215
#20 uct_worker_progress (worker=<optimized out>) at /root/xrpc/ucx-1.18.0/src/uct/api/uct.h:2813
#21 ucp_worker_progress (worker=0x7fffe3cd8800) at core/ucp_worker.c:3033
#22 0x0000000000f652e6 in xrpc::ucp::UcpWorker::progress (this=0x50300007efd0) at src/ucp/ucp_worker.cc:75
#23 0x00000000007325a0 in xrpc::Channel::poll (this=0x51b000055c80) at src/channel.cc:1008
#24 0x000000000059ab0b in pbrpcframework::PressClient::poll (this=0x5080000115a0, thread_id=0) at examples/xrpc/rpc_press/rpc_press_impl.cc:100
#25 0x00000000005a2926 in pbrpcframework::RpcPress::sync_client (this=0x51d000003280) at examples/xrpc/rpc_press/rpc_press_impl.cc:270
#26 0x000000000059faf2 in pbrpcframework::RpcPress::sync_call_thread (arg=0x51d000003280) at examples/xrpc/rpc_press/rpc_press_impl.cc:195
#27 0x00007ffff794966a in asan_thread_start(void*) () from /lib64/libasan.so.8
#28 0x00007ffff6095d22 in start_thread () from /lib64/libc.so.6
#29 0x00007ffff611ad40 in clone3 () from /lib64/libc.so.6
If I use UCX_IB_MMIO_MODE=db_lock UCX_TLS=rc_v,tcp UCX_PROTO_INFO=y (note rc_v rather than rc), the problem still occurs.
but is your client and/or server using multiple threads? Not for polling worker, but for something else? If yes, pls try to initialize UCP worker mode as UCS_THREAD_MODE_SERIALIZED instead of UCS_THREAD_MODE_SINGLE.
@brminich
but is your client and/or server using multiple threads? Not for polling worker, but for something else? If yes, pls try to initialize UCP worker mode as UCS_THREAD_MODE_SERIALIZED instead of UCS_THREAD_MODE_SINGLE.
Yes, the client/server do use multiple threads, but those threads have nothing to do with UCX; only one thread uses the UCX worker.
I tried UCS_THREAD_MODE_SERIALIZED, but the issue still occurs.
@brminich
I found some clues. When I pass only a single piece of data (a contiguous buffer), only the first 8 bytes are corrupted.
-- client log (server response crc check failed)
[2025-04-24 09:29:39.137] [error] [controller.cc:171] desc:Controller done
request_id:19368
states:SEND,READY_TO_RECV,RECV,SEND_DONE,RECV_DONE,CRC32_SET
error_code:22 error_text:RPC CRC32 check failed
iov:0x525000901100:4098
body:[0A FF 1F 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D]
-- server log
[2025-04-24 09:29:39.136] [124545] [] [debug] worker.cc:334 desc:Received RPC message
data_length:4098
data_ptr:0x7fc57801ed49
hex:[3D 17 00 00 0D 00 00 00 46 47 48 49 4A 4B 4C 4D]
is_rndv:false rpc_meta:request {
service_name: "xrpc.example.EchoService"
method_name: "xrpc.example.EchoService.Echo"
}
correlation_id: 19368
rpc_crc32: 700057043
I tried several times and found that the corrupted data follows a pattern: within the same process the corrupted bytes are always the same, and a new value appears after each restart.
3D 17 00 00 0D 00 00 00 46 47 48 49 4A 4B 4C 4D
^ ^ ^ ^ ^ ^ ^ ^
62 27 00 00 0D 00 00 00 46 47 48 49 4A 4B 4C 4D
^ ^ ^ ^ ^ ^ ^ ^
23 17 00 00 0D 00 00 00 46 47 48 49 4A 4B 4C 4D
^ ^ ^ ^ ^ ^ ^ ^
The captured packets show that the data was already incorrect on the wire.
The corresponding log shows that the data is still correct when ibv_post_send is called, and it is still correct after the RPC returns:
[2025-04-24 10:52:49.307] [debug] [channel.cc:95] desc:issue rpc request_id:18338 body_length:4098 send_buffer_ptr:0x5250016bc100 body:[0A FF 1F 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D]
[1745491969.307624] [node-57:205502:1] ucp_am.c:1036 UCX REQ allocated request 0x52e00000a300
[1745491969.307638] [node-57:205502:1] proto_common.c:719 UCX REQ req 0x52e00000a300: am_send_reply(multi) from host memory length 4179 zero-copy rc_verbs/mlx5_bond_0:1/path0 6185.2 MB/s 0.64 us
[1745491969.307643] [node-57:205502:1] proto_common.inl:172 UCX REQ req 0x52e00000a300: set to stage 0, progress function '???'
[1745491969.307648] [node-57:205502:1] ucp_request.c:740 UCX REQ req 0x52e00000a300: progress am/egr/single/zcopy/reply {???} ep_cfg[3] rkey_cfg[255] offset 0/4098
[1745491969.307652] [node-57:205502:1] proto_common.inl:73 UCX REQ req 0x52e00000a300: ucp_proto_request_zcopy_init for am/egr/single/zcopy/reply
[1745491969.307658] [node-57:205502:1] rcache.inl:43 UCX TRACE ucp_rcache: lru remove region 0x50f0000aff90 [0x5250016bc100..0x5250016bd110] gt rw ref 2 md[0]=mlx5_bond_0
[1745491969.307663] [node-57:205502:1] ucp_mm.inl:32 UCX TRACE memh 0x50f0000aff90: address 0x5250016bc100/0x5250016bc100 length 4098/4112 host md_map 1 obtained from rcache
[1745491969.307676] [node-57:205502:1] rc_verbs_ep.c:54 UCX DATA QP 0x8f5f wrid 0x47a5 SEND s-e [va 0x7f74954c7e80[1a 12 00 01] len 9 lkey 0xecab] [va 0x5250016bc100[0a ff 1f 41] len 4098 lkey 0x48c062] [va 0x7f7496dfdf80[58 50 52 43] len 89 lkey 0x511cbc] -- am 26
[1745491969.307681] [node-57:205502:1] ucp_request.c:756 UCX REQ req 0x52e00000a300: progress protocol am/egr/single/zcopy/reply returned: Success
[1745491969.307685] [node-57:205502:1] proto_common.inl:256 UCX DATA request 0x52e00000a300 send.cb set to 0xf4c9cb, user data: 0x50a0027bfd40
[1745491969.307692] [node-57:205502:1] proto_common.inl:261 UCX REQ returning send request 0x52e00000a300: am_send_reply 0/4098 contiguous buffer:0x5250016bc100
[1745491969.307699] [node-57:205502:1] ucp_mm.inl:92 UCX TRACE memh 0x50f0000aff90: release address 0x5250016bc100 length 4112 md_map 1
[1745491969.307705] [node-57:205502:1] rcache.inl:29 UCX TRACE ucp_rcache: lru add region 0x50f0000aff90 [0x5250016bc100..0x5250016bd110] gt rw ref 2 md[0]=mlx5_bond_0
[1745491969.307709] [node-57:205502:1] ucp_request.inl:246 UCX REQ completing send request 0x52e00000a300 (0x52e00000a410) ---c-- Success
[1745491969.307715] [node-57:205502:1] ucp_request.c:215 UCX REQ free request 0x52e00000a300 (0x52e00000a410) d--c--
[1745491969.307718] [node-57:205502:1] ucp_request.inl:236 UCX REQ put request 0x52e00000a300
[1745491969.307981] [node-57:205502:1] rc_verbs_impl.h:108 UCX DATA RECV qp 0x8f5f [va 0x7f749865b1c0 len 53] -- am 23
[2025-04-24 10:52:49.307] [debug] [channel.cc:350] desc:Received AM RPC message header_size:44 data_size:0
[2025-04-24 10:52:49.308] [debug] [channel.cc:388] desc:Received RPC message body_size:0 is_rndv:false rpc_meta:response {
error_code: 22
error_text: "RPC CRC32 check failed"
}
correlation_id: 18338
[2025-04-24 10:52:49.308] [error] [controller.cc:171] desc:Controller done request_id:18338 states:SEND,READY_TO_RECV,RECV,SEND_DONE,RECV_DONE,CRC32_SET error_code:22 error_text:RPC CRC32 check failed iov:0x5250016bc100:4098 body:[0A FF 1F 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D]
@brminich
A new clue: the buffer actually sent by ibv_post_send is not the one shown in the log, but a buffer that had already been released.
Requests are sent serially; the next request is sent only after the result of the previous one has been received from the server, so in theory this should not happen. I don't know what is going on.
These buffers are returned to the system only after the request result is obtained. I'm not sure how UCX manages the rcache: what happens if a buffer is returned to the system but still exists in the rcache?
--> client
[2025-04-25 05:53:56.702] [36789] request_id:2506 alloc_buffer:0x52500190a100 buffer:"XXXXXXXX 2506 ABCDE..."
[2025-04-25 05:53:56.702] [36789] request_id:2506 receive server response and release 0x52500190a100
[1745560436.702221] [node-57:36789:0] rc_verbs_ep.c:54 UCX DATA QP 0x1b7e wrid 0x9cd SEND s-e
[va 0x7f14416e8e80[1a 12 00 01] len 9 lkey 0x121df]
[va 0x52500190a100[0a 8e 20 58] len 4113 lkey 0x290b89]
[va 0x7f1442ffdf80[58 50 52 43] len 88 lkey 0x1bc9c3] -- am 26
...
[2025-04-25 05:54:05.456] [36789] request_id:7682 alloc_buffer:0x52500190a100 buffer:"XXXXXXXX 7802 ABCDE..."
[1745560445.456262] [node-57:36789:0] rc_verbs_ep.c:54 UCX DATA QP 0x1b7e wrid 0x1e05 SEND s-e
[va 0x7f14416e8e80[1a 12 00 01] len 9 lkey 0x121df]
[va 0x52500190a100[0a 8e 20 58] len 4113 lkey 0x290b89]
[va 0x7f1442ffdf80[58 50 52 43] len 88 lkey 0x1bc9c3] -- am 26
<-- server
[2025-04-25 05:54:05.456] [17291] worker.cc:418 request_id:7682 desc:CRC32 check failed
crc:4273242618 expected:891675410
hex:[E7 10 00 00 00 00 00 00 58 58 58 20 32 35 30 36] "_____XXX 2506"
Request 7682 received the buffer data of request 2506.
My guess: request 2506 released its buffer, but the lkey still points to the physical memory that backed request 2506's buffer. Request 7682 was then allocated the same virtual address, whose backing physical memory has changed, yet the lkey taken from the cache is unchanged. This leads to the old data being sent.
When I allocate memory from a buffer pool instead, the problem no longer occurs.
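A simplified sketch of the allocation pattern that hits the problem; our real send path goes through the xrpc Channel, so the helper below is illustrative, not our actual code:

```cpp
// Each request allocates a fresh body with new[], sends it via the AM
// protocol, waits for completion, then frees it with delete[]. A later
// request can get the same virtual address back from the allocator while
// the UCX rcache may still hold the registration made for the old buffer.
#include <ucp/api/ucp.h>
#include <cstring>

void send_one_request(ucp_worker_h worker, ucp_ep_h ep, unsigned am_id,
                      const void* hdr, size_t hdr_len, size_t body_len) {
    char* body = new char[body_len];   // freshly allocated for every request
    std::memset(body, 'a', body_len);

    ucp_request_param_t param = {};    // no special fields: UCX registers the
                                       // buffer internally (and caches it)
    ucs_status_ptr_t req = ucp_am_send_nbx(ep, am_id, hdr, hdr_len,
                                           body, body_len, &param);
    if (UCS_PTR_IS_PTR(req)) {         // UCS_PTR_IS_ERR handling omitted
        while (ucp_request_check_status(req) == UCS_INPROGRESS) {
            ucp_worker_progress(worker);   // single-threaded polling
        }
        ucp_request_free(req);
    }

    // ... the next request is issued only after the server reply arrives ...
    delete[] body;   // memory goes back to the allocator (and possibly the OS)
}
```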
@brminich @yosefe
I'd need a few more details about your application. When you release the buffer, it should be unregistered and removed from the rcache.
Could you clarify the following:
- How do you allocate and release memory at the point where the issue occurs?
- Which buffer pool do you use when the problem goes away?
- Are you calling any UCX functions from another thread (e.g., ucp_mem_map or similar routines)? Not necessarily UCP worker related
@brminich
- Our usage is similar to ucp_client_server: we dynamically allocate and release memory with plain C++ new/delete. With this approach RDMA ends up sending old data, which results in errors.
I understand that UCX registers memory buffers dynamically, but this is transparent to us. I don't know how to tell UCX to unregister the memory and remove it from the rcache after we release it (see the sketch after this list for what an explicit registration flow might look like).
- The buffer pool (a structure we wrote ourselves) caches the allocated memory and does not return it to the OS. When used this way, there are no problems.
- ucp_mem_map is not used in our program. Only one thread uses any UCX-related APIs; the other threads are all background/management threads, such as monitoring.
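For reference, this is what we imagine an explicit registration flow could look like, based on our reading of the public UCP API (ucp_mem_map / ucp_mem_unmap plus passing the memh to the send). It is only a sketch with made-up helper names, and we have not confirmed it is the recommended way to avoid the stale rcache entry:

```cpp
// Sketch only: register the buffer ourselves, pass the handle to the AM send,
// and unmap it before freeing.
#include <ucp/api/ucp.h>

struct RegisteredBuffer {
    void*     addr;
    size_t    len;
    ucp_mem_h memh;
};

bool register_buffer(ucp_context_h ctx, void* addr, size_t len,
                     RegisteredBuffer* out) {
    ucp_mem_map_params_t mp = {};
    mp.field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                    UCP_MEM_MAP_PARAM_FIELD_LENGTH;
    mp.address    = addr;
    mp.length     = len;
    if (ucp_mem_map(ctx, &mp, &out->memh) != UCS_OK) {
        return false;
    }
    out->addr = addr;
    out->len  = len;
    return true;
}

ucs_status_ptr_t send_with_memh(ucp_ep_h ep, unsigned am_id, const void* hdr,
                                size_t hdr_len, const RegisteredBuffer& buf) {
    ucp_request_param_t param = {};
    param.op_attr_mask = UCP_OP_ATTR_FIELD_MEMH;  // use our own registration
    param.memh         = buf.memh;
    return ucp_am_send_nbx(ep, am_id, hdr, hdr_len, buf.addr, buf.len, &param);
}

void release_buffer(ucp_context_h ctx, RegisteredBuffer* buf) {
    ucp_mem_unmap(ctx, buf->memh);                // drop the registration first
    delete[] static_cast<char*>(buf->addr);       // then free the memory
    buf->addr = nullptr;
}
```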