
Data inconsistency in am protocol

Open ivanallen opened this issue 8 months ago • 10 comments

Describe the bug

We built an RPC layer on top of the UCX AM protocol, but found that the server receives corrupted data after the client has been sending continuously for a while.

Thread mode: only one thread in both client and server, with the worker initialized as UCS_THREAD_MODE_SINGLE.

environment:

UCX_LOG_LEVEL=data UCX_TLS=rc,tcp  UCX_PROTO_INFO=data

When we use UCX_RCACHE_ENABLE=no or UCX_TLS=tcp, this problem no longer occurs.
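
For context, the client send path boils down to an AM send with a 2-entry IOV, roughly like the sketch below (hypothetical variable names, not our actual code; it only shows the ucp_am_send_nbx call pattern we rely on):

// Sketch of the client-side AM send with an IOV datatype (illustrative names).
ucp_dt_iov_t iov[2];
iov[0].buffer = meta_buf;               // small serialized meta, e.g. 5 bytes
iov[0].length = meta_len;
iov[1].buffer = body_buf;               // payload, e.g. 4096 bytes
iov[1].length = body_len;

ucp_request_param_t param = {};
param.op_attr_mask = UCP_OP_ATTR_FIELD_DATATYPE |
                     UCP_OP_ATTR_FIELD_CALLBACK |
                     UCP_OP_ATTR_FIELD_USER_DATA;
param.datatype  = UCP_DATATYPE_IOV;
param.cb.send   = on_send_complete;     // app-defined completion callback
param.user_data = request_ctx;

void *req = ucp_am_send_nbx(ep, /*am_id=*/24, am_header, am_header_len,
                            iov, 2, &param);
// req is NULL (completed inline), a request handle to be progressed until
// completion, or an error pointer to be checked with UCS_PTR_IS_ERR().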

More info and log

I printed the first 4 bytes of each buffer in the UCX log:

-- this is the ucx log in the client
[1745470055.427691] [node-57:180805:1]       rc_mlx5.inl:464  UCX  DATA    QP 0x552c [891] 
SEND s-e 
 [inl len 10]
 [va 0x525002eb3100[08 64 10 c8] len 5 lkey 0x269fb5] 
 [va 0x521000039900[61 61 61 61] len 4096 lkey 0x6a7396] 
 [va 0x7fd47cffdf80[58 50 52 43] len 107 lkey 0x6a7295] -- am 24


-- this is the rpc log in the client. It is printed after the error is
-- returned by the server and shows that the buffer was not modified
-- on the client side.

[2025-04-24 04:47:35.428] [error] [controller.cc:171] desc:Controller done      
  request_id:25022       
  states:SEND,READY_TO_RECV,RECV,SEND_DONE,RECV_DONE,IOV,CRC32_SET        
  error_code:22   error_text:RPC CRC32 check failed       
  iov:0x525002eb3100:5, 0x521000039900:4096       
  body:[08 64 10 C8 01 61 61 61 61 61 61 61 61 61 61 61]

As you can see, we sent [08 64 10 c8] (only the first 4 bytes of iov[0] are shown) and [61 61 61 61] (only the first 4 bytes of iov[1] are shown).

But the server received [1C 15 00 00 0D 61 61 61 61 61 61 61 61 61 61 61] (only the first 16 bytes of iov[0] + iov[1] are shown) instead of [08 64 10 C8 01 61 61 61 61 61 61 61 61 61 61 61].

-- this is the ucx log in the server
[1745470055.426669] [node-58:105142:1]       rc_mlx5.inl:400  UCX  DATA  RECV qp 0x7ba0 [va 0x7fd64ca6bd80 len 4218] -- am 24

-- this is the rpc log in the server
[2025-04-24 04:47:35.426] [105156] [] [debug] worker.cc:334 desc:Received RPC message   
 data_length:4101        
 data_ptr:0x7fd64ca6bd8a 
 hex:[1C 15 00 00 0D 61 61 61 61 61 61 61 61 61 61 61]   
 is_rndv:false   rpc_meta:request {
  service_name: "xrpc.example.EchoService"
  method_name: "xrpc.example.EchoService.Add"
}
correlation_id: 25022
attachment_size: 4096
rpc_crc32: 2629633783

By capturing the RDMA packets with tcpdump, we found that the data was already incorrect when it was sent on the wire.

[root@node-57 xrpc]# show_gids
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_bond_0     1       0       fe80:0000:0000:0000:a288:c2ff:fe33:70e4                 v1      cluster
mlx5_bond_0     1       1       fe80:0000:0000:0000:a288:c2ff:fe33:70e4                 v2      cluster
mlx5_bond_0     1       2       0000:0000:0000:0000:0000:ffff:0a10:2739 10.16.39.57     v1      cluster
mlx5_bond_0     1       3       0000:0000:0000:0000:0000:ffff:0a10:2739 10.16.39.57     v2      cluster
mlx5_bond_1     1       0       fe80:0000:0000:0000:eaeb:d3ff:fe3a:6940                 v1      gateway
mlx5_bond_1     1       1       fe80:0000:0000:0000:eaeb:d3ff:fe3a:6940                 v2      gateway
mlx5_bond_1     1       2       0000:0000:0000:0000:0000:ffff:0a10:1d39 10.16.29.57     v1      br-vsw-1
mlx5_bond_1     1       3       0000:0000:0000:0000:0000:ffff:0a10:1d39 10.16.29.57     v2      br-vsw-1
mlx5_bond_1     1       4       fe80:0000:0000:0000:584b:59ff:fef5:6bac                 v1      br-vsw-1
mlx5_bond_1     1       5       fe80:0000:0000:0000:584b:59ff:fef5:6bac                 v2      br-vsw-1
n_gids_found=10

ofed_info

OFED-internal-23.10-1.1.9:

clusterkit:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/clusterkit-1.11.442-1.2310055.src.rpm

dpcp:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/dpcp-1.1.43-1.2310055.src.rpm

hcoll:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/hcoll-4.8.3223-1.2310055.src.rpm

ibarr:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/ibarr-0.1.3-1.2310055.src.rpm

ibdump:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/ibdump-6.0.0-1.2310055.src.rpm

ibsim:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/ibsim-0.12-1.2310055.src.rpm

ibutils2:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/ibutils2-2.1.1-0.1.MLNX20231105.g79770a56.2310055.src.rpm

iser:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff

isert:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff

kernel-mft:
mlnx_ofed_mft/kernel-mft-4.26.1-3.src.rpm

knem:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/knem-1.1.4.90mlnx3-OFED.23.10.0.2.1.1.src.rpm

libvma:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/libvma-9.8.40-1.src.rpm

libxlio:
/sw/release/sw_acceleration/xlio/libxlio-3.20.8-1.src.rpm

mlnx-dpdk:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mlnx-dpdk-22.11.0-2310.1.0.2310055.src.rpm

mlnx-en:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff

mlnx-ethtool:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mlnx-ethtool-6.4-1.2310055.src.rpm

mlnx-iproute2:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mlnx-iproute2-6.4.0-1.2310055.src.rpm

mlnx-nfsrdma:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10

mlnx-nvme:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff

mlnx-ofa_kernel:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff

mlnx-tools:
https://github.com/Mellanox/mlnx-tools mlnx_ofed
commit d3edfb102ad4c3103796192b3719a5f9cd24c7f9

mlx-steering-dump:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mlx-steering-dump-1.0.0-0.2310055.src.rpm

mpitests:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mpitests-3.2.21-8418f75.2310055.src.rpm

mstflint:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/mstflint-4.16.1-2.2310055.src.rpm

multiperf:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/multiperf-3.0-3.0.2310055.src.rpm

ofed-docs:
docs.git mlnx_ofed-4.0
commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc

openmpi:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/openmpi-4.1.7a1-1.2310055.src.rpm

opensm:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/opensm-5.17.0.MLNX20231105.d437ae0a-0.1.2310055.src.rpm

openvswitch:
https://gitlab-master.nvidia.com/sdn/ovs mlnx_ofed_23_10
commit c5eb44be9b4d8ee0f6b367b439c582f3487cc741

perftest:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/perftest-23.10.0-0.29.g0705c22.2310055.src.rpm

rdma-core:
mlnx_ofed/rdma-core.git mlnx_ofed_23_10
commit 6c404a661569eb130be94df67f8a98e6794c0c35

rshim:
mlnx_ofed_soc/rshim-2.0.17-0.g0caa378.src.rpm

sharp:
mlnx_ofed_sharp/sharp-3.5.1.MLNX20231116.7fcef5af.tar.gz

sockperf:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/sockperf-3.10-0.git5ebd327da983.2310055.src.rpm

srp:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_23_10
commit a675be032c722b022cd9166921fbc6195e7fe8ff

ucx:
mlnx_ofed_ucx/ucx-1.16.0-1.src.rpm

xpmem-lib:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/xpmem-lib-2.7-0.2310055.src.rpm

xpmem:
/sw/release/mlnx_ofed/IBHPC/OFED-internal-23.10-0.5.5/SRPMS/xpmem-2.7.3-1.2310055.src.rpm


Installed Packages:
-------------------

libibverbs-utils
sharp
mlnx-iproute2
libxpmem
perftest
ucx
rdma-core
opensm-static
librdmacm-utils
libibumad
ucx-devel
mlnx-ofa_kernel
mlnx-ofa_kernel-modules
mlnx-ethtool
libibverbs
ucx-knem
dpcp
ucx-xpmem
srp_daemon
knem
iser
opensm
isert
hcoll
ibacm
srp
librdmacm
ibsim
openmpi
ucx-ib
xpmem-modules
rshim
mlnx-tools
ucx-rdmacm
mlnxofed-docs
ucx-cma
ibdump
xpmem
mlnx-nvme
rdma-core-devel
knem-modules
opensm-devel
mlnx-ofa_kernel-source
mlnx-ofa_kernel-devel
opensm-libs
infiniband-diags
mpitests_openmpi
kernel-mft
ibutils2

ivanallen avatar Apr 24 '25 05:04 ivanallen

Are you using multiple threads? Can you pls check if setting UCX_IB_MMIO_MODE=db_lock fixes the issue? If it does you need to initialize UCP worker with threading support

brminich avatar Apr 24 '25 06:04 brminich

Are you using multiple threads? Can you pls check if setting UCX_IB_MMIO_MODE=db_lock fixes the issue? If it does you need to initialize UCP worker with threading support

Hi @brminich

We only have one thread polling the UCP worker in both client and server (one worker per thread; we use a run-to-completion threading model).

// worker
_params.thread_mode = UCS_THREAD_MODE_SINGLE;
_params.field_mask |= UCP_WORKER_PARAM_FIELD_THREAD_MODE;
_params.client_id = client_id;
_params.field_mask |= UCP_WORKER_PARAM_FIELD_CLIENT_ID;

// ep
_params = {}; // clear


_params.field_mask |= UCP_EP_PARAM_FIELD_SOCK_ADDR | 
           UCP_EP_PARAM_FIELD_FLAGS | 
           UCP_EP_PARAM_FIELD_ERR_HANDLER | 
           UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE;

_params.flags |= UCP_EP_PARAMS_FLAGS_CLIENT_SERVER | UCP_EP_PARAMS_FLAGS_SEND_CLIENT_ID;

_params.err_mode = UCP_ERR_HANDLING_MODE_PEER;
_params.err_handler.cb = ucx_complete_error_handler;
_params.err_handler.arg = _closure;
_params.sockaddr.addr = reinterpret_cast<struct sockaddr*>(&_address);
_params.sockaddr.addrlen = sizeof(_address);
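
For completeness, the populated structs are then passed to the standard creation calls, roughly like this sketch (worker_params/ep_params stand for the two _params structs filled above; _context, _worker and _ep are hypothetical wrapper members):

// Create the worker with the single-threaded params shown above.
ucs_status_t status = ucp_worker_create(_context, &worker_params, &_worker);
if (status != UCS_OK) { /* handle error */ }

// Create the client endpoint from the sockaddr-based params shown above.
status = ucp_ep_create(_worker, &ep_params, &_ep);
if (status != UCS_OK) { /* handle error */ }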

When I set UCX_IB_MMIO_MODE=db_lock UCX_TLS=rc,tcp UCX_PROTO_INFO=y, the program hangs with the following stack:

#0  0x00007ffff609cea2 in pthread_spin_lock@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007fffe522be30 in ucs_spin_lock (lock=<optimized out>) at /root/xrpc/ucx-1.18.0/src/ucs/type/spinlock.h:83
#2  uct_ib_mlx5_post_send (hw_ci_updated=1, wqe_size=<optimized out>, ctrl=<optimized out>, wq=0x512000024838) at /root/xrpc/ucx-1.18.0/src/uct/ib/mlx5/ib_mlx5.inl:588
#3  uct_rc_mlx5_common_post_send (log_sge=0x0, max_log_sge=<optimized out>, imm=0, wqe_size=<optimized out>, dci_channel=0, fm_ce_se=<optimized out>, opmod=0 '\000', opcode=10 '\n', txwq=0x512000024838, txqp=0x5120000247d0, qp_type=2,
    iface=0x52d00041a400) at rc/rc_mlx5.inl:469
#4  uct_rc_mlx5_txqp_dptr_post_iov (max_log_sge=<optimized out>, dci_channel=<optimized out>, fm_ce_se=<optimized out>, av_size=<optimized out>, opaque_mr=<optimized out>, ib_imm_be=<optimized out>, app_ctx=<optimized out>,
    tag=<optimized out>, rkey=<optimized out>, remote_addr=<optimized out>, am_hdr_len=<optimized out>, am_hdr=<optimized out>, am_id=<optimized out>, iovcnt=<optimized out>, iov=<optimized out>, opcode_flags=<optimized out>,
    txwq=<optimized out>, txqp=<optimized out>, qp_type=<optimized out>, iface=<optimized out>) at rc/rc_mlx5.inl:928
#5  uct_rc_mlx5_base_ep_zcopy_post (comp=0x52e00000a398, op_flags=0, handler=<optimized out>, wqe_flags=2 '\002', opaque_mr=0x0, ib_imm_be=0, app_ctx=0, tag=0, rdma_rkey=0, rdma_raddr=0, am_hdr_len=<optimized out>,
    am_hdr=<optimized out>, am_id=<optimized out>, iov_total_length=0, iovcnt=3, iov=0x7fffe34064d0, opcode=10, ep=0x5120000247c0) at rc/rc_mlx5.inl:1941
#6  uct_rc_mlx5_base_ep_am_zcopy (tl_ep=0x5120000247c0, id=<optimized out>, header=<optimized out>, header_length=<optimized out>, iov=0x7fffe34064d0, iovcnt=3, flags=0, comp=0x52e00000a398) at rc/rc_mlx5_ep.c:341
#7  0x00007ffff7759057 in uct_ep_am_zcopy (comp=0x52e00000a398, flags=0, iovcnt=<optimized out>, iov=0x7fffe34064d0, header_length=8, header=0x7fffe34062d8, id=<optimized out>, ep=<optimized out>)
    at /root/xrpc/ucx-1.18.0/src/uct/api/uct.h:3112
#8  ucp_am_eager_multi_zcopy_send_func (lane_shift=<synthetic pointer>, next_iter=0x7fffe3406490, lpriv=<optimized out>, req=0x52e00000a300) at am/eager_multi.c:279
#9  ucp_proto_multi_progress (dt_mask=5, complete_func=<optimized out>, send_func=<optimized out>, mpriv=0x522000042160, req=0x52e00000a300) at /root/xrpc/ucx-1.18.0/src/ucp/proto/proto_multi.inl:177
#10 ucp_proto_multi_zcopy_progress (uct_comp_cb=0x7ffff7758010 <ucp_am_eager_zcopy_completion>, complete_func=<optimized out>, send_func=<optimized out>, dt_mask=5, uct_mem_flags=256, init_func=<optimized out>, mpriv=0x522000042160,
    req=0x52e00000a300) at /root/xrpc/ucx-1.18.0/src/ucp/proto/proto_multi.inl:246
#11 ucp_am_eager_multi_zcopy_proto_progress (self=0x52e00000a3e0) at am/eager_multi.c:296
#12 0x00007ffff787ee95 in ucp_request_try_send (req=0x52e00000a300) at /root/xrpc/ucx-1.18.0/src/ucp/core/ucp_request.inl:307
#13 ucp_request_send (req=<optimized out>) at /root/xrpc/ucx-1.18.0/src/ucp/core/ucp_request.inl:330
#14 ucp_wireup_replay_pending_request (ucp_ep=0x7fffe03eb000, self=0x52e00000a3e0) at wireup/wireup.c:985
#15 ucp_wireup_replay_pending_requests (ucp_ep=ucp_ep@entry=0x7fffe03eb000, tmp_pending_queue=tmp_pending_queue@entry=0x7fffe34067c0) at wireup/wireup.c:997
#16 0x00007ffff787f412 in ucp_wireup_eps_progress (arg=0x7fffe03eb000) at wireup/wireup.c:503
#17 0x00007ffff76897c1 in ucs_callbackq_oneshot_elems_dispatch (cbq=0x50f000005860) at datastruct/callbackq.c:457
#18 ucs_callbackq_proxy_callback (arg=0x50f000005860) at datastruct/callbackq.c:477
#19 0x00007ffff77a2faa in ucs_callbackq_dispatch (cbq=<optimized out>) at /root/xrpc/ucx-1.18.0/src/ucs/datastruct/callbackq.h:215
#20 uct_worker_progress (worker=<optimized out>) at /root/xrpc/ucx-1.18.0/src/uct/api/uct.h:2813
#21 ucp_worker_progress (worker=0x7fffe3cd8800) at core/ucp_worker.c:3033
#22 0x0000000000f652e6 in xrpc::ucp::UcpWorker::progress (this=0x50300007efd0) at src/ucp/ucp_worker.cc:75
#23 0x00000000007325a0 in xrpc::Channel::poll (this=0x51b000055c80) at src/channel.cc:1008
#24 0x000000000059ab0b in pbrpcframework::PressClient::poll (this=0x5080000115a0, thread_id=0) at examples/xrpc/rpc_press/rpc_press_impl.cc:100
#25 0x00000000005a2926 in pbrpcframework::RpcPress::sync_client (this=0x51d000003280) at examples/xrpc/rpc_press/rpc_press_impl.cc:270
#26 0x000000000059faf2 in pbrpcframework::RpcPress::sync_call_thread (arg=0x51d000003280) at examples/xrpc/rpc_press/rpc_press_impl.cc:195
#27 0x00007ffff794966a in asan_thread_start(void*) () from /lib64/libasan.so.8
#28 0x00007ffff6095d22 in start_thread () from /lib64/libc.so.6
#29 0x00007ffff611ad40 in clone3 () from /lib64/libc.so.6

If I use UCX_IB_MMIO_MODE=db_lock UCX_TLS=rc_v,tcp UCX_PROTO_INFO=y, the problem still occurs.

ivanallen avatar Apr 24 '25 07:04 ivanallen

but is your client and/or server using multiple threads? Not for polling worker, but for something else? If yes, pls try to initialize UCP worker mode as UCS_THREAD_MODE_SERIALIZED instead of UCS_THREAD_MODE_SINGLE.

brminich avatar Apr 24 '25 08:04 brminich

@brminich

but is your client and/or server using multiple threads? Not for polling worker, but for something else? If yes, pls try to initialize UCP worker mode as UCS_THREAD_MODE_SERIALIZED instead of UCS_THREAD_MODE_SINGLE.

Yes, the client/server use multiple threads, but those threads have nothing to do with UCX. Only one thread uses the UCX worker.

I tried UCS_THREAD_MODE_SERIALIZED, but the issue persists.

ivanallen avatar Apr 24 '25 08:04 ivanallen

@brminich

I found some clues. When I send only a single piece of data (a contiguous buffer), only the first 8 bytes are corrupted.

-- client log (server response crc check failed)
[2025-04-24 09:29:39.137] [error] [controller.cc:171] desc:Controller done      
  request_id:19368        
  states:SEND,READY_TO_RECV,RECV,SEND_DONE,RECV_DONE,CRC32_SET    
  error_code:22   error_text:RPC CRC32 check failed       
  iov:0x525000901100:4098 
  body:[0A FF 1F 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D]
-- server log
[2025-04-24 09:29:39.136] [124545] [] [debug] worker.cc:334 desc:Received RPC message   
  data_length:4098        
  data_ptr:0x7fc57801ed49 
  hex:[3D 17 00 00 0D 00 00 00 46 47 48 49 4A 4B 4C 4D]   
  is_rndv:false   rpc_meta:request {
  service_name: "xrpc.example.EchoService"
  method_name: "xrpc.example.EchoService.Echo"
}
correlation_id: 19368
rpc_crc32: 700057043

ivanallen avatar Apr 24 '25 09:04 ivanallen

I tried several times and found that the corrupted data follows a pattern: within the same process the corrupted bytes are always the same, and a new value appears after each restart.

3D 17 00 00 0D 00 00 00 46 47 48 49 4A 4B 4C 4D
^  ^  ^  ^  ^  ^  ^  ^
62 27 00 00 0D 00 00 00 46 47 48 49 4A 4B 4C 4D
^  ^  ^  ^  ^  ^  ^  ^
23 17 00 00 0D 00 00 00 46 47 48 49 4A 4B 4C 4D
^  ^  ^  ^  ^  ^  ^  ^

The captured packet shows that the data was already incorrect on the wire (see the attached packet-capture screenshot).

The corresponding log shows that the data is still correct at the time ibv_post_send is called, and it is still correct after the RPC returns:

[2025-04-24 10:52:49.307] [debug] [channel.cc:95] desc:issue rpc        request_id:18338        body_length:4098        send_buffer_ptr:0x5250016bc100  body:[0A FF 1F 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D]
[1745491969.307624] [node-57:205502:1]          ucp_am.c:1036 UCX  REQ   allocated request 0x52e00000a300
[1745491969.307638] [node-57:205502:1]    proto_common.c:719  UCX  REQ   req 0x52e00000a300: am_send_reply(multi) from host memory length 4179 zero-copy rc_verbs/mlx5_bond_0:1/path0 6185.2 MB/s 0.64 us
[1745491969.307643] [node-57:205502:1]  proto_common.inl:172  UCX  REQ   req 0x52e00000a300: set to stage 0, progress function '???'
[1745491969.307648] [node-57:205502:1]     ucp_request.c:740  UCX  REQ   req 0x52e00000a300: progress am/egr/single/zcopy/reply {???} ep_cfg[3] rkey_cfg[255] offset 0/4098
[1745491969.307652] [node-57:205502:1]  proto_common.inl:73   UCX  REQ     req 0x52e00000a300: ucp_proto_request_zcopy_init for am/egr/single/zcopy/reply
[1745491969.307658] [node-57:205502:1]        rcache.inl:43   UCX  TRACE   ucp_rcache: lru remove region 0x50f0000aff90 [0x5250016bc100..0x5250016bd110] gt rw ref 2 md[0]=mlx5_bond_0
[1745491969.307663] [node-57:205502:1]        ucp_mm.inl:32   UCX  TRACE   memh 0x50f0000aff90: address 0x5250016bc100/0x5250016bc100 length 4098/4112 host md_map 1 obtained from rcache
[1745491969.307676] [node-57:205502:1]     rc_verbs_ep.c:54   UCX  DATA    QP 0x8f5f wrid 0x47a5 SEND s-e [va 0x7f74954c7e80[1a 12 00 01] len 9 lkey 0xecab] [va 0x5250016bc100[0a ff 1f 41] len 4098 lkey 0x48c062] [va 0x7f7496dfdf80[58 50
52 43] len 89 lkey 0x511cbc] -- am 26
[1745491969.307681] [node-57:205502:1]     ucp_request.c:756  UCX  REQ     req 0x52e00000a300: progress protocol am/egr/single/zcopy/reply returned: Success
[1745491969.307685] [node-57:205502:1]  proto_common.inl:256  UCX  DATA  request 0x52e00000a300 send.cb set to 0xf4c9cb, user data: 0x50a0027bfd40
[1745491969.307692] [node-57:205502:1]  proto_common.inl:261  UCX  REQ   returning send request 0x52e00000a300: am_send_reply 0/4098 contiguous buffer:0x5250016bc100
[1745491969.307699] [node-57:205502:1]        ucp_mm.inl:92   UCX  TRACE memh 0x50f0000aff90: release address 0x5250016bc100 length 4112 md_map 1
[1745491969.307705] [node-57:205502:1]        rcache.inl:29   UCX  TRACE ucp_rcache: lru add region 0x50f0000aff90 [0x5250016bc100..0x5250016bd110] gt rw ref 2 md[0]=mlx5_bond_0
[1745491969.307709] [node-57:205502:1]   ucp_request.inl:246  UCX  REQ   completing send request 0x52e00000a300 (0x52e00000a410) ---c-- Success
[1745491969.307715] [node-57:205502:1]     ucp_request.c:215  UCX  REQ   free request 0x52e00000a300 (0x52e00000a410) d--c--
[1745491969.307718] [node-57:205502:1]   ucp_request.inl:236  UCX  REQ   put request 0x52e00000a300
[1745491969.307981] [node-57:205502:1]   rc_verbs_impl.h:108  UCX  DATA  RECV qp 0x8f5f [va 0x7f749865b1c0 len 53] -- am 23
[2025-04-24 10:52:49.307] [debug] [channel.cc:350] desc:Received AM RPC message header_size:44  data_size:0
[2025-04-24 10:52:49.308] [debug] [channel.cc:388] desc:Received RPC message    body_size:0     is_rndv:false   rpc_meta:response {
  error_code: 22
  error_text: "RPC CRC32 check failed"
}
correlation_id: 18338

[2025-04-24 10:52:49.308] [error] [controller.cc:171] desc:Controller done      request_id:18338        states:SEND,READY_TO_RECV,RECV,SEND_DONE,RECV_DONE,CRC32_SET    error_code:22   error_text:RPC CRC32 check failed       iov:0x5250016b
c100:4098 body:[0A FF 1F 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D]

ivanallen avatar Apr 24 '25 11:04 ivanallen

@brminich

A new clue: the buffer actually sent by ibv_post_send is not the one seen in the log, but a buffer that had already been released.

Requests are sent serially: the next one is only sent after the result of the previous one has been received from the server, so in theory this situation should not be possible. I don't know what happened.

These buffers are returned to the system only after the request result has been obtained. I'm not sure how UCX manages the rcache. What happens if a buffer is returned to the system but its registration still exists in the rcache?
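
To illustrate the sequence I suspect, here is a hypothetical sketch (fill, send_am and wait_for_response are placeholder helpers, not our real code):

// Suspected sequence leading to stale data on the wire (illustrative only).
char *buf = new char[4096];     // allocator hands out virtual address X
fill(buf);
send_am(buf);                   // UCX registers X in the rcache -> lkey L
wait_for_response();
delete[] buf;                   // allocator may return the pages to the OS,
                                // but the rcache entry for X / lkey L remains

char *buf2 = new char[4096];    // allocator reuses the same virtual address X,
                                // now backed by different physical pages
fill(buf2);
send_am(buf2);                  // rcache hit on X -> the old lkey L is used,
                                // so the NIC reads the stale physical pages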

ivanallen avatar Apr 25 '25 03:04 ivanallen

--> client
[2025-04-25 05:53:56.702] [36789] request_id:2506 alloc_buffer:0x52500190a100 buffer:"XXXXXXXX 2506 ABCDE..."
[2025-04-25 05:53:56.702] [36789] request_id:2506 receive server response and release 0x52500190a100
[1745560436.702221] [node-57:36789:0]     rc_verbs_ep.c:54   UCX  DATA    QP 0x1b7e wrid 0x9cd SEND s-e 
  [va 0x7f14416e8e80[1a 12 00 01] len 9 lkey 0x121df] 
  [va 0x52500190a100[0a 8e 20 58] len 4113 lkey 0x290b89] 
  [va 0x7f1442ffdf80[58 50 52 43] len 88 lkey 0x1bc9c3] -- am 26

...

[2025-04-25 05:54:05.456] [36789] request_id:7682 alloc_buffer:0x52500190a100 buffer:"XXXXXXXX 7802 ABCDE..."
[1745560445.456262] [node-57:36789:0]     rc_verbs_ep.c:54   UCX  DATA    QP 0x1b7e wrid 0x1e05 SEND s-e 
  [va 0x7f14416e8e80[1a 12 00 01] len 9 lkey 0x121df] 
  [va 0x52500190a100[0a 8e 20 58] len 4113 lkey 0x290b89] 
  [va 0x7f1442ffdf80[58 50 52 43] len 88 lkey 0x1bc9c3] -- am 26

<-- server
[2025-04-25 05:54:05.456] [17291] worker.cc:418 request_id:7682 desc:CRC32 check failed
    crc:4273242618  expected:891675410
    hex:[E7 10 00 00 00 00 00 00 58 58 58 20 32 35 30 36]  "_____XXX 2506"

Request 7682 received the buffer data of request 2506.

My guess: request 2506 released the buffer, but the lkey still points to the physical pages that backed request 2506's buffer. Request 7682 then obtained the same virtual address, now backed by different physical memory, yet the cached lkey it uses is unchanged. That is how the old data ended up being sent.

When I allocated memory using the buffer pool, the problem no longer occurred.
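
For reference, the workaround boils down to a pool that keeps buffers alive instead of returning them to the OS; a minimal sketch (not our actual pool implementation):

// Minimal buffer-pool sketch: buffers are reused, never freed back to the OS,
// so their virtual-to-physical mapping (and any cached registration) stays valid.
#include <cstddef>
#include <vector>

class BufferPool {
public:
    explicit BufferPool(size_t buf_size) : buf_size_(buf_size) {}

    char *acquire() {
        if (free_.empty()) {
            return new char[buf_size_];   // grows the pool, never shrinks it
        }
        char *buf = free_.back();
        free_.pop_back();
        return buf;
    }

    void release(char *buf) { free_.push_back(buf); }  // keep for reuse

private:
    size_t             buf_size_;
    std::vector<char*> free_;
};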

@brminich @yosefe

ivanallen avatar Apr 25 '25 06:04 ivanallen

I'd need a few more details about your application. When you release the buffer, it should be unregistered and removed from the rcache.

Could you clarify the following:

  • How do you allocate and release memory at the point where the issue occurs?
  • Which buffer pool do you use when the problem goes away?
  • Are you calling any UCX functions from another thread (e.g., ucp_mem_map or similar routines)? Not necessarily UCP worker related

brminich avatar Apr 25 '25 14:04 brminich

@brminich

  1. Our usage is similar to the ucp_client_server example: we dynamically allocate and release memory with C++ new/delete. With this approach RDMA ends up sending old data, which results in the errors above.

I understand that UCX registers memory buffers dynamically, but this is transparent to the application. I don't know how to tell UCX to unregister a buffer and remove it from the rcache after I release it (see the sketch after this list for what explicit registration might look like).

  2. The buffer pool is a structure we wrote ourselves; it caches allocated memory and does not return it to the OS. Used this way, the problem does not occur.

  3. Our program does not use ucp_mem_map. Only one thread uses the UCX APIs; the other threads do background work such as monitoring.
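
For reference, explicit registration, which we have not tried, would look roughly like the following sketch (it assumes the documented ucp_mem_map/ucp_mem_unmap API; buf, buf_len and _context are placeholders):

// Sketch of explicit registration so the application controls registration
// lifetime instead of relying on the rcache.
ucp_mem_map_params_t mparams = {};
mparams.field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                     UCP_MEM_MAP_PARAM_FIELD_LENGTH;
mparams.address    = buf;
mparams.length     = buf_len;

ucp_mem_h memh;
ucs_status_t status = ucp_mem_map(_context, &mparams, &memh);
if (status != UCS_OK) { /* handle error */ }

// ... use buf for sends; if the UCX version supports it, the memh can be
// passed to ucp_am_send_nbx via UCP_OP_ATTR_FIELD_MEMH / param.memh ...

ucp_mem_unmap(_context, memh);   // unregister before the buffer is freed
delete[] buf;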

ivanallen avatar Apr 26 '25 07:04 ivanallen