
request not complete

Open luhp opened this issue 4 months ago • 2 comments

Describe the bug

When using tag-based communication, if a server-side request has received only part of a fragmented message and the peer then disconnects, the remaining fragments never arrive. The request is therefore never completed, and no error is returned: it remains stuck in tm->frag_hash indefinitely. Even attempts to cancel the request may not take effect.
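For context on the error-handling question that usually comes up here: UCX only delivers peer-failure notifications on endpoints created with error handling enabled. A minimal sketch (untested, for illustration; accept_with_err_handling is a hypothetical helper name) of a server-side endpoint created from a connection request with UCP_ERR_HANDLING_MODE_PEER:

```c
/* Sketch: server-side UCP endpoint with error handling enabled,
 * so a peer disconnect surfaces via err_handler instead of leaving
 * requests silently pending. */
#include <ucp/api/ucp.h>
#include <stdio.h>

static void err_cb(void *arg, ucp_ep_h ep, ucs_status_t status)
{
    /* Fires e.g. on "Connection reset by remote peer"; a real server
     * would cancel or fail its outstanding tag receives here. */
    fprintf(stderr, "ep %p failed: %s\n", (void *)ep,
            ucs_status_string(status));
}

ucs_status_t accept_with_err_handling(ucp_worker_h worker,
                                      ucp_conn_request_h conn_req,
                                      ucp_ep_h *ep_p)
{
    ucp_ep_params_t params = {0};

    params.field_mask      = UCP_EP_PARAM_FIELD_CONN_REQUEST |
                             UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE |
                             UCP_EP_PARAM_FIELD_ERR_HANDLER;
    params.conn_request    = conn_req;
    params.err_mode        = UCP_ERR_HANDLING_MODE_PEER;
    params.err_handler.cb  = err_cb;
    params.err_handler.arg = NULL;

    return ucp_ep_create(worker, &params, ep_p);
}
```

The log below shows "calling user error callback", so a callback of this kind does run in this setup; the bug is that the partially received request on frag_hash is not failed along with the endpoint.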

Setup and versions

# ucx_info -v
# Library version: 1.18.0
# Library path: /usr/lib64/libucs.so.0
# API headers version: 1.18.0
# Git branch '', revision 152bf42
# Configured with: --build=x86_64-culinux-linux-gnu --host=x86_64-culinux-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-mlx5 --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --without-mad --without-ze --disable-numa

log

[1756465240.257843] [ebsnvme0002:784957:1] tag_match.c:136 UCX REQ matched received tag 16fc04978 to req 0x7fffc1a08c40
[1756465240.257846] [ebsnvme0002:784957:1] eager_rcv.c:27 UCX REQ found req 0x7fffc1a08c40
[1756465240.257849] [ebsnvme0002:784957:1] ucp_request.inl:588 UCX REQ req 0x7fffc1a08c40: unpack recv_data req_len 65536 data_len 8230 offset 0
[1756465240.258085] [ebsnvme0002:784957:1] rc_mlx5.inl:399 UCX DATA RECV qp 0xab64 [va 0x7fff9b1236c0 len 8256] -- am 4 EGR_M msgid af481236d392a129 offset 8230
[1756465240.258089] [ebsnvme0002:784957:1] ucp_request.inl:588 UCX REQ req 0x7fffc1a08c40: unpack recv_data req_len 65536 data_len 8238 offset 8230
[1756465240.258314] [ebsnvme0002:784957:1] rc_mlx5.inl:399 UCX DATA RECV qp 0xab64 [va 0x7fff9b121600 len 8256] -- am 4 EGR_M msgid af481236d392a129 offset 16468
[1756465240.258318] [ebsnvme0002:784957:1] ucp_request.inl:588 UCX REQ req 0x7fffc1a08c40: unpack recv_data req_len 65536 data_len 8238 offset 16468
[1756465240.258544] [ebsnvme0002:784957:1] rc_mlx5.inl:399 UCX DATA RECV qp 0xab64 [va 0x7fff9b11f540 len 8256] -- am 4 EGR_M msgid af481236d392a129 offset 24706
[1756465240.258548] [ebsnvme0002:784957:1] ucp_request.inl:588 UCX REQ req 0x7fffc1a08c40: unpack recv_data req_len 65536 data_len 8238 offset 24706
[1756465240.258774] [ebsnvme0002:784957:1] rc_mlx5.inl:399 UCX DATA RECV qp 0xab64 [va 0x7fff9b11d480 len 8256] -- am 4 EGR_M msgid af481236d392a129 offset 32944
[1756465240.258778] [ebsnvme0002:784957:1] ucp_request.inl:588 UCX REQ req 0x7fffc1a08c40: unpack recv_data req_len 65536 data_len 8238 offset 32944
[1756465240.259098] [ebsnvme0002:784957:1] rc_mlx5.inl:399 UCX DATA RECV qp 0xab64 [va 0x7fff9b11b3c0 len 8256] -- am 4 EGR_M msgid af481236d392a129 offset 41182
[1756465240.259102] [ebsnvme0002:784957:1] ucp_request.inl:588 UCX REQ req 0x7fffc1a08c40: unpack recv_data req_len 65536 data_len 8238 offset 41182
[1756465240.259176] [ebsnvme0002:784957:a] rdmacm_cm.c:739 UCX TRACE rdmacm event (fd=107 cm_id 0x7fff9c000ce0 cm 0x7fffc19da320 event_channel 0x7fffc14ff660 status Success (0)): RDMA_CM_EVENT_DISCONNECTED. Peer: 172.25.8.103:40754.
[1756465240.259230] [ebsnvme0002:784957:a] rdmacm_cm.c:659 UCX DEBUG [cep 0x7fffc11fefa0 172.25.8.102:30010->172.25.8.103:40754 server Success] got disconnect event, status Success (0)
[1756465240.259260] [ebsnvme0002:784957:a] wireup_cm.c:948 UCX TRACE ep 0x7ffff7292000 flags 0x3324293: remote disconnect callback invoked
[1756465240.259273] [ebsnvme0002:784957:a] rdmacm_cm.c:35 UCX TRACE ack event 0x7fff9c001690, cm_id 0x7fff9c000ce0
[1756465240.259309] [ebsnvme0002:784957:1] wireup_cm.c:895 UCX TRACE ep 0x7ffff7292000: got remote disconnect, cm_ep 0x7fffc11fefa0, flags 0x3324293
[1756465240.259316] [ebsnvme0002:784957:1] wireup_cm.c:853 UCX TRACE ep 0x7ffff7292000: flags 0x3324293 cm_remote_disconnect_progress
[1756465240.259324] [ebsnvme0002:784957:1] ucp_ep.c:1471 UCX DEBUG ep 0x7ffff7292000: set_ep_failed status Connection reset by remote peer on lane[0]=0x7fffc11fefa0
[1756465240.259404] [ebsnvme0002:784957:1] rdmacm_cm_ep.c:769 UCX DEBUG [cep 0x7fffc11fefa0 172.25.8.102:30010->172.25.8.103:40754 server Success]: (id=0x7fff9c000ce0) disconnected from peer 172.25.8.103:40754
[1756465240.259410] [ebsnvme0002:784957:1] ucp_ep.c:1434 UCX DEBUG ep 0x7ffff7292000: discarding lanes
[1756465240.259416] [ebsnvme0002:784957:1] ucp_ep.c:1442 UCX DEBUG ep 0x7ffff7292000: discard uct_ep[0]=0x7fffc11fefa0
[1756465240.259423] [ebsnvme0002:784957:1] ucp_worker.c:3558 UCX REQ allocated request 0x7fffc1a08dc0
[1756465240.259430] [ebsnvme0002:784957:1] ucp_worker.c:3588 UCX DATA request 0x7fffc1a08dc0 send.cb set to 0x7ffff7e63610, user data: 0x7fffc1f0c5a0
[1756465240.259435] [ebsnvme0002:784957:1] ucp_worker.c:2716 UCX REQ req 0x7fffc1a08dc0: discard_uct_ep flush completion status Success
[1756465240.259441] [ebsnvme0002:784957:1] ucp_ep.c:1442 UCX DEBUG ep 0x7ffff7292000: discard uct_ep[1]=0x7fffc08c89f0
[1756465240.259445] [ebsnvme0002:784957:1] ucp_worker.c:3558 UCX REQ allocated request 0x7fffc1a08f40
[1756465240.259449] [ebsnvme0002:784957:1] ucp_worker.c:3588 UCX DATA request 0x7fffc1a08f40 send.cb set to 0x7ffff7e63610, user data: 0x7fffc1f0c5a0
[1756465240.259456] [ebsnvme0002:784957:1] rc_mlx5.inl:463 UCX DATA QP 0xab64 [944] NOP sf-
[1756465240.259466] [ebsnvme0002:784957:1] ib_mlx5.c:910 UCX DEBUG device uverbs0: modify QP 0x7fffc08c8a68 num 0xab64 to state 6
[1756465240.260041] [ebsnvme0002:784957:1] mpool.c:281 UCX DEBUG mpool send-ops-mpool: allocated chunk 0x7fffc00065d0 of 16472 bytes with 256 elements
[1756465240.260049] [ebsnvme0002:784957:1] ucp_ep.c:1442 UCX DEBUG ep 0x7ffff7292000: discard uct_ep[2]=0x7fffc10144b0
[1756465240.260054] [ebsnvme0002:784957:1] ucp_worker.c:3558 UCX REQ allocated request 0x7fffc1a090c0
[1756465240.260058] [ebsnvme0002:784957:1] ucp_worker.c:3588 UCX DATA request 0x7fffc1a090c0 send.cb set to 0x7ffff7e63610, user data: 0x7fffc1f0c5a0
[1756465240.260065] [ebsnvme0002:784957:1] ucp_worker.c:2716 UCX REQ req 0x7fffc1a090c0: discard_uct_ep flush completion status Success
[1756465240.260072] [ebsnvme0002:784957:1] ucp_ep.c:3462 UCX DEBUG ep 0x7ffff7292000: calling user error callback 0x53ad9a with arg 0x7fffc00022c0 and status Connection reset by remote peer
[1756465240.260101] [ebsnvme0002:784957:a] ib_device.c:468 UCX DEBUG IB Async event on mlx5_0: SRQ-attached QP 0xab64 was flushed

[1756465243.263595] [ebsnvme0002:784957:a] rdmacm_cm.c:739 UCX TRACE rdmacm event (fd=107 cm_id 0x7fff9c000ce0 cm 0x7fffc19da320 event_channel 0x7fffc14ff660 status Success (0)): RDMA_CM_EVENT_TIMEWAIT_EXIT. Peer: 172.25.8.103:40754.
[1756465243.263616] [ebsnvme0002:784957:a] rdmacm_cm.c:35 UCX TRACE ack event 0x7fff9c001690, cm_id 0x7fff9c000ce0
[1756465243.264017] [ebsnvme0002:784957:1] tag_match.c:93 UCX REQ can't remove req 0x7fffc1a08c40 (already matched)

luhp avatar Sep 10 '25 07:09 luhp

-- ucp_eager_tagged_handler
	-- ucp_tag_exp_search  // request removed from the expected queue
	-- if UCP_RECV_DESC_FLAG_EAGER_ONLY
		-- ucp_request_recv_data_unpack
		-- ucp_request_complete_tag_recv
	-- else
		-- ucp_request_process_recv_data
			-- handle data
		-- if status == UCS_INPROGRESS
			-- ucp_tag_frag_list_process_queue
				-- ucp_tag_frag_hash_init_exp  // request not completed, put it on frag_hash

luhp avatar Sep 10 '25 07:09 luhp

Hi @luhp, thanks for reporting the issue. Do you have a minimal reproducer? Do you use HW tag matching? Is error handling enabled when creating the UCP EP?

gleon99 avatar Sep 21 '25 09:09 gleon99