When using shared memory communication, ucp_am_send_nbx hangs and the callback is not invoked
Describe the bug
We are writing an RPC framework based on UCX active messages (AM). The RPC client uses ucp_am_send_nbx to send request messages; after the server processes a request, it sends the response back to the client, also via ucp_am_send_nbx.
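A minimal sketch of the pattern we use (simplified; AM_ID_RPC and on_send_done are illustrative names, not the framework's real code):

#include <ucp/api/ucp.h>

#define AM_ID_RPC 23  /* illustrative AM id */

static void on_send_done(void *request, ucs_status_t status, void *user_data)
{
    /* In the hang described in this report, this callback was never invoked. */
    (void)status; (void)user_data;
    ucp_request_free(request);
}

static void send_rpc_request(ucp_ep_h ep, const void *hdr, size_t hdr_len,
                             const void *body, size_t body_len)
{
    ucp_request_param_t param;
    ucs_status_ptr_t    req;

    param.op_attr_mask = UCP_OP_ATTR_FIELD_CALLBACK | UCP_OP_ATTR_FIELD_FLAGS;
    param.cb.send      = on_send_done;
    param.flags        = UCP_AM_SEND_FLAG_REPLY;

    req = ucp_am_send_nbx(ep, AM_ID_RPC, hdr, hdr_len, body, body_len, &param);
    if (UCS_PTR_IS_ERR(req)) {
        /* handle send error */
    }
    /* req == NULL means the send completed immediately; otherwise completion
     * is reported through on_send_done while the worker is progressed. */
}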
While testing UCX_TLS=shm,tcp with rpc_press (a benchmarking tool for the RPC framework), the press tool could hang if I interrupted it with Ctrl-C and then ran rpc_press again.
At that point, using another simple echo client to send a request to the server also hangs and gets no response.
After analysis, we found that the client's ucp_am_send_nbx callback was never executed.
With UCX_TLS=tcp everything works; with UCX_TLS=shm,tcp it does not.
To recover, you must restart the server.
Steps to Reproduce
- Command line
xrpc_server is a simple echo server based on UCX.
UCX_PROTO_INFO=y UCX_TCP_CM_REUSEADDR=y UCX_MM_ERROR_HANDLING=y ./build/linux/x86_64/releasedbg/xrpc_server --attachment_size=0 --xrpc_mem_resource=default --log_level info --attachment_size 0 --num_workers=8 --enable_wakeup_mode=false
rpc_press starts 32 threads that connect to xrpc_server and send requests.
UCX_TLS=shm,tcp UCX_PROTO_INFO=y UCX_MM_ERROR_HANDLING=y UCX_LOG_LEVEL=info ./build/linux/x86_64/releasedbg/rpc_press -proto ./examples/xrpc/demo/echo.proto -method=example.EchoService.Echo -server 10.16.26.223:8123 -input ./input.json -qps 0 --duration 190 -attachment_size 4096 --xrpc_mem_resource=default --thread_num 32 --qps 0
output:
[2024-12-12 11:28:33.623] [info] [channel.cc:629] desc:Connected to worker worker:7 host:10.16.26.223 port:8123
[1734002913.623916] [node223:41158:32] +-------------------------------+----------------------------------------------------------------------------------+
[1734002913.623925] [node223:41158:32] | xrpc_channel intra-node cfg#0 | active message by ucp_am_send* with reply flag(multi) from iov[2] host memory |
[1734002913.623930] [node223:41158:32] +-------------------------------+--------------------------------------------------------------------+-------------+
[1734002913.623934] [node223:41158:32] | 0..8240 | copy-in | sysv/memory |
[1734002913.623939] [node223:41158:32] | 8241..593437 | multi-frag copy-in | sysv/memory |
[1734002913.623943] [node223:41158:32] | 593438..inf | (?) rendezvous fragmented copy-in copy-out | sysv/memory |
[1734002913.623948] [node223:41158:32] +-------------------------------+--------------------------------------------------------------------+-------------+
2024/12/12-11:28:34 sent:1532179 success:1532205 error:0 total_error:0 total_sent:1839265
2024/12/12-11:28:35 sent:2371318 success:2371316 error:0 total_error:0 total_sent:4210583
2024/12/12-11:28:36 sent:2404221 success:2404222 error:0 total_error:0 total_sent:6614804
2024/12/12-11:28:37 sent:2400176 success:2400176 error:0 total_error:0 total_sent:9014980
2024/12/12-11:28:38 sent:2397321 success:2397320 error:0 total_error:0 total_sent:11412301
2024/12/12-11:28:39 sent:2330092 success:2330096 error:0 total_error:0 total_sent:13742393
[Latency]
avg 23 us
50% 12 us
70% 13 us
90% 15 us
95% 16 us
97% 17 us
99% 20 us
99.9% 31 us
99.99% 549 us
max 4586555 us
^C
Start rpc_press again, but it hangs:
UCX_TLS=shm,tcp UCX_PROTO_INFO=y UCX_MM_ERROR_HANDLING=y UCX_LOG_LEVEL=info ./build/linux/x86_64/releasedbg/rpc_press -proto ./examples/xrpc/demo/echo.proto -method=example.EchoService.Echo -server 10.16.26.223:8123 -input ./input.json -qps 0 --duration 190 -attachment_size 4096 --xrpc_mem_resource=default --thread_num 32 --qps 0
Using xrpc_client to send a request hangs too:
UCX_TLS=shm,tcp UCX_PROTO_INFO=y UCX_MM_ERROR_HANDLING=y UCX_LOG_LEVEL=data ./build/linux/x86_64/releasedbg/xrpc_client --host 10.16.26.223 --port 8123
- UCX version used: ucx 1.17
Setup and versions
- OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
Linux node223 4.18.0-305.3.1.el8_4.x86_64 #1 SMP Thu Jun 17 07:52:48 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
ucx log: ucx_log.txt
Hi, in order to fully support Ctrl-C, the UCX connection should be established using an IP address; for example, see https://github.com/openucx/ucx/blob/master/examples/ucp_client_server.c and https://github.com/openucx/ucx/blob/79de87a3c8d4d58336fcc9bf93fbfcdbab34dc62/test/apps/iodemo/ucx_wrapper.cc#L813-L840. Is this what is being done in the RPC framework?
@yosefe Thank you for your attention to my question.
Strictly speaking, I use the same method as ucp_hello_world and ucx_perftest.
First, the client uses an extra out-of-band (OOB) TCP connection to ask the server for the addresses of all its workers, and then uses those addresses to create all the ucp_ep objects:
_params.field_mask |= UCP_EP_PARAM_FIELD_REMOTE_ADDRESS;
_params.address = worker_address;
The client establishes a full mesh of connections to all the server workers so it can hash each request onto a specific worker thread for processing.
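In code, the per-worker endpoint creation is roughly the following (a sketch under the assumptions above; the worker address is whatever we received over the OOB TCP channel):

#include <ucp/api/ucp.h>

/* Server side: each worker publishes its address over the OOB TCP channel. */
static ucs_status_t publish_worker_address(ucp_worker_h worker)
{
    ucp_address_t *addr;
    size_t         addr_len;
    ucs_status_t   status = ucp_worker_get_address(worker, &addr, &addr_len);

    if (status == UCS_OK) {
        /* ... send (addr, addr_len) to the client over TCP ... */
        ucp_worker_release_address(worker, addr);
    }
    return status;
}

/* Client side: one ep per server worker, created from the received address. */
static ucs_status_t connect_to_server_worker(ucp_worker_h client_worker,
                                             const ucp_address_t *worker_address,
                                             ucp_ep_h *ep_p)
{
    ucp_ep_params_t params = {0};

    params.field_mask = UCP_EP_PARAM_FIELD_REMOTE_ADDRESS;
    params.address    = worker_address;
    return ucp_ep_create(client_worker, &params, ep_p);
}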
Does this method not support using ctrl-c to interrupt the process?
If not, should I create ucp_listener objects on different ports, one per thread?
No, it doesn't. In the above use case, you could still create a single listener object and, when it accepts a new conn_request, create the server endpoint on a worker of your choice.
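Something along these lines (a sketch; pick_worker and server_ctx_t are illustrative placeholders for the server's own dispatch logic):

#include <ucp/api/ucp.h>
#include <sys/socket.h>

typedef struct server_ctx server_ctx_t;          /* illustrative */
ucp_worker_h pick_worker(server_ctx_t *ctx);     /* illustrative dispatch */

/* Single listener; the connection handler creates the server-side endpoint
 * on whichever worker the server picks for this connection. */
static void conn_handler(ucp_conn_request_h conn_request, void *arg)
{
    server_ctx_t   *ctx       = arg;
    ucp_worker_h    worker    = pick_worker(ctx);
    ucp_ep_params_t ep_params = {0};
    ucp_ep_h        ep;

    ep_params.field_mask   = UCP_EP_PARAM_FIELD_CONN_REQUEST;
    ep_params.conn_request = conn_request;
    ucp_ep_create(worker, &ep_params, &ep);
}

static ucs_status_t start_listener(ucp_worker_h listener_worker,
                                   const struct sockaddr *addr, socklen_t addrlen,
                                   server_ctx_t *ctx, ucp_listener_h *listener_p)
{
    ucp_listener_params_t lp = {0};

    lp.field_mask       = UCP_LISTENER_PARAM_FIELD_SOCK_ADDR |
                          UCP_LISTENER_PARAM_FIELD_CONN_HANDLER;
    lp.sockaddr.addr    = addr;
    lp.sockaddr.addrlen = addrlen;
    lp.conn_handler.cb  = conn_handler;
    lp.conn_handler.arg = ctx;
    return ucp_listener_create(listener_worker, &lp, listener_p);
}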
Hi @yosefe, Thanks for answering.
But if I do that, I don't know how to hash a request to a specific worker: I cannot know which server worker corresponds to the ep created on the client. Do you have any good ideas?
@ivanallen Two possible options are to extract the client address or client_id from the connection request, see https://github.com/openucx/ucx/blob/fa01cca77754edb7dd510190640c5feb9fb2b366/src/ucp/api/ucp.h#L1520C3-L1520C26
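For reference, querying those attributes from the connection request looks roughly like this (a sketch based on the API referenced above; error handling omitted):

#include <ucp/api/ucp.h>

/* Sketch: inspect an incoming connection request in the listener's
 * connection handler, before creating the endpoint. */
static void inspect_conn_request(ucp_conn_request_h conn_request)
{
    ucp_conn_request_attr_t attr = {0};

    attr.field_mask = UCP_CONN_REQUEST_ATTR_FIELD_CLIENT_ADDR |
                      UCP_CONN_REQUEST_ATTR_FIELD_CLIENT_ID;
    if (ucp_conn_request_query(conn_request, &attr) == UCS_OK) {
        /* attr.client_address: the client's socket address (IP:port)
         * attr.client_id:      the 64-bit id the client assigned to its worker */
    }
}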
Hi @yosefe
Yes, I know this API, but it only tells the server which client worker a request came from. As a client, I still don't know how to send a request to a specific worker on the server.
For example, as in the following picture, worker0 on the client would like to send a request to worker0 on the server.
I see. So to give the client precise control, you can have multiple listeners on different ports.
@yosefe Thank you very much.
Finally, I would like to ask why Ctrl-C is not supported when connecting via the worker address. Is there any particular concern behind this, or do you have any material I can refer to?
In order to respond to Ctrl-C, there must be a kernel-based "channel" that sends a FIN and lets the other side detect that the connection is closed. By the current design, that kind of channel is created in UCX only with IP/port-based connection establishment.
Hi @yosefe
I have reworked the code and upgraded to 1.18.0-rc2, but the problem still exists, with one difference.
Now the ucp_am_send_nbx callback on the client side does execute, but the AM handler on the server side is never triggered.
Among the server's eight workers, there are always a few that stop receiving requests. This only occurs at extremely high QPS.
The following is the log (see the attached UCX log). You can see that the client's request_id:1 is sent and answered successfully, but request_id:2 gets no response.
request_id:1 goes to worker0 on the server and that channel is fine, but request_id:2 goes to worker1 and that channel does not respond. (We use round-robin to spread requests across the server workers.)
client log:
[2024-12-17 04:15:16.065] [66723] [] [debug] channel.cc:238 desc:Received AM RPC message header_size:16 data_size:15
[2024-12-17 04:15:16.066] [66723] [] [debug] channel.cc:276 desc:Received RPC message body_size:15 is_rndv:false rpc_meta:response {
}
correlation_id: 1
[2024-12-17 04:15:16.066] [66723] [] [info] client.cc:104 request_id:1 sync test latency:740707µs actual:740836181ns
[2024-12-17 04:15:16.066] [66723] [] [info] client.cc:112 request_id:1 Received response:Hello, world!
[2024-12-17 04:15:16.066] [66723] [] [debug] channel.cc:427 desc:Sending RPC message rpc_header:RpcHeader{magic:1129467992, meta_size:53, body_size:41} rpc_meta:request {
service_name: "example.EchoService"
method_name: "example.EchoService.Echo"
}
correlation_id: 2
attachment_size: 26
[2024-12-17 04:15:18.325] [66723] [] [warning] channel.cc:491 desc:request slow client_id:0 request_id:2 q:inflight states:SEND,READY_TO_RECV,SEND_DONE elapse:2s
[2024-12-17 04:15:19.325] [66723] [] [warning] channel.cc:491 desc:request slow client_id:0 request_id:2 q:inflight states:SEND,READY_TO_RECV,SEND_DONE elapse:3s
[2024-12-17 04:15:20.325] [66723] [] [warning] channel.cc:491 desc:request slow client_id:0 request_id:2 q:inflight states:SEND,READY_TO_RECV,SEND_DONE elapse:4s
server log:
[2024-12-17 04:15:15.325] [66632] [] [info] message_server.cc:42 desc:session come in remote:10.16.26.223:49326
[2024-12-17 04:15:15.325] [66632] [] [info] server.cc:129 desc:Handshake message received msg:hello xrpc
[2024-12-17 04:15:15.473] [66634] [] [info] worker.cc:101 desc:connection request received client:[email protected]:39326
[2024-12-17 04:15:15.480] [66637] [] [info] worker.cc:101 desc:connection request received client:[email protected]:49514
[2024-12-17 04:15:15.495] [66636] [] [info] worker.cc:101 desc:connection request received client:[email protected]:49790
[2024-12-17 04:15:15.510] [66638] [] [info] worker.cc:101 desc:connection request received client:[email protected]:36860
[2024-12-17 04:15:15.524] [66635] [] [info] worker.cc:101 desc:connection request received client:[email protected]:33430
[2024-12-17 04:15:15.539] [66633] [] [info] worker.cc:101 desc:connection request received client:[email protected]:41722
[2024-12-17 04:15:15.558] [66640] [] [info] worker.cc:101 desc:connection request received client:[email protected]:57184
[2024-12-17 04:15:15.578] [66639] [] [info] worker.cc:101 desc:connection request received client:[email protected]:38456
[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:194 desc:Received RPC message data_length:41 data_ptr:0x7fbe9036fec8 is_rndv:false rpc_meta:request {
service_name: "example.EchoService"
method_name: "example.EchoService.Echo"
}
correlation_id: 1
attachment_size: 26
[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:362 desc:return RPC message rpc_header:RpcHeader{magic:1129467992, meta_size:4, body_size:15} worker_id:0 request_id:1 states:RECV,RECV_DONE rpc_meta:response {
}
correlation_id: 1
[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:422 desc:client specified send flags send_flags:0
[2024-12-17 04:15:16.065] [66633] [] [debug] worker.cc:412 desc:RPC message return success worker_id:0 request_id:1 states:SEND,RECV,SEND_DONE,RECV_DONE
[2024-12-17 04:15:21.041] [66639] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:38456
[2024-12-17 04:15:21.041] [66640] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:57184
[2024-12-17 04:15:21.041] [66635] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:33430
[2024-12-17 04:15:21.041] [66634] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:39326
[2024-12-17 04:15:21.041] [66636] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:49790
[2024-12-17 04:15:21.041] [66637] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:49514
[2024-12-17 04:15:21.041] [66638] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:36860
[2024-12-17 04:15:21.041] [66633] [] [warning] worker.cc:106 desc:connection error status:Connection reset by remote peer code:-25 client:[email protected]:41722
[2024-12-17 04:15:21.041] [66632] [] [info] message_server.cc:38 desc:session close remote:10.16.26.223:49326
client ucx log: ucx.log2.txt
(gdb) p iface->read_index_elem[0]
$23 = {
flags = 1 '\001',
am_id = 26 '\032',
length = 4193,
desc = {
seg_id = 9732102,
seg_size = 6291456,
offset = 5265344
},
desc_data = 0x7f2d9a5057c0
}
(gdb) p iface->read_index_elem[1]
$24 = {
flags = 0 '\000',
am_id = 0 '\000',
length = 0,
desc = {
seg_id = 0,
seg_size = 0,
offset = 0
},
desc_data = 0x0
}
(gdb) p iface->recv_fifo_ctl[0]
$25 = {
head = 239249,
signal_addrlen = 8,
signal_sockaddr = {
sun_family = 1,
sun_path = "\000\060\062c9b", '\000' <repeats 101 times>
},
pad0 = "\000\000\000\000\000",
tail = 239104,
pid = 87383
}
(gdb) p iface->read_index
$26 = 239204
(gdb)
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)(((iface->read_index & iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$28 = {
flags = 1 '\001',
am_id = 26 '\032',
length = 4193,
desc = {
seg_id = 9732102,
seg_size = 6291456,
offset = 5265344
},
desc_data = 0x7f2d9a5057c0
}
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)((((iface->read_index + 1)& iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$29 = {
flags = 0 '\000',
am_id = 1 '\001',
length = 77,
desc = {
seg_id = 9732102,
seg_size = 6291456,
offset = 4904832
},
desc_data = 0x7f2d9a4ad780
}
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)((((iface->read_index + 2)& iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$30 = {
flags = 0 '\000',
am_id = 1 '\001',
length = 30,
desc = {
seg_id = 9732102,
seg_size = 6291456,
offset = 5055744
},
desc_data = 0x7f2d9a4d2500
}
(gdb) p ((uct_mm_fifo_element_t *)((void *)((intptr_t)(iface->recv_fifo_elems) + (intptr_t)((((iface->read_index + 3)& iface->fifo_mask)) * (iface)->config.fifo_elem_size))))[0]
$31 = {
flags = 0 '\000',
am_id = 26 '\032',
length = 4191,
desc = {
seg_id = 9732102,
seg_size = 6291456,
offset = 6103744
},
desc_data = 0x7f2d9a5d22c0
}
(gdb) p iface[0]
$32 = {
super = {
super = {
super = {
ops = {
ep_put_short = 0x5407a0 <uct_sm_ep_put_short>,
ep_put_bcopy = 0x5407d0 <uct_sm_ep_put_bcopy>,
ep_put_zcopy = 0x0,
ep_get_short = 0x0,
ep_get_bcopy = 0x5407e0 <uct_sm_ep_get_bcopy>,
ep_get_zcopy = 0x0,
ep_am_short = 0x543250 <uct_mm_ep_am_short>,
ep_am_short_iov = 0x543470 <uct_mm_ep_am_short_iov>,
ep_am_bcopy = 0x543750 <uct_mm_ep_am_bcopy>,
ep_am_zcopy = 0x0,
ep_atomic_cswap64 = 0x540a50 <uct_sm_ep_atomic_cswap64>,
ep_atomic_cswap32 = 0x540a70 <uct_sm_ep_atomic_cswap32>,
ep_atomic32_post = 0x540810 <uct_sm_ep_atomic32_post>,
ep_atomic64_post = 0x540880 <uct_sm_ep_atomic64_post>,
ep_atomic32_fetch = 0x5409a0 <uct_sm_ep_atomic32_fetch>,
ep_atomic64_fetch = 0x5408f0 <uct_sm_ep_atomic64_fetch>,
ep_tag_eager_short = 0x0,
ep_tag_eager_bcopy = 0x0,
ep_tag_eager_zcopy = 0x0,
ep_tag_rndv_zcopy = 0x0,
ep_tag_rndv_cancel = 0x0,
ep_tag_rndv_request = 0x0,
iface_tag_recv_zcopy = 0x0,
iface_tag_recv_cancel = 0x0,
ep_pending_add = 0x543a10 <uct_mm_ep_pending_add>,
ep_pending_purge = 0x543b40 <uct_mm_ep_pending_purge>,
ep_flush = 0x543ba0 <uct_mm_ep_flush>,
ep_fence = 0x540c00 <uct_sm_ep_fence>,
ep_check = 0x543bf0 <uct_mm_ep_check>,
ep_create = 0x543170 <uct_mm_ep_t_new>,
ep_connect = 0x0,
ep_disconnect = 0x0,
cm_ep_conn_notify = 0x0,
ep_destroy = 0x543220 <uct_mm_ep_t_delete>,
ep_get_address = 0x0,
ep_connect_to_ep = 0x0,
iface_accept = 0x0,
iface_reject = 0x0,
iface_flush = 0x540c60 <uct_mm_iface_flush>,
iface_fence = 0x540bf0 <uct_sm_iface_fence>,
iface_progress_enable = 0x53dfa0 <uct_base_iface_progress_enable>,
iface_progress_disable = 0x53dfb0 <uct_base_iface_progress_disable>,
iface_progress = 0x541390 <uct_mm_iface_progress>,
iface_event_fd_get = 0x540c80 <uct_mm_iface_event_fd_get>,
iface_event_arm = 0x5410a0 <uct_mm_iface_event_fd_arm>,
iface_close = 0x541070 <uct_mm_iface_t_delete>,
iface_query = 0x540ea0 <uct_mm_iface_query>,
iface_get_device_address = 0x540bc0 <uct_sm_iface_get_device_address>,
iface_get_address = 0x540c30 <uct_mm_iface_get_address>,
iface_is_reachable = 0x53dc20 <uct_base_iface_is_reachable>
}
},
internal_ops = 0x85c020 <uct_mm_iface_internal_ops>,
--Type <RET> for more, q to quit, c to continue without paging--
md = 0x25f7280,
worker = 0x27fabb0,
am = {{
cb = 0x53cd60 <uct_iface_stub_am_handler>,
arg = 0x0,
flags = 4
}, {
cb = 0x50b7b0 <ucp_wireup_msg_handler>,
arg = 0x305a000,
flags = 4
}, {
cb = 0x4d4dc0 <ucp_eager_only_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x4d5ee0 <ucp_eager_first_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x4d7230 <ucp_eager_middle_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x53cd60 <uct_iface_stub_am_handler>,
arg = 0x5,
flags = 4
}, {
cb = 0x4d5650 <ucp_eager_sync_only_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x4d6880 <ucp_eager_sync_first_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x4d4460 <ucp_eager_sync_ack_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x4b9f20 <ucp_rndv_rts_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x4bfe40 <ucp_rndv_ats_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x4c3c10 <ucp_rndv_rtr_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x4bf150 <ucp_rndv_data_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x53cd60 <uct_iface_stub_am_handler>,
arg = 0xd,
flags = 4
--Type <RET> for more, q to quit, c to continue without paging--
}, {
cb = 0x4d8320 <ucp_eager_offload_sync_ack_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x53cd60 <uct_iface_stub_am_handler>,
arg = 0xf,
flags = 4
}, {
cb = 0x4bc5f0 <ucp_rndv_atp_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x4a0b30 <ucp_put_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x49e810 <ucp_get_req_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x49fe80 <ucp_get_rep_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x53cd60 <uct_iface_stub_am_handler>,
arg = 0x14,
flags = 4
}, {
cb = 0x53cd60 <uct_iface_stub_am_handler>,
arg = 0x15,
flags = 4
}, {
cb = 0x49eb50 <ucp_rma_cmpl_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x446300 <ucp_am_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x445530 <ucp_am_long_first_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x4492a0 <ucp_am_long_middle_handler>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x448b90 <ucp_am_handler_reply>,
arg = 0x305a000,
flags = 0
}, {
cb = 0x53cd60 <uct_iface_stub_am_handler>,
arg = 0x1b,
flags = 4
}, {
cb = 0x53cd60 <uct_iface_stub_am_handler>,
--Type <RET> for more, q to quit, c to continue without paging--
arg = 0x1c,
flags = 4
}, {
cb = 0x53cd60 <uct_iface_stub_am_handler>,
arg = 0x1d,
flags = 4
}, {
cb = 0x53cd60 <uct_iface_stub_am_handler>,
arg = 0x1e,
flags = 4
}, {
cb = 0x53cd60 <uct_iface_stub_am_handler>,
arg = 0x1f,
flags = 4
}},
am_tracer = 0x46cd40 <ucp_worker_am_tracer>,
am_tracer_arg = 0x305a000,
err_handler = 0x477b70 <ucp_worker_iface_error_handler>,
err_handler_arg = 0x305a000,
err_handler_flags = 4,
prog = {
id = 1,
refcount = 0
},
progress_flags = 3,
config = {
num_alloc_methods = 3,
alloc_methods = {UCT_ALLOC_METHOD_MD, UCT_ALLOC_METHOD_MMAP, UCT_ALLOC_METHOD_HEAP, UCT_ALLOC_METHOD_THP, UCT_ALLOC_METHOD_THP},
failure_level = UCS_LOG_LEVEL_DIAG,
max_num_eps = 18446744073709551615
}
},
config = {
bandwidth = 16106127360
}
},
recv_fifo_mem = {
address = 0x7f2d9b214000,
length = 36864,
method = UCT_ALLOC_METHOD_MD,
mem_type = UCS_MEMORY_TYPE_HOST,
md = 0x25f7280,
memh = 0x25f6680
},
recv_fifo_ctl = 0x7f2d9b214000,
recv_fifo_elems = 0x7f2d9b2140c0,
read_index_elem = 0x7f2d9b2172c0,
read_index = 239204,
fifo_shift = 8 '\b',
fifo_mask = 255,
fifo_release_factor_mask = 127,
fifo_poll_count = 1,
fifo_prev_wnd_cons = 0,
recv_desc_mp = {
freelist = 0x7f2d9a588750,
data = 0x261b420
},
last_recv_desc = 0x7f2d9a4c6018,
--Type <RET> for more, q to quit, c to continue without paging--
signal_fd = 53,
rx_headroom = 80,
arbiter = {
list = {
prev = 0x2655790,
next = 0x2655790
}
},
release_desc = {
cb = 0x540df0 <uct_mm_iface_release_desc>
},
config = {
fifo_size = 256,
fifo_elem_size = 128,
seg_size = 8256,
fifo_max_poll = 16,
extra_cap_flags = 274877906944,
overhead = {
send = {
am_short = 1e-08,
am_bcopy = 1e-08
},
recv = {
am_short = 1e-08,
am_bcopy = 1e-08
}
}
}
}
(gdb)
(gdb) p iface->config
$50 = {
fifo_size = 256,
fifo_elem_size = 128,
seg_size = 8256,
fifo_max_poll = 16,
extra_cap_flags = 274877906944,
overhead = {
send = {
am_short = 1e-08,
am_bcopy = 1e-08
},
recv = {
am_short = 1e-08,
am_bcopy = 1e-08
}
}
}
(gdb) p payload_offset
No symbol "payload_offset" in current context.
(gdb) p iface->rx_headroom
$51 = 80
(gdb) p iface->recv_fifo_mem
$52 = {
address = 0x7f2d9b214000,
length = 36864,
method = UCT_ALLOC_METHOD_MD,
mem_type = UCS_MEMORY_TYPE_HOST,
md = 0x25f7280,
memh = 0x25f6680
}
(gdb) p iface->recv_fifo_mem.memh
$53 = (uct_mem_h) 0x25f6680
(gdb) p (uct_mm_seg_t*)iface->recv_fifo_mem.memh
$54 = (uct_mm_seg_t *) 0x25f6680
(gdb) p *(uct_mm_seg_t*)iface->recv_fifo_mem.memh
$55 = {
seg_id = 9732101,
address = 0x7f2d9b214000,
length = 36864
}
(gdb)
[root@node223 xrpc]# ipcs -m
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 9732099 root 660 36864 1 dest
0x00000000 9732100 root 660 6291456 1 dest
0x00000000 9732101 root 660 36864 1 dest
0x00000000 9732102 root 660 6291456 1 dest
0x00000000 9732103 root 660 36864 1 dest
0x00000000 9732104 root 660 6291456 1 dest
0x00000000 9732105 root 660 36864 1 dest
0x00000000 9732106 root 660 6291456 1 dest
0x00000000 9732107 root 660 36864 1 dest
0x00000000 9732108 root 660 6291456 1 dest
0x00000000 9732109 root 660 36864 1 dest
0x00000000 9732110 root 660 6291456 1 dest
0x00000000 9732111 root 660 36864 1 dest
0x00000000 9732112 root 660 6291456 1 dest
0x00000000 9732113 root 660 36864 1 dest
0x00000000 9732114 root 660 6291456 1 dest
Hi @yosefe, I've got some new clues.
I printed the server's FIFO data with gdb and found that uct_mm_iface_fifo_has_new_data evaluates to false, which means subsequent new requests entering the queue can never be processed.
(gdb) p iface->read_index_elem[0]
$23 = {
flags = 1 '\001',
am_id = 26 '\032',
length = 4193,
desc = {
seg_id = 9732102,
seg_size = 6291456,
offset = 5265344
},
desc_data = 0x7f2d9a5057c0
}
(gdb) p iface->read_index
$26 = 239204
static UCS_F_ALWAYS_INLINE int
uct_mm_iface_fifo_has_new_data(uct_mm_iface_t *iface)
{
// ((iface->read_index >> iface->fifo_shift) & 1) => (239204 >> 8) & 1 => 0
// (iface->read_index_elem->flags & 1) => 1 & 1 => 1
// return false
return (((iface->read_index >> iface->fifo_shift) & 1) ==
(iface->read_index_elem->flags & 1));
}
Hi @yosefe
I think I have found the root cause of the problem.
https://github.com/openucx/ucx/blob/9ce35d0545b9298d579c50deb8763f2346d0f577/src/uct/sm/mm/base/mm_ep.c#L375-L378
If the client crashes before reaching line 378, the statement elem->flags = elem_flags; is never executed, so the element is left in the FIFO as an uncommitted request.
Therefore, even as a new client keeps submitting elements, the server never gets past the stuck element and does not respond.
Should UCX add a check that the process owning the element no longer exists, so that the element can be skipped? Or is there some other solution?
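To make the failure mode concrete, here is a simplified model of the sender side of such a shared-memory FIFO (illustrative only, not the actual UCX code):

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Simplified model: an element becomes visible to the receiver only when the
 * sender's final store to 'flags' flips the ownership bit for this lap. */
typedef struct {
    _Atomic uint64_t head;          /* next slot to be claimed by senders */
} fifo_ctl_t;

typedef struct {
    volatile uint8_t flags;         /* low bit: published/owner toggle */
    uint8_t          am_id;
    uint16_t         length;
    char             data[64];
} fifo_elem_t;

static void fifo_send(fifo_ctl_t *ctl, fifo_elem_t *elems, unsigned mask,
                      unsigned shift, uint8_t am_id,
                      const void *buf, uint16_t len)
{
    uint64_t     head = atomic_fetch_add(&ctl->head, 1);   /* slot claimed */
    fifo_elem_t *elem = &elems[head & mask];

    elem->am_id  = am_id;
    elem->length = len;
    memcpy(elem->data, buf, len);
    /* If the sender dies right here, the slot is claimed but never
     * published: the receiver's has-new-data check (compare the lap parity
     * of read_index with elem->flags) stays false forever, and every later
     * element behind this one is stuck. */
    elem->flags  = (uint8_t)((head >> shift) & 1);          /* publish */
}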
I can reproduce the problem perfectly with fault injection.
- start server
UCX_TCP_CM_REUSEADDR=y UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y ./examples/.libs/ucp_client_server -c am -s 4096
- start client with gdb
UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y /opt/rh/gcc-toolset-12/root/bin/gdb --args ./examples/.libs/ucp_client_server -a 10.16.153.78 -c am -s 4096
(gdb) b mm_ep.c:378 if elem->am_id == 23
No source file named mm_ep.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mm_ep.c:367 if elem->am_id == 23) pending.
(gdb) r
Thread 1 "ucp_client_serv" hit Breakpoint 1, uct_mm_ep_am_common_send (send_op=UCT_MM_SEND_AM_BCOPY, ep=0x525bb0, iface=0x41fd60, am_id=23 '\027', length=4104, header=0, payload=0x0, pack_cb=0x7ffff7d4b313 <ucp_am_eager_single_bcopy_pack>, arg=0x43b080, iov=0x0, iovcnt=0, flags=0) at sm/mm/base/mm_ep.c:378
378 elem->flags = elem_flags;
(gdb) quit
A debugging session is active.
Inferior 1 [process 2553049] will be killed.
Quit anyway? (y or n) y
- start client again without gdb
UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y ./examples/.libs/ucp_client_server -a 10.16.153.78 -c am -s 4096
[root@10-16-153-78 ucx-1.18.0]# UCX_MM_ERROR_HANDLING=y UCX_PROTO_INFO=y ./examples/.libs/ucp_client_server -a 10.16.153.78 -c am -s 4096
[1735531853.469238] [10-16-153-78:2553055:0] +--------------------------------+----------------------------------------------------+
[1735531853.469275] [10-16-153-78:2553055:0] | client_server inter-node cfg#0 | active message by ucp_am_send* from host memory |
[1735531853.469278] [10-16-153-78:2553055:0] +--------------------------------+-------------------------------------------------+--+
[1735531853.469281] [10-16-153-78:2553055:0] +--------------------------------+-------------------------------------------------+--+
[1735531853.469314] [10-16-153-78:2553055:0] +--------------------------------+--------------------------------------------------------------------+
[1735531853.469321] [10-16-153-78:2553055:0] | client_server inter-node cfg#0 | active message by ucp_am_send* with reply flag from host memory |
[1735531853.469324] [10-16-153-78:2553055:0] +--------------------------------+-----------------------------------------------------------------+--+
[1735531853.469343] [10-16-153-78:2553055:0] +--------------------------------+-----------------------------------------------------------------+--+
[1735531853.474304] [10-16-153-78:2553055:0] +--------------------------------+----------------------------------------------------+
[1735531853.474333] [10-16-153-78:2553055:0] | client_server intra-node cfg#1 | active message by ucp_am_send* from host memory |
[1735531853.474337] [10-16-153-78:2553055:0] +--------------------------------+-----------------------------------------+----------+
[1735531853.474340] [10-16-153-78:2553055:0] | 0..8184 | short | tcp/ens4 |
[1735531853.474343] [10-16-153-78:2553055:0] | 8185..inf | multi-frag copy-in | tcp/ens4 |
[1735531853.474346] [10-16-153-78:2553055:0] +--------------------------------+-----------------------------------------+----------+
[1735531853.475635] [10-16-153-78:2553055:0] +--------------------------------+---------------------------------------------------------------------+
[1735531853.475656] [10-16-153-78:2553055:0] | client_server intra-node cfg#1 | active message by ucp_am_send*(fast-completion) from host memory |
[1735531853.475659] [10-16-153-78:2553055:0] +--------------------------------+----------------------------------------------------------+----------+
[1735531853.475663] [10-16-153-78:2553055:0] | 0..8184 | short | tcp/ens4 |
[1735531853.475666] [10-16-153-78:2553055:0] | 8185..262143 | multi-frag copy-in | tcp/ens4 |
[1735531853.475681] [10-16-153-78:2553055:0] | 256K..inf | (?) rendezvous fragmented copy-in copy-out | tcp/ens4 |
[1735531853.475685] [10-16-153-78:2553055:0] +--------------------------------+----------------------------------------------------------+----------+
[1735531853.478350] [10-16-153-78:2553055:0] +--------------------------------+-----------------------------------------------------------+
[1735531853.478376] [10-16-153-78:2553055:0] | client_server intra-node cfg#1 | active message by ucp_am_send*(multi) from host memory |
[1735531853.478380] [10-16-153-78:2553055:0] +--------------------------------+------------------------------------------------+----------+
[1735531853.478382] [10-16-153-78:2553055:0] | 0..8184 | short | tcp/ens4 |
[1735531853.478385] [10-16-153-78:2553055:0] | 8185..65528 | zero-copy | tcp/ens4 |
[1735531853.478388] [10-16-153-78:2553055:0] | 65529..inf | multi-frag copy-in | tcp/ens4 |
[1735531853.478391] [10-16-153-78:2553055:0] +--------------------------------+------------------------------------------------+----------+
[1735531853.479714] [10-16-153-78:2553055:0] +--------------------------------+--------------------------------------------------------------------+
[1735531853.479744] [10-16-153-78:2553055:0] | client_server intra-node cfg#1 | active message by ucp_am_send* with reply flag from host memory |
[1735531853.479747] [10-16-153-78:2553055:0] +--------------------------------+---------------------------------------------------------+----------+
[1735531853.479750] [10-16-153-78:2553055:0] | 0..8176 | short | tcp/ens4 |
[1735531853.479753] [10-16-153-78:2553055:0] | 8177..inf | multi-frag copy-in | tcp/ens4 |
[1735531853.479759] [10-16-153-78:2553055:0] +--------------------------------+---------------------------------------------------------+----------+
[1735531853.481060] [10-16-153-78:2553055:0] +--------------------------------+-------------------------------------------------------------------------------------+
[1735531853.481081] [10-16-153-78:2553055:0] | client_server intra-node cfg#1 | active message by ucp_am_send* with reply flag(fast-completion) from host memory |
[1735531853.481084] [10-16-153-78:2553055:0] +--------------------------------+--------------------------------------------------------------------------+----------+
[1735531853.481087] [10-16-153-78:2553055:0] | 0..8176 | short | tcp/ens4 |
[1735531853.481094] [10-16-153-78:2553055:0] | 8177..262143 | multi-frag copy-in | tcp/ens4 |
[1735531853.481103] [10-16-153-78:2553055:0] | 256K..inf | (?) rendezvous fragmented copy-in copy-out | tcp/ens4 |
[1735531853.481110] [10-16-153-78:2553055:0] +--------------------------------+--------------------------------------------------------------------------+----------+
[1735531853.482531] [10-16-153-78:2553055:0] +--------------------------------+---------------------------------------------------------------------------+
[1735531853.482553] [10-16-153-78:2553055:0] | client_server intra-node cfg#1 | active message by ucp_am_send* with reply flag(multi) from host memory |
[1735531853.482556] [10-16-153-78:2553055:0] +--------------------------------+----------------------------------------------------------------+----------+
[1735531853.482559] [10-16-153-78:2553055:0] | 0..594 | short | tcp/ens4 |
[1735531853.482564] [10-16-153-78:2553055:0] | 595..inf | multi-frag copy-in | tcp/ens4 |
[1735531853.482579] [10-16-153-78:2553055:0] +--------------------------------+----------------------------------------------------------------+----------+
Hang!
@ivanallen Currently the shared memory transport does not support detecting a remote process crash; setting UCX_MM_ERROR_HANDLING=y merely disregards this limitation and allows the shared memory transport to be selected anyway. I am not sure it would be enough to add a check before "elem->flags = elem_flags", since there are other places in the code that may suffer from a similar problem. Also, adding such a check in a naïve way may add performance overhead.
Hi @yosefe Thank you for your reply.
If that's the case, it seems the shared memory transport cannot be used in a production environment: the server has no way to self-heal other than being restarted.
I still hope UCX can resolve this issue, although it seems quite challenging at the moment.
We have related experience in production: maintain a separate FIFO queue for each producer (an SPSC FIFO), and if a producer fails, the server removes the FIFO associated with it. The downside is that you may need to maintain a large number of FIFO queues, but that has not been a problem in our environment so far.
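A rough sketch of that idea (illustrative only, not our production code): the server keeps one SPSC ring per client process and reaps rings whose owner has died.

#include <signal.h>
#include <errno.h>
#include <sys/types.h>

typedef struct spsc_ring spsc_ring_t;   /* an ordinary single-producer ring */

typedef struct {
    pid_t        pid;     /* owning client process */
    spsc_ring_t *ring;    /* that client's private FIFO */
    int          in_use;
} producer_slot_t;

/* Called periodically or when a ring looks stuck: a dead producer's ring is
 * torn down, so one crashed client can never stall the others. */
static void reap_dead_producers(producer_slot_t *slots, int nslots)
{
    for (int i = 0; i < nslots; i++) {
        if (slots[i].in_use &&
            kill(slots[i].pid, 0) == -1 && errno == ESRCH) {
            /* producer is gone: destroy slots[i].ring and free the slot */
            slots[i].in_use = 0;
        }
    }
}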
@ivanallen Hi, I'm also interested in RPC over UCX; I'm actually working on a similar project. Is your project based on P2300 execution? Maybe we could communicate more.
Hi @MoFHeka We didn't use P2300 execution.
Our current approach is that the user provides the executor; the RPC framework does not. The RPC layer only exposes progress interfaces, similar to UCX. We currently have an execution shell based on Facebook's folly library that is easy to use and can be integrated with folly coroutines, but users can also plug in their own execution, such as an SPDK-based wrapper or other approaches.