liburing
send-zerocopy and linked timeout: how to deal with the buffer on timeout?
We are preparing to upgrade our diskless server to use send-zerocopy.
A send SQE is linked with a timeout SQE to detect network problems; the linked timeout returns ETIME when something goes wrong.
At that point, how should we deal with the buffer passed in by ...sendzc()?
Can it just be released as in a regular socket program, or is it still locked (maybe?) somewhere in the kernel stack?
> prepare to upgrade our diskless server to use send-zerocopy

That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)?
Heads up: there is some discussion going on about the API, and the final version may still change (in the direction of simplification).
> send sqe linked with a timeout sqe to detect network problem, linked-timeout will return ETIME when something happen, at this point, how to deal the buffer passed in by
> ...sendzc()? can it just be released as regular socket program? or is it still locked(maybe?) by somewhere in kernel stack ?

close() is often not of much help, but let's say you call shutdown(SHUT_RDWR); then it's reasonable to assume that the socket will release buffers and post a notification in a bounded amount of time (that's considering that you flushed notifications and so on). As for whether you can reuse buffers immediately after shutdown() without waiting for notification CQEs, that's not currently supported, i.e. you'd need to wait for the CQEs, but that's a good question for me to think about later.
> That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)? heads up, there is some discussion going on the API and the final version still may change (in the direction of simplification).

It is over TCP; the payload size is requested by the client side, from 512B to 256KB. Already started upgrading the code; sendzc gives a noticeable reduction in CPU usage on my server.
> Close() is often is not of much help, but let's say you call
> shutdown(SHUT_RDWR)...

Will try to simulate this after the code is upgraded.
another issue found: because splice consumes more CPU, and sendzc looks to resolve this, I did a re-design to use fixed buffers. The linked SQEs are:
1) read_fixed(data)
2) sendzc(head + data)
3) link_timeout
when testing with a solid image file and a raw hard drive, SQE 2) returned ECANCELED after processing about 1k~2k requests. Already checked that 1) returned the right length, and it happened with or without 3). It happens even when there is only one client connected. Found in kernels: 5.15.63, 5.19.5, 6.0-rc3. EDIT: SQE 2) has the problem when using send() instead of sendzc(), and it occurred with non-fixed buffers also.
BTW, I know sendfile() was not welcomed, but after comparing with a thread-mode server, sendfile() has the lowest CPU usage; in user mode there is no need to care about a pipe, and it is useful for many kinds of use cases (e.g. an HTTP server).
So is there any possibility of adding this or something like it?
kernel....: 5.19.5
liburing..: 2.3-git
tested with READ and READ_FIXED, it's the same.
after changing IOSQE_IO_LINK to IOSQE_IO_HARDLINK, this does NOT happen.
Each ECANCELED causes the following 2 SQEs to fail: SEND and TIMEOUT.
[DEVEL-10:~/.../kernel/dell]# perf trace -e io_uring:io_uring_fail_link --call-graph fp
0.000 iou-0/24077 io_uring:io_uring_fail_link(ctx: 0xffff8883a103c800, req: 0xffff88810ccef600, user_data: 140370983686400, opcode: 22, link: 0xffff88810cceff00, op_str: "READ")
io_disarm_next ([kernel.kallsyms])
io_disarm_next ([kernel.kallsyms])
io_free_batch_list ([kernel.kallsyms])
__io_submit_flush_completions ([kernel.kallsyms])
tctx_task_work ([kernel.kallsyms])
task_work_run ([kernel.kallsyms])
__do_sys_io_uring_enter ([kernel.kallsyms])
do_syscall_64 ([kernel.kallsyms])
entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
DEBUGEND_$NZ_IORING (/root/work/devel/2/common/tools/benckmark/idrive6)
[0] ([unknown])
[0] ([unknown])
0.008 iou-0/24077 io_uring:io_uring_fail_link(ctx: 0xffff8883a103c800, req: 0xffff88810ccef600, user_data: 140370983686400, opcode: 22, link: 0xffff88810ccee400, op_str: "READ")
io_disarm_next ([kernel.kallsyms])
io_disarm_next ([kernel.kallsyms])
io_free_batch_list ([kernel.kallsyms])
__io_submit_flush_completions ([kernel.kallsyms])
tctx_task_work ([kernel.kallsyms])
task_work_run ([kernel.kallsyms])
__do_sys_io_uring_enter ([kernel.kallsyms])
do_syscall_64 ([kernel.kallsyms])
entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
DEBUGEND_$NZ_IORING (/root/work/devel/2/common/tools/benckmark/idrive6)
[0] ([unknown])
[0] ([unknown])
found that the notification slot was removed from both kernel and liburing. After some checking, up to 42ee2bd there is no flag to check whether send_zc is supported by the currently running kernel. Is there any plan to add a flag, or should we just check kernel version >= 6.0?
Never rely on a kernel version check. Either just try the command and see if it fails, or use io_uring_get_probe() and check if that opcode is marked as supported.
> found notification slot was removed from both kernel and liburing after some checking until 42ee2bd, there is no any flag to check send_zc was support by current running kernel. is there any planning to add some flag or just check the kernel version >= 6.0 ?
Agree with Jens on the testing side.
The API should be easier now. If it sends anything, io_uring posts 2 CQEs. The first one is the usual request completion with IORING_CQE_F_MORE set, and the second one is the "buffer-free notification" with the same cqe->user_data and IORING_CQE_F_NOTIF. If it fails (i.e. cqe->res < 0), there will be only one CQE with no IORING_CQE_F_MORE. So, you can look at IORING_CQE_F_MORE to figure out whether there will be one or two CQEs. I'll write a man page, I think this week.
> That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)? heads up, there is some discussion going on the API and the final version still may change (in the direction of simplification).
> It is over TCP, payload size was requested by client side, from 512B to 256KB Already start upgrading codes, sendzc has a noticeable reduce of CPU usage on my server.
Nice!
> BTW, I know sendfile() was not welcomed, after compared with a thread-mode server, sendfile() has the lowest cpu usage, in user-mode it is no need to care about a pipe, and it is useful to many kinds of use case (ex: http server). so is there any possibility to add this or something like?
I don't have a strong opinion on that one; it could be implemented, but it would definitely require more changes than just adding a sendfile io_uring opcode.
> another issue found: because splice consume higher CPU, and sendzc looks to resolve this, I do a re-design to use fixed-buffer, the linked sqes is:
Fixed buffers are faster, but fwiw you can also use normal buffers with sendzc.
> 1) read_fixed(data) 2) sendzc(head + data) 3) link_timeout
Just a general thought on performance, which might not matter for your case but still: linked timeouts are not the fastest thing, so I'd experiment and see if there is a noticeable speedup from not having them. One idea for optimising it is to have a per-ring timer / timeout request (or probably a couple of them), and if/when they fire, you issue a bunch of IORING_OP_ASYNC_CANCELs.
> when testing with a solid image file and a raw hard drive, sqe 2) returned ECANCELED after processed about 1k~2k requests already checked 1) has right returned length, and happend with or without 3) it happens even there is only one client connected found in kernel: 5.15.63, 5.19.5, 6.0-rc3 EDIT: sqe 2) has problem when using send() instead of sendzc(), and occured with non fixed-buffer also.
Not entirely sure I get it, what do you mean by "happend with or without 3)"? Or maybe you have a simple reproducer?
It should always post 3 CQEs per such link unless you use IORING_FEAT_CQE_SKIP.
changed to io_uring_get_probe() to check sendzc support, thanks to Jens.
will test the new API with 6.0-rc4 this week, and will check whether there is any performance loss with linked-timeout.
--
for the abnormal ECANCELED, uploaded the testing program.
> Not entirely sure I get it, what do you mean by "happend with or without 3)"? Or maybe you have a simple reproducer?
this happened when submitting 2 SQEs (read/send) or 3 SQEs (read/send/timeout). Output of the testing server:
[DEVEL-10:~/.../uring/tests]# make s && ./server /dev/sda
gcc -g server.c -o server -l uring
++ image size = 250059350016
STARTING IO_URING....
++ client-accepted: 6
!! unexpected ECANCELED at 1 / 12413 !!
resp-data: err=-125
in server.c, sending_timeout is set to 1 to include the timeout, 0 to exclude it. Tested on 5.19.5.
> The first one is the usual request completion with IORING_CQE_F_MORE set

in this one cqe.res carries the right number of bytes sent

> and the second one is the "buffer-free notification" with the same cqe->user_data and IORING_CQE_F_NOTIF

in this one cqe.res is 0, so user space must keep the sent count from the first one to handle protocol data.
is this the designed behavior?
> The first one is the usual request completion with IORING_CQE_F_MORE set
> in this one cqe.res carry out the right number of sent
> and the second one is the "buffer-free notification" with the same cqe->user_data and IORING_CQE_F_NOTIF
> in this one cqe.res is 0,
> is this the designed behavior?
Yes, that's by design; it can be changed if there is a good reason for that.
> so the use space must kept the sent number from first one to handle protocol data.
Not sure why you'd need the result at that point. The user should not modify data in the buffer until it gets the notification, but it's still allowed to read it, including with syscalls like write / send. The usual scenario on getting a notification is the userspace freeing / reusing the buffer, and I expect the user knows everything it needs about the buffer, like its length.
oh... sorry for not describing it clearly. What I want to say is the "protocol handling": in one request-response,

recv-protocol-head
if (cqe.res == head_length)
send-head-and-data
if (cqe.res == required_length)
next_loop

after changing send-head-and-data to zerocopy, the cqe.res from the F_NOTIF one is zero and cannot be used to check the proper amount sent, so saving the cqe.res from the F_MORE one is required.
after more consideration, this is the best way to handle sendzc for now. +1.
> for the abnormal ECANCELED, uploaded the testing program. this happened when submit 2 SQEs (read/send) or 3 SQEs (read/send/timeout). output of testing server:
> [DEVEL-10:~/.../uring/tests]# make s && ./server /dev/sda
> gcc -g server.c -o server -l uring
> ++ image size = 250059350016
> STARTING IO_URING....
> ++ client-accepted: 6
> !! unexpected ECANCELED at 1 / 12413 !!
> resp-data: err=-125
> in server.c, sending_timeout set to 1 to include timeout, 0 excluded. tested on 5.19.5.
Tried it out but doesn't reproduce for me, and I don't immediately see anything wrong.
BTW, did you use O_DIRECT on the block device file when you were benchmarking zc send vs sendfile?
> Tried it out but doesn't reproduce for me, and I don't immediately see anything wrong.
my sda is a USB SSD, and I re-tested with an NVMe disk on 6.0-rc4; it's the same. oh, all these tests are on a raw disk WITHOUT any mount point on it.
my system is Slackware 15.0 with a customized kernel; might it be related?
> BTW, did you use O_DIRECT on the block device file when you was benchmarking zc send vs sendfile?
no, I never open any image or disk with O_DIRECT.
> Tried it out but doesn't reproduce for me, and I don't immediately see anything wrong.
> my sda is a USB-SSD disk, and re-tested with a nvme disk on 6.0-rc4, it's the same. oh, all this test is on a raw-disk WITHOUT any mount point on it.
> my system is Slackware 15.0 with customized kernel, might it related?
Finally caught the bug and see where it came from. I'd like to add your Reported-by tag in the fix commit so it's in the history that you found the bug and helped the community, but I'd need your full name and email. It looks like:
Reported-by: Pavel Begunkov <[email protected]>
very happy to help the community. My Name: Beld Zhang, E-Mail: [email protected]
thanks to all for the hard work!
Thanks, I've updated the commit and also included a link to this issue.
> Thanks, I've updated the commit and also included a link to this issue.
patched 4d9cb92 onto 6.0-rc4; my testing is ok now.
next will check/test the error handling of this issue, to begin with.
@beldzhang, does it go right? Looks like you've deleted a message. Anyway, let me drop a couple of notes here:
- It'll be slower for small per-request payloads. So, for TCP, if you're sending 1MB per request, you have nothing to worry about, but at 4KB it might be slower. If the payload size is not fixed, I'd recommend mixing normal and zc sends depending on payload size. I'm not sure where the tipping point is, but you can try to find it experimentally.
- I have a branch with zc sendmsg and so iovs; will be sending it for 6.1. Tell me if it'd help you to increase the payload size.
- If you have problems, send me perf profiles, one for the setup with normal sends and one for zc sends. I'll take a look.
- Why I asked about O_DIRECT for reads is because without it there is an extra copy into the page cache, which is not great. Just fwiw.
- Do you utilise 100% of a CPU / core? You may want to pin the process to one core.
It might also be a latency problem. Don't forget that the second CQE will likely come with a good delay after the first one, and you absolutely don't want to wait for it before sending more I/O. In other words, don't strictly keep:
1. submit send
2. wait for the send to complete
3. wait for the buffer to complete
4. goto 1)
but rather more like:
1. submit send
2. wait for the send to complete
3. goto 1
4. (step 4.5, asynchronous) wait for the buffer to complete
> It might also be a latency problem...
yes, it is the latency... I posted that message too rushed, before finding it out in my hardware environment. I am using a USB Ethernet adapter that has a large tx timer; the previous testing was not affected, but sendzc is.
> In other words, don't strictly keep: 1. submit send, 2. wait send complete, 3. wait for buffer complete, 4. goto 1)
sadly the current code is like this; 3) spends a long time waiting and causes the speed drop. Will try to use the second model in the next devel cycle.
> It'll be slower for small per-request payloads...
> Why I asked about O_DIRECT for reads...

will test next week.
already tested error handling and it is all as expected. Looks like we can close this issue successfully soon.
thanks a lot for the notes and advice.
1. error handling passes
2. sendzc has lower cpu usage even when synchronously waiting for buffer completion
3. async waiting for buffer completion is postponed because a buffer-management redesign is required
4. another short recv found in 5.15.67, causing a recv-write link break; not found in 5.19 and 6.0-rc. In ...prep_recv() we set flag MSG_WAITALL but still got a partial data recv, and then the following write SQE is canceled. Is it known? Any plan to fix it or not?
5. in liburing.h, ...prep_sendzc_fixed was removed; is it possible to add it back?
> Finally caught the bug and see where it came from

the unexpected ECANCELED happened on the testing server with 2 CPUs, 256KB random reads linked with send.
nvme0n1 has no mountpoint.
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# uname -r
6.0.0-rc5-64-6
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
++ image size = 2048408248320
STARTING IO_URING....
++ client-accepted: 6
!! unexpected ECANCELED at 1 / 51 !!
resp-data: err=-125
testing program attached: issue-643-v2.zip; it also includes cli_write.c, for
> another short recv found in 5.15.67, cause a recv-write link broken, not found in 5.19 and 6.0-rc
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# uname -r
5.15.64-64-6
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
++ image size = 2048408248320
STARTING IO_URING....
++ client-accepted: 6
!! partial recv data: 65450, require: 262144
!! partial recv data: 163705, require: 196694
!! partial recv data: 32741, require: 32989
!! partial recv data: 229344, require: 262144
...
update: patched and tested 62bb0647; hope it is useful.
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
++ image size = 2048408248320
STARTING IO_URING....
++ client-accepted: 6
!! unexpected ECANCELED at 1 / 16372 !!
resp-data: err=-125
[nzel-server:~]# perf trace -e io_uring:io_uring_fail_link --call-graph fp
0.000 server/1620 io_uring:io_uring_fail_link(ctx: 0xffff888111027800, req: 0xffff888104098e00, user_data: 3, opcode: 22, link: 0xffff888104072400, op_str: "READ")
io_disarm_next ([kernel.kallsyms])
io_disarm_next ([kernel.kallsyms])
io_free_batch_list ([kernel.kallsyms])
__io_submit_flush_completions ([kernel.kallsyms])
ctx_flush_and_put ([kernel.kallsyms]) <-- this is different from the previous trace
tctx_task_work ([kernel.kallsyms])
task_work_run ([kernel.kallsyms])
io_run_task_work_sig ([kernel.kallsyms])
__do_sys_io_uring_enter ([kernel.kallsyms])
do_syscall_64 ([kernel.kallsyms])
entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
syscall (/lib64/libc-2.33.so)
[0] ([unknown])
[0] ([unknown])
0.018 server/1620 io_uring:io_uring_fail_link(ctx: 0xffff888111027800, req: 0xffff888104098e00, user_data: 3, opcode: 22, link: 0xffff888104098a00, op_str: "READ")
io_disarm_next ([kernel.kallsyms])
io_disarm_next ([kernel.kallsyms])
io_free_batch_list ([kernel.kallsyms])
__io_submit_flush_completions ([kernel.kallsyms])
ctx_flush_and_put ([kernel.kallsyms])
tctx_task_work ([kernel.kallsyms])
task_work_run ([kernel.kallsyms])
io_run_task_work_sig ([kernel.kallsyms])
__do_sys_io_uring_enter ([kernel.kallsyms])
do_syscall_64 ([kernel.kallsyms])
entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
syscall (/lib64/libc-2.33.so)
[0] ([unknown])
[0] ([unknown])
finally found it:
in io_read(), if there is no retry, currently only the last ret of io_iter_do_read is passed to kiocb_done, which then fails in __io_complete_rw_common() because ret != cqe.res;
it should be the total amount that all the io_iter_do_read calls got, and io->bytes_done should only be increased when there is a retry.
I tried to fix it but could not get it done (don't know how to deal with the retry).
added some logging and found the following:
a. not understanding why cqe.res gets smaller
b. when res in __io_complete_rw_common is shortened, this req is marked failed, and final_ret is not the correct value either.
[nzel-server:~/uring-test]# ./server /dev/nvme0n1
++ image size = 2048408248320
STARTING IO_URING....
++ client-accepted: 6
!! unexpected ECANCELED at 1 / 16 !!
resp-data: err=-125
[ 239.661253] ++io_read: cqe.res(262144)
[ 239.700637] ++io_read: cqe.res(262144)
[ 239.700806] ++__io_complete_rw_common: res=262144, cqe.res=262144
[ 239.701070] ++io_read: cqe.res(262144)
[ 239.701320] ++io_read: cqe.res(262144)
[ 239.701431] ++io_read: cqe.res( 3072)
[ 239.701433] ++__io_complete_rw_common: res= 3072, cqe.res= 3072
[ 239.701586] ++io_read: cqe.res(262144)
[ 239.701848] ++io_read: cqe.res(262144)
[ 239.701952] ++__io_complete_rw_common: res=262144, cqe.res=262144
[ 239.702135] ++io_read: cqe.res(262144)
[ 239.702323] ++io_read: cqe.res(262144)
[ 239.702436] ++io_read: cqe.res( 1536)
[ 239.702439] ++__io_complete_rw_common: res= 1536, cqe.res= 1536
[ 239.702565] ++io_read: cqe.res(262144)
[ 239.702788] ++io_read: cqe.res(262144)
[ 239.702897] ++__io_complete_rw_common: res=262144, cqe.res=262144
[ 239.703037] ++io_read: cqe.res(262144)
[ 239.703297] ++io_read: cqe.res(262144)
[ 239.703395] ++__io_complete_rw_common: res=262144, cqe.res=262144
[ 239.703551] ++io_read: cqe.res(262144)
[ 239.703768] ++io_read: cqe.res(262144)
[ 239.703878] ++io_read: cqe.res( 1024)
[ 239.703880] ++__io_complete_rw_common: res= 1024, cqe.res= 1024
[ 239.703993] ++io_read: cqe.res(262144)
[ 239.704192] ++io_read: cqe.res(262144)
[ 239.704312] ++io_read: cqe.res( 2048)
[ 239.704315] ++__io_complete_rw_common: res= 2048, cqe.res= 2048
[ 239.704426] ++io_read: cqe.res(262144)
[ 239.704623] ++io_read: cqe.res(262144)
[ 239.704752] ++__io_complete_rw_common: res=262144, cqe.res=262144
[ 239.704867] ++io_read: cqe.res(262144)
[ 239.705065] ++io_read: cqe.res(262144)
[ 239.705181] ++io_read: cqe.res( 2560)
[ 239.705183] ++__io_complete_rw_common: res= 2560, cqe.res= 2560
[ 239.705291] ++io_read: cqe.res(262144)
[ 239.705476] ++io_read: cqe.res(262144)
[ 239.705591] ++io_read: cqe.res( 1024)
[ 239.705593] ++__io_complete_rw_common: res= 1024, cqe.res= 1024
[ 239.705695] ++io_read: cqe.res(262144)
[ 239.705905] ++io_read: cqe.res(262144)
[ 239.706030] ++io_read: cqe.res( 1536)
[ 239.706033] ++__io_complete_rw_common: res= 1536, cqe.res= 1536
[ 239.706109] ++io_read: cqe.res(262144)
[ 239.706290] ++io_read: cqe.res(262144)
[ 239.706391] ++io_read: cqe.res( 1024)
[ 239.706393] ++__io_complete_rw_common: res= 1024, cqe.res= 1024
[ 239.706472] ++io_read: cqe.res(262144)
[ 239.706648] ++io_read: cqe.res(262144)
[ 239.706747] ++io_read: cqe.res( 1536)
[ 239.706749] ++__io_complete_rw_common: res= 1536, cqe.res= 1536
[ 239.706829] ++io_read: cqe.res(262144)
[ 239.707013] ++io_read: cqe.res(262144)
[ 239.707111] ++io_read: cqe.res( 1536)
[ 239.707113] ++__io_complete_rw_common: res= 1536, cqe.res= 1536
[ 239.707194] ++io_read: cqe.res(262144)
[ 239.707386] ++io_read: cqe.res(262144)
[ 239.707504] ++__io_complete_rw_common: res= 3072, cqe.res=262144
[ 239.707505] !! __io_complete_rw_common(): res( 3072) != cqe.res(262144)
[ 239.707507] ++ req_set_fail: op=22, flags=40200004
[ 239.707508] CPU: 31 PID: 1457 Comm: server Not tainted 6.0.0-rc5-64-6b #21
[ 239.707510] Hardware name: Dell Inc. Precision T5600/000000, BIOS A19 06/30/2019
[ 239.707512] Call Trace:
[ 239.707513] <TASK>
[ 239.707515] dump_stack_lvl+0x34/0x44
[ 239.707518] __io_complete_rw_common+0xac/0xd0
[ 239.707521] kiocb_done.cold+0xc/0x3c
[ 239.707524] io_read.cold+0x10f/0x2cf
[ 239.707526] ? __kmalloc+0x64/0x4b0
[ 239.707529] ? update_load_avg+0x7e/0x740
[ 239.707532] ? newidle_balance+0x2ba/0x3f0
[ 239.707534] ? io_writev_prep_async+0x80/0x80
[ 239.707537] io_issue_sqe+0x60/0x410
[ 239.707540] ? _raw_spin_unlock+0x12/0x30
[ 239.707542] ? finish_task_switch.isra.0+0xa3/0x2b0
[ 239.707545] io_req_task_submit+0x2a/0x70
[ 239.707548] handle_tw_list+0x9a/0xd0
[ 239.707551] tctx_task_work+0x4e/0x120
[ 239.707554] task_work_run+0x59/0x90
[ 239.707557] io_run_task_work_sig+0x43/0x60
[ 239.707559] __do_sys_io_uring_enter+0x2a2/0x8a0
[ 239.707561] ? percpu_ref_get_many+0x30/0x30
[ 239.707564] do_syscall_64+0x3d/0x90
[ 239.707566] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 239.707568] RIP: 0033:0x7f3662bb1c09
[ 239.707571] Code: 48 8d 3d 6a 1f 0d 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2f e2 0c 00 f7 d8 64 89 01 48
[ 239.707573] RSP: 002b:00007fff43f32918 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
[ 239.707575] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3662bb1c09
[ 239.707577] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000005
[ 239.707578] RBP: 0000000000404100 R08: 0000000000000000 R09: 0000000000000008
[ 239.707579] R10: 0000000000000001 R11: 0000000000000216 R12: 00007fff43f32970
[ 239.707580] R13: 00007fff43f329b8 R14: 0000000000000000 R15: 0000000000000000
[ 239.707582] </TASK>
ok, here is a wild-wild-west concept patch that fixes this abnormal ECANCELED, based on 6.0-rc6:
--- rw-orig.c 2022-09-18 16:44:14.000000000 -0400
+++ rw.c 2022-09-18 19:53:58.725235656 -0400
@@ -706,7 +706,7 @@
struct kiocb *kiocb = &rw->kiocb;
bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
struct io_async_rw *io;
- ssize_t ret, ret2;
+ ssize_t ret, ret2, ret_all = 0;
loff_t *ppos;
if (!req_has_async_data(req)) {
@@ -763,6 +763,7 @@
}
ret = io_iter_do_read(rw, &s->iter);
+ if (ret > 0) ret_all += ret;
if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
req->flags &= ~REQ_F_REISSUE;
@@ -814,7 +815,7 @@
iov_iter_advance(&s->iter, ret);
if (!iov_iter_count(&s->iter))
break;
- io->bytes_done += ret;
+// io->bytes_done += ret;
iov_iter_save_state(&s->iter, &s->iter_state);
/* if we can retry, do so with the callbacks armed */
@@ -830,8 +831,11 @@
* here, and if we do, then just retry at the new offset.
*/
ret = io_iter_do_read(rw, &s->iter);
- if (ret == -EIOCBQUEUED)
+ if (ret > 0) ret_all += ret;
+ if (ret == -EIOCBQUEUED) {
+ io->bytes_done += ret_all;
return IOU_ISSUE_SKIP_COMPLETE;
+ }
/* we got some bytes, but not all. retry. */
kiocb->ki_flags &= ~IOCB_WAITQ;
iov_iter_restore(&s->iter, &s->iter_state);
@@ -840,7 +844,7 @@
/* it's faster to check here then delegate to kfree */
if (iovec)
kfree(iovec);
- return kiocb_done(req, ret, issue_flags);
+ return kiocb_done(req, ret_all, issue_flags);
}
int io_write(struct io_kiocb *req, unsigned int issue_flags)
> 2. sendzc have lower cpu usage even in sync waiting for buffer complete
Great
> 3. async waiting of buffer complete is postponed because buffer management redesign required
> 4. another short recv found in 5.15.67, cause a recv-write link broken, not found in 5.19 and 6.0-rc: in `...prep_recv()` set flag `MSG_WAITALL`, but still got partial data recv, then following write sqe be canceled. is it known? any plan to fix or not?
The short read fix hasn't been ported to older kernels. Are you using some stable kernel?
> 5. in liburing.h `...prep_sendzc_fixed` was removed, is possible add back?
Yeah, can add it back
oh, there are two different issues:
> another short recv found in 5.15.67, cause a recv-write link broken, not found in 5.19 and 6.0-rc ...
this one was found recently, and after comparing the source between these versions it looks difficult to backport due to the huge changes, so please ignore it.
> The short read fix hasn't been ported to older kernels.
> the unexpected ECANCELED happened on the testing server with 2 CPU, 256KB random read and linked with send.

the short read is not completely fixed yet; posted some messages starting from here: https://github.com/axboe/liburing/issues/643#issuecomment-1247522867 (testing code already included: issue-643-v2.zip)
in my debugging, found that bytes_done and req->cqe.res lose sync when there are retries ...
> Are you using a some stable kernel?

currently we are running 5.15.x and preparing to upgrade to 6.0 or 6.1.
Ah, I see. For backports, I do have them; I just didn't send them out as I was waiting for another patch that needs to land upstream.
> finally found it: in io_read(), if there is no retry, currently only the last ret of io_iter_do_read is passed to kiocb_done, which then fails in __io_complete_rw_common() because ret != cqe.res; it should be the total amount of all io_iter_do_read got, and only increase io->bytes_done when there is a retry.
cqe.res stores the number of bytes we're currently trying to do, so how about this?
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 76ebcfebc9a6..c562203d7a67 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -823,6 +823,7 @@ int io_read(struct io_kiocb *req, unsigned int issue_flags)
return -EAGAIN;
}
+ req->cqe.res = iov_iter_count(&s->iter);
/*
* Now retry read with the IOCB_WAITQ parts set in the iocb. If
* we get -EIOCBQUEUED, then we'll get a notification when the
I also pushed to a branch with an explanation for convenience.
https://github.com/isilence/linux/commit/9a70c235e440f48b10a8903ef95d2c9bbd7ff0b3
edit: branch https://github.com/isilence/linux.git rw-link-break