
send-zerocopy and linked timeout: how to deal with the buffer on timeout?

Open beldzhang opened this issue 3 years ago • 26 comments

preparing to upgrade our diskless server to use send-zerocopy

a send sqe is linked with a timeout sqe to detect network problems; the linked timeout returns ETIME when something goes wrong. at that point, how should the buffer passed in by ...sendzc() be dealt with? can it just be released as in a regular socket program, or is it still locked (maybe?) somewhere in the kernel stack?

beldzhang avatar Aug 25 '22 04:08 beldzhang

preparing to upgrade our diskless server to use send-zerocopy

That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)?

heads up, there is some discussion going on about the API, and the final version may still change (in the direction of simplification).

a send sqe is linked with a timeout sqe to detect network problems; the linked timeout returns ETIME when something goes wrong. at that point, how should the buffer passed in by ...sendzc() be dealt with? can it just be released as in a regular socket program, or is it still locked (maybe?) somewhere in the kernel stack?

Close() is often not of much help, but let's say you call shutdown(SHUT_RDWR); then it's reasonable to assume that the socket will release buffers and post a notification in a bounded amount of time (that's assuming you flushed notifications and so on). As for whether you can reuse buffers immediately after shutdown() without waiting for notification CQEs, it's not currently supported, i.e. you'd need to wait for the CQEs, but that's a good question for me to think about later.

isilence avatar Aug 29 '22 14:08 isilence

That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)? Heads up, there is some discussion going on about the API, and the final version may still change (in the direction of simplification).

It is over TCP; the payload size is requested by the client side, from 512B to 256KB. already started upgrading the code; sendzc gives a noticeable reduction in CPU usage on my server.

Close() is often is not of much help, but let's say you call shutdown(SHUT_RDWR) ...

Will try to simulate this after code upgraded.

another issue found: because splice consumes more CPU and sendzc looks like it resolves this, I did a re-design to use fixed buffers; the linked sqes are:

1) read_fixed(data)
2) sendzc(head + data)
3) link_timeout

when testing with a solid image file and a raw hard drive, sqe 2) returned ECANCELED after about 1k~2k requests had already been processed. already checked that 1) returned the right length, and it happened with or without 3). it happens even when there is only one client connected. found in kernels: 5.15.63, 5.19.5, 6.0-rc3. EDIT: sqe 2) has the problem when using send() instead of sendzc() too, and it occurred with non-fixed buffers also.

BTW, I know sendfile() is not welcomed, but compared with a thread-mode server, sendfile() has the lowest cpu usage; in user mode there is no need to care about a pipe, and it is useful for many kinds of use cases (e.g. an http server). so is there any possibility of adding this or something like it?

beldzhang avatar Aug 31 '22 01:08 beldzhang

kernel....: 5.19.5
liburing..: 2.3-git
tested with READ and READ_FIXED; it's the same. after changing IOSQE_IO_LINK to IOSQE_IO_HARDLINK, this does NOT happen.

each ECANCELED causes the following 2 SQEs to fail, SEND and TIMEOUT

[DEVEL-10:~/.../kernel/dell]# perf trace -e io_uring:io_uring_fail_link --call-graph fp
     0.000 iou-0/24077 io_uring:io_uring_fail_link(ctx: 0xffff8883a103c800, req: 0xffff88810ccef600, user_data: 140370983686400, opcode: 22, link: 0xffff88810cceff00, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       DEBUGEND_$NZ_IORING (/root/work/devel/2/common/tools/benckmark/idrive6)
                                       [0] ([unknown])
                                       [0] ([unknown])
     0.008 iou-0/24077 io_uring:io_uring_fail_link(ctx: 0xffff8883a103c800, req: 0xffff88810ccef600, user_data: 140370983686400, opcode: 22, link: 0xffff88810ccee400, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       DEBUGEND_$NZ_IORING (/root/work/devel/2/common/tools/benckmark/idrive6)
                                       [0] ([unknown])
                                       [0] ([unknown])

beldzhang avatar Sep 02 '22 18:09 beldzhang

found that the notification slot was removed from both kernel and liburing. after some checking, up to 42ee2bd there is no flag to check whether send_zc is supported by the currently running kernel. is there any plan to add some flag, or should we just check kernel version >= 6.0?

beldzhang avatar Sep 03 '22 18:09 beldzhang

Never rely on a kernel version check. Either just try the command and see if it fails, or use io_uring_get_probe() and check if that opcode is marked as supported.

axboe avatar Sep 03 '22 18:09 axboe

found that the notification slot was removed from both kernel and liburing. after some checking, up to 42ee2bd there is no flag to check whether send_zc is supported by the currently running kernel. is there any plan to add some flag, or should we just check kernel version >= 6.0?

Agree with Jens on the testing side.

The API should be easier now. If it sends anything, io_uring posts 2 CQEs. The first one is the usual request completion with IORING_CQE_F_MORE set, and the second one is the "buffer-free notification" with the same cqe->user_data and IORING_CQE_F_NOTIF. If it fails (i.e. cqe->res < 0), there will be only one CQE with no IORING_CQE_F_MORE. So, you can look at IORING_CQE_F_MORE to figure out whether there will be one or two CQEs. I'll write a man page, I think this week.
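The completion scheme just described can be sketched as a small classifier. The flag values below are copied from <linux/io_uring.h> (guarded so real headers win if present); the enum and function names are this example's own.

```c
#include <stdint.h>

/* Flag values as in <linux/io_uring.h>; defined here only so the
 * sketch is self-contained on older headers. */
#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE  (1U << 1)  /* another CQE with same user_data follows */
#endif
#ifndef IORING_CQE_F_NOTIF
#define IORING_CQE_F_NOTIF (1U << 3)  /* this CQE is the buffer-free notification */
#endif

enum zc_cqe_kind { ZC_DONE_NO_NOTIF, ZC_DONE_NOTIF_PENDING, ZC_NOTIF };

static enum zc_cqe_kind classify_zc_cqe(int32_t res, uint32_t flags)
{
    (void)res; /* meaningful only on the first CQE (bytes sent, or < 0) */
    if (flags & IORING_CQE_F_NOTIF)
        return ZC_NOTIF;              /* second CQE: buffer may be reused */
    if (flags & IORING_CQE_F_MORE)
        return ZC_DONE_NOTIF_PENDING; /* first CQE: a notification will follow */
    return ZC_DONE_NO_NOTIF;          /* failed send: no second CQE coming */
}
```

A completion loop would switch on the returned kind: recycle the buffer only on ZC_NOTIF, and treat ZC_DONE_NO_NOTIF as an error path with res < 0.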

isilence avatar Sep 05 '22 13:09 isilence

That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)? Heads up, there is some discussion going on about the API, and the final version may still change (in the direction of simplification).

It is over TCP; the payload size is requested by the client side, from 512B to 256KB. already started upgrading the code; sendzc gives a noticeable reduction in CPU usage on my server.

Nice!

BTW, I know sendfile() is not welcomed, but compared with a thread-mode server, sendfile() has the lowest cpu usage; in user mode there is no need to care about a pipe, and it is useful for many kinds of use cases (e.g. an http server). so is there any possibility of adding this or something like it?

I don't have a strong opinion on that one; it could be implemented, but it would definitely require more changes than just adding a sendfile io_uring opcode.

isilence avatar Sep 05 '22 13:09 isilence

another issue found: because splice consumes more CPU and sendzc looks like it resolves this, I did a re-design to use fixed buffers; the linked sqes are:

Fixed buffers are faster, but fwiw you can also use normal buffers with sendzc.

1) read_fixed(data)
2) sendzc(head + data)
3) link_timeout

Just a general thought on performance, which might not matter for your case, but still: linked timeouts are not the fastest thing, so I'd experiment and see if there is a noticeable speedup from not having them. One idea for optimising it is to have a per-ring timer / timeout request (or probably a couple of them), and if/when they fire, issue a bunch of IORING_OP_ASYNC_CANCEL requests.

when testing with a solid image file and a raw hard drive, sqe 2) returned ECANCELED after about 1k~2k requests had already been processed. already checked that 1) returned the right length, and it happened with or without 3). it happens even when there is only one client connected. found in kernels: 5.15.63, 5.19.5, 6.0-rc3. EDIT: sqe 2) has the problem when using send() instead of sendzc() too, and it occurred with non-fixed buffers also.

Not entirely sure I get it; what do you mean by "happened with or without 3)"? Or maybe you have a simple reproducer? It should always post 3 CQEs per such link unless you use IORING_FEAT_CQE_SKIP.

isilence avatar Sep 05 '22 13:09 isilence

changed to io_uring_get_probe() to check sendzc support, thanks to Jens. will test the new API with 6.0-rc4 this week, and will check whether there is any performance loss with the linked timeout.

--

for the abnormal ECANCELED, I uploaded the testing program.

Not entirely sure I get it, what do you mean by "happend with or without 3)"? Or maybe you have a simple reproducer?

this happened when submitting 2 SQEs (read/send) or 3 SQEs (read/send/timeout). output of the testing server:

[DEVEL-10:~/.../uring/tests]# make s && ./server /dev/sda
gcc -g server.c -o server -l uring
 ++ image size = 250059350016
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 / 12413 !!
        resp-data: err=-125

in server.c, set sending_timeout to 1 to include the timeout, 0 to exclude it. tested on 5.19.5.

issue-643-v1.zip

beldzhang avatar Sep 05 '22 17:09 beldzhang

The first one is the usual request completion with IORING_CQE_F_MORE set

in this one cqe.res carries the right number of bytes sent

and the second one is the "buffer-free notification" with the same cqe->user_data and IORING_CQE_F_NOTIF

in this one cqe.res is 0, so user space must keep the sent count from the first one to handle protocol data.

is this the designed behavior?

beldzhang avatar Sep 06 '22 00:09 beldzhang

The first one is the usual request completion with IORING_CQE_F_MORE set

in this one cqe.res carries the right number of bytes sent

and the second one is the "buffer-free notification" with the same cqe->user_data and IORING_CQE_F_NOTIF

in this one cqe.res is 0,

is this the designed behavior?

Yes, that's by design, can be changed if there is a good reason for that.

so user space must keep the sent count from the first one to handle protocol data.

Not sure why you'd need the result at that point. The user should not modify data in the buffer until it gets the notification, but it's still allowed to read it, including with syscalls like write / send. The usual scenario on getting a notification is userspace freeing / reusing the buffer, and I expect the user knows everything it needs about the buffer, like its length.

isilence avatar Sep 06 '22 09:09 isilence

oh... sorry for not describing it clearly. what I want to say is the "protocol handling", in one request-response:

recv-protocol-head
if (cqe.res == head_length)
    send-head-and-data
    if (cqe.res == required_length)
        next_loop

after changing send-head-and-data to zerocopy, the cqe.res from the F_NOTIF one is zero and cannot be used to check the proper amount sent, so saving the cqe.res from the F_MORE one is required.
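The bookkeeping described here (save res from the F_MORE CQE, validate on F_NOTIF) can be sketched as a tiny per-request state struct. Flag values are copied from <linux/io_uring.h>; the struct and function names are this example's own.

```c
#include <stdint.h>

#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE  (1U << 1)
#endif
#ifndef IORING_CQE_F_NOTIF
#define IORING_CQE_F_NOTIF (1U << 3)
#endif

struct zc_send_state {
    int32_t sent;        /* cqe.res saved from the F_MORE completion */
    int     buf_in_use;  /* cleared once the F_NOTIF CQE arrives */
};

/* Feed both CQEs of one zc send through this; returns 1 once the
 * notification has arrived AND the full response went out. */
static int zc_on_cqe(struct zc_send_state *st, int32_t res, uint32_t flags,
                     int32_t required_length)
{
    if (flags & IORING_CQE_F_NOTIF) {
        st->buf_in_use = 0;                  /* kernel released the pages */
        return st->sent == required_length;
    }
    st->sent = res;                          /* first CQE: real byte count */
    st->buf_in_use = (flags & IORING_CQE_F_MORE) != 0;
    return 0;
}
```

This keeps the "if (cqe.res == required_length) next_loop" check working even though the second CQE carries res == 0.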

after more consideration, this is the best way to handle sendzc for now. +1.

beldzhang avatar Sep 06 '22 13:09 beldzhang

for the abnormal ECANCELED, I uploaded the testing program.

this happened when submitting 2 SQEs (read/send) or 3 SQEs (read/send/timeout). output of the testing server:

[DEVEL-10:~/.../uring/tests]# make s && ./server /dev/sda
gcc -g server.c -o server -l uring
 ++ image size = 250059350016
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 / 12413 !!
        resp-data: err=-125

in server.c, set sending_timeout to 1 to include the timeout, 0 to exclude it. tested on 5.19.5.

issue-643-v1.zip

Tried it out but doesn't reproduce for me, and I don't immediately see anything wrong.

isilence avatar Sep 08 '22 13:09 isilence

BTW, did you use O_DIRECT on the block device file when you were benchmarking zc send vs sendfile?

isilence avatar Sep 08 '22 13:09 isilence

Tried it out but doesn't reproduce for me, and I don't immediately see anything wrong.

my sda is a USB SSD; re-tested with an NVMe disk on 6.0-rc4, it's the same. oh, all these tests are on a raw disk WITHOUT any mount point on it.

my system is Slackware 15.0 with a customized kernel; might that be related?

BTW, did you use O_DIRECT on the block device file when you were benchmarking zc send vs sendfile?

no, I never open any image or disk with O_DIRECT

beldzhang avatar Sep 08 '22 14:09 beldzhang

Tried it out but doesn't reproduce for me, and I don't immediately see anything wrong.

my sda is a USB SSD; re-tested with an NVMe disk on 6.0-rc4, it's the same. oh, all these tests are on a raw disk WITHOUT any mount point on it.

my system is Slackware 15.0 with a customized kernel; might that be related?

Finally caught the bug and saw where it came from. I'd like to add your Reported-by tag to the fix commit so it's in the history that you found the bug and helped the community, but I'd need your full name and email. It looks like:

Reported-by: Pavel Begunkov <[email protected]>

isilence avatar Sep 09 '22 09:09 isilence

very happy to help the community. My Name: Beld Zhang, E-Mail: [email protected]

thanks to all for the hard work!

beldzhang avatar Sep 09 '22 14:09 beldzhang

Thanks, I've updated the commit and also included a link to this issue.

axboe avatar Sep 09 '22 14:09 axboe

Thanks, I've updated the commit and also included a link to this issue.

patched 4d9cb92 onto 6.0-rc4; my testing is OK now.

next I will check/test the error handling this issue began with.

beldzhang avatar Sep 09 '22 17:09 beldzhang

@beldzhang, is it going all right? Looks like you've deleted a message. Anyway, let me drop a couple of notes here:

  1. It'll be slower for small per-request payloads. So, for TCP, if you're sending 1MB per request you have nothing to worry about, but at 4KB it might be slower. If the payload size is not fixed, I'd recommend mixing normal and zc sends depending on payload size. I'm not sure where the tipping point is, but you can try to find it experimentally.

  2. I have a branch with zc sendmsg and thus iovec support; will be sending it for 6.1. Tell me if it'd help you to increase the payload size.

  3. If you have problems, send me perf profiles, one for the setup with normal sends and one for zc sends. I'll take a look.

  4. Why I asked about O_DIRECT for reads: without it there is an extra copy into the page cache, which is not great. Just fwiw.

  5. Do you utilise 100% of a CPU core? You may want to pin the process to one core.

isilence avatar Sep 11 '22 11:09 isilence

It might also be a latency problem. Don't forget that the second CQE will likely come with a good delay after the first one, and you absolutely don't want to wait for it before sending more I/O. In other words, don't strictly keep:

  1. submit send
  2. wait send complete
  3. wait for buffer complete
  4. goto 1)

but rather more like

  1. submit send
  2. wait send complete
  3. goto 1
  4. (step 4.5, asynchronous) wait for buffer complete
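The second model above implies a small pool of in-flight send buffers, so new sends keep flowing while F_NOTIF completions trail behind. A minimal sketch of that bookkeeping; the pool size and all names here are illustrative, not from the thread.

```c
#define POOL_SIZE 8  /* arbitrary; enough to cover notification latency */

struct buf_pool {
    int busy[POOL_SIZE];   /* 1 = sent, still awaiting its F_NOTIF CQE */
};

/* Grab a free buffer index for the next send, or -1 if all are in flight
 * (only then does the submitter have to block on a notification). */
static int pool_acquire(struct buf_pool *p)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!p->busy[i]) {
            p->busy[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Called when the F_NOTIF CQE for this buffer index arrives. */
static void pool_release(struct buf_pool *p, int idx)
{
    p->busy[idx] = 0;
}
```

With this, step 4 becomes asynchronous for free: the send loop runs on pool_acquire(), and notification CQEs just call pool_release() whenever they show up.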

isilence avatar Sep 11 '22 11:09 isilence

It might also be a latency problem...

yes, it is the latency... posted that message too rashly, before finding it out in my hardware environment: using a USB Ethernet adapter that has a large tx timer; previous tests were not affected, but sendzc is.

In other words, don't strictly keep: 1. submit send, 2. wait send complete, 3. wait for buffer complete, 4. goto 1)

sadly the current code is like this; 3) spends a long time waiting and causes the speed drop. will try to use the second model in the next devel cycle.

  1. It'll be slower for small per-request payloads...
  2. Why I asked about O_DIRECT for reads...

will test next week.

already tested error handling and it's all as expected. looks like we could close this issue successfully soon.

thanks a lot for the notes and advice.

beldzhang avatar Sep 11 '22 12:09 beldzhang

  1. error handling passes

  2. sendzc has lower cpu usage even when synchronously waiting for buffer completion

  3. async waiting for buffer completion is postponed because a buffer-management redesign is required

  4. another short recv found in 5.15.67 causes a recv-write link break; not found in 5.19 and 6.0-rc. in ...prep_recv() the flag MSG_WAITALL is set, but partial data is still received, and then the following write sqe is canceled. is it known? any plan to fix it or not?

  5. in liburing.h ...prep_sendzc_fixed was removed; is it possible to add it back?

beldzhang avatar Sep 12 '22 18:09 beldzhang

Finally caught the bug and see where it came from

the unexpected ECANCELED happened on the testing server with 2 CPUs, with 256KB random reads linked with sends. nvme0n1 has no mountpoint.

[nzel-server:~/w/devel/2/test/uring-inactive/tests]# uname -r
6.0.0-rc5-64-6
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 /    51 !!
        resp-data: err=-125

testing program attached: issue-643-v2.zip; also includes cli_write.c, for

  1. another short recv found in 5.15.67 causes a recv-write link break; not found in 5.19 and 6.0-rc
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# uname -r
5.15.64-64-6
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
 !! partial recv data:  65450, require: 262144
 !! partial recv data: 163705, require: 196694
 !! partial recv data:  32741, require:  32989
 !! partial recv data: 229344, require: 262144
...

update: patched and tested 62bb0647, hope it is useful

[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 / 16372 !!
        resp-data: err=-125
[nzel-server:~]# perf trace -e io_uring:io_uring_fail_link --call-graph fp
     0.000 server/1620 io_uring:io_uring_fail_link(ctx: 0xffff888111027800, req: 0xffff888104098e00, user_data: 3, opcode: 22, link: 0xffff888104072400, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       ctx_flush_and_put ([kernel.kallsyms])              <-- this is different with previous
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       io_run_task_work_sig ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       syscall (/lib64/libc-2.33.so)
                                       [0] ([unknown])
                                       [0] ([unknown])
     0.018 server/1620 io_uring:io_uring_fail_link(ctx: 0xffff888111027800, req: 0xffff888104098e00, user_data: 3, opcode: 22, link: 0xffff888104098a00, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       ctx_flush_and_put ([kernel.kallsyms])
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       io_run_task_work_sig ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       syscall (/lib64/libc-2.33.so)
                                       [0] ([unknown])
                                       [0] ([unknown])

beldzhang avatar Sep 15 '22 03:09 beldzhang

finally found it in io_read(): if there is no retry, currently only the last ret of io_iter_do_read is passed to kiocb_done, which then fails in __io_complete_rw_common() because ret != cqe.res; it should be the total amount all the io_iter_do_read calls got. and io->bytes_done is only increased when there is a retry.

tried to fix it but could not get it done (don't know how to deal with the retry).


added some logging and found the following: a. not understanding why cqe.res is less; b. when res in __io_complete_rw_common is shortened, this req is marked failed, and final_ret is not the correct value either.

[nzel-server:~/uring-test]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 /    16 !!
        resp-data: err=-125
[  239.661253] ++io_read:  cqe.res(262144)
[  239.700637] ++io_read:  cqe.res(262144)
[  239.700806] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.701070] ++io_read:  cqe.res(262144)
[  239.701320] ++io_read:  cqe.res(262144)
[  239.701431] ++io_read:  cqe.res(  3072)
[  239.701433] ++__io_complete_rw_common: res=  3072, cqe.res=  3072
[  239.701586] ++io_read:  cqe.res(262144)
[  239.701848] ++io_read:  cqe.res(262144)
[  239.701952] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.702135] ++io_read:  cqe.res(262144)
[  239.702323] ++io_read:  cqe.res(262144)
[  239.702436] ++io_read:  cqe.res(  1536)
[  239.702439] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.702565] ++io_read:  cqe.res(262144)
[  239.702788] ++io_read:  cqe.res(262144)
[  239.702897] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.703037] ++io_read:  cqe.res(262144)
[  239.703297] ++io_read:  cqe.res(262144)
[  239.703395] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.703551] ++io_read:  cqe.res(262144)
[  239.703768] ++io_read:  cqe.res(262144)
[  239.703878] ++io_read:  cqe.res(  1024)
[  239.703880] ++__io_complete_rw_common: res=  1024, cqe.res=  1024
[  239.703993] ++io_read:  cqe.res(262144)
[  239.704192] ++io_read:  cqe.res(262144)
[  239.704312] ++io_read:  cqe.res(  2048)
[  239.704315] ++__io_complete_rw_common: res=  2048, cqe.res=  2048
[  239.704426] ++io_read:  cqe.res(262144)
[  239.704623] ++io_read:  cqe.res(262144)
[  239.704752] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.704867] ++io_read:  cqe.res(262144)
[  239.705065] ++io_read:  cqe.res(262144)
[  239.705181] ++io_read:  cqe.res(  2560)
[  239.705183] ++__io_complete_rw_common: res=  2560, cqe.res=  2560
[  239.705291] ++io_read:  cqe.res(262144)
[  239.705476] ++io_read:  cqe.res(262144)
[  239.705591] ++io_read:  cqe.res(  1024)
[  239.705593] ++__io_complete_rw_common: res=  1024, cqe.res=  1024
[  239.705695] ++io_read:  cqe.res(262144)
[  239.705905] ++io_read:  cqe.res(262144)
[  239.706030] ++io_read:  cqe.res(  1536)
[  239.706033] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.706109] ++io_read:  cqe.res(262144)
[  239.706290] ++io_read:  cqe.res(262144)
[  239.706391] ++io_read:  cqe.res(  1024)
[  239.706393] ++__io_complete_rw_common: res=  1024, cqe.res=  1024
[  239.706472] ++io_read:  cqe.res(262144)
[  239.706648] ++io_read:  cqe.res(262144)
[  239.706747] ++io_read:  cqe.res(  1536)
[  239.706749] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.706829] ++io_read:  cqe.res(262144)
[  239.707013] ++io_read:  cqe.res(262144)
[  239.707111] ++io_read:  cqe.res(  1536)
[  239.707113] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.707194] ++io_read:  cqe.res(262144)
[  239.707386] ++io_read:  cqe.res(262144)
[  239.707504] ++__io_complete_rw_common: res=  3072, cqe.res=262144
[  239.707505]  !! __io_complete_rw_common(): res(  3072) != cqe.res(262144)
[  239.707507]    ++ req_set_fail: op=22, flags=40200004
[  239.707508] CPU: 31 PID: 1457 Comm: server Not tainted 6.0.0-rc5-64-6b #21
[  239.707510] Hardware name: Dell Inc. Precision T5600/000000, BIOS A19 06/30/2019
[  239.707512] Call Trace:
[  239.707513]  <TASK>
[  239.707515]  dump_stack_lvl+0x34/0x44
[  239.707518]  __io_complete_rw_common+0xac/0xd0
[  239.707521]  kiocb_done.cold+0xc/0x3c
[  239.707524]  io_read.cold+0x10f/0x2cf
[  239.707526]  ? __kmalloc+0x64/0x4b0
[  239.707529]  ? update_load_avg+0x7e/0x740
[  239.707532]  ? newidle_balance+0x2ba/0x3f0
[  239.707534]  ? io_writev_prep_async+0x80/0x80
[  239.707537]  io_issue_sqe+0x60/0x410
[  239.707540]  ? _raw_spin_unlock+0x12/0x30
[  239.707542]  ? finish_task_switch.isra.0+0xa3/0x2b0
[  239.707545]  io_req_task_submit+0x2a/0x70
[  239.707548]  handle_tw_list+0x9a/0xd0
[  239.707551]  tctx_task_work+0x4e/0x120
[  239.707554]  task_work_run+0x59/0x90
[  239.707557]  io_run_task_work_sig+0x43/0x60
[  239.707559]  __do_sys_io_uring_enter+0x2a2/0x8a0
[  239.707561]  ? percpu_ref_get_many+0x30/0x30
[  239.707564]  do_syscall_64+0x3d/0x90
[  239.707566]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  239.707568] RIP: 0033:0x7f3662bb1c09
[  239.707571] Code: 48 8d 3d 6a 1f 0d 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2f e2 0c 00 f7 d8 64 89 01 48
[  239.707573] RSP: 002b:00007fff43f32918 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
[  239.707575] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3662bb1c09
[  239.707577] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000005
[  239.707578] RBP: 0000000000404100 R08: 0000000000000000 R09: 0000000000000008
[  239.707579] R10: 0000000000000001 R11: 0000000000000216 R12: 00007fff43f32970
[  239.707580] R13: 00007fff43f329b8 R14: 0000000000000000 R15: 0000000000000000
[  239.707582]  </TASK>

beldzhang avatar Sep 18 '22 00:09 beldzhang

ok, a wild-west concept patch that fixes this abnormal ECANCELED, based on 6.0-rc6

--- rw-orig.c	2022-09-18 16:44:14.000000000 -0400
+++ rw.c	2022-09-18 19:53:58.725235656 -0400
@@ -706,7 +706,7 @@
 	struct kiocb *kiocb = &rw->kiocb;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	struct io_async_rw *io;
-	ssize_t ret, ret2;
+	ssize_t ret, ret2, ret_all = 0;
 	loff_t *ppos;
 
 	if (!req_has_async_data(req)) {
@@ -763,6 +763,7 @@
 	}
 
 	ret = io_iter_do_read(rw, &s->iter);
+	if (ret > 0) ret_all += ret;
 
 	if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
 		req->flags &= ~REQ_F_REISSUE;
@@ -814,7 +815,7 @@
 		iov_iter_advance(&s->iter, ret);
 		if (!iov_iter_count(&s->iter))
 			break;
-		io->bytes_done += ret;
+//		io->bytes_done += ret;
 		iov_iter_save_state(&s->iter, &s->iter_state);
 
 		/* if we can retry, do so with the callbacks armed */
@@ -830,8 +831,11 @@
 		 * here, and if we do, then just retry at the new offset.
 		 */
 		ret = io_iter_do_read(rw, &s->iter);
-		if (ret == -EIOCBQUEUED)
+		if (ret > 0) ret_all += ret;
+		if (ret == -EIOCBQUEUED) {
+			io->bytes_done += ret_all;
 			return IOU_ISSUE_SKIP_COMPLETE;
+		}
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
 		iov_iter_restore(&s->iter, &s->iter_state);
@@ -840,7 +844,7 @@
 	/* it's faster to check here then delegate to kfree */
 	if (iovec)
 		kfree(iovec);
-	return kiocb_done(req, ret, issue_flags);
+	return kiocb_done(req, ret_all, issue_flags);
 }
 
 int io_write(struct io_kiocb *req, unsigned int issue_flags)

beldzhang avatar Sep 19 '22 00:09 beldzhang

2. sendzc has lower cpu usage even when synchronously waiting for buffer completion

Great

3. async waiting for buffer completion is postponed because a buffer-management redesign is required

4. another short recv found in 5.15.67 causes a recv-write link break; not found in 5.19 and 6.0-rc
   in `...prep_recv()` the flag `MSG_WAITALL` is set, but partial data is still received, and then the following write sqe is canceled
   is it known? any plan to fix it or not?

The short-read fix hasn't been ported to older kernels. Are you using a stable kernel?

5. in liburing.h `...prep_sendzc_fixed` was removed; is it possible to add it back?

Yeah, can add it back

isilence avatar Sep 26 '22 15:09 isilence

oh, there are two different issues:

  1. another short recv found in 5.15.67 causes a recv-write link break; not found in 5.19 and 6.0-rc ...

this was found recently, and after comparing the source between these versions, it looks difficult to backport due to huge changes, so please ignore it.


The short read fix hasn't been ported to older kernels.

the unexpected ECANCELED happened on the testing server with 2 CPU, 256KB random read and linked with send.

the short read is not completely fixed yet; posted some messages starting from here: https://github.com/axboe/liburing/issues/643#issuecomment-1247522867 (testing code already included). issue-643-v2.zip

in my debugging, found that bytes_done and req->cqe.res lose sync when there are retries ...

Are you using a stable kernel?

currently we are running 5.15.x and preparing to upgrade to 6.0 or 6.1.

beldzhang avatar Sep 26 '22 17:09 beldzhang

Ah, I see. For backports, I do have them; I just didn't send them out, as I was waiting for another patch that needs to land upstream.

isilence avatar Sep 26 '22 18:09 isilence

finally found it in io_read(): if there is no retry, currently only the last ret of io_iter_do_read is passed to kiocb_done, which then fails in __io_complete_rw_common() because ret != cqe.res; it should be the total amount all the io_iter_do_read calls got. and io->bytes_done is only increased when there is a retry.

cqe.res stores the number of bytes we're currently trying to do, so how about this?

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 76ebcfebc9a6..c562203d7a67 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -823,6 +823,7 @@ int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return -EAGAIN;
 		}
 
+		req->cqe.res = iov_iter_count(&s->iter);
 		/*
 		 * Now retry read with the IOCB_WAITQ parts set in the iocb. If
 		 * we get -EIOCBQUEUED, then we'll get a notification when the

I also pushed to a branch with an explanation for convenience.

https://github.com/isilence/linux/commit/9a70c235e440f48b10a8903ef95d2c9bbd7ff0b3

edit: branch https://github.com/isilence/linux.git rw-link-break

isilence avatar Sep 26 '22 22:09 isilence