eio icon indicating copy to clipboard operation
eio copied to clipboard

Fix pipe hang (issue #319)

Open haesbaert opened this issue 1 year ago • 8 comments

Turns out the kernel needs to handle blocking file descriptors differently, it needs to do thread-pooling and it's a different code path (also slower).

It also seems that this is bugged in some cases. This makes sense as I believe most people don't test with blocking FDs. I believe we should make all our FDs non blocking as a precaution.

There is a program listed in issue #319 that replicates this. I could hang it in < 5 seconds, and it's now running for 10minutes with the fix, so kaboom.

haesbaert avatar Sep 26 '22 10:09 haesbaert

There is a second hang in the direct_copy test, I'm not sure it is related now, but I could reproduce it before as well, this fixes the first one at least.

haesbaert avatar Sep 26 '22 10:09 haesbaert

There is a second hang in the direct_copy test, I'm not sure it is related now, but I could reproduce it before as well, this fixes the first one at least.

It's also likely I'm just hiding the original bug by changing some timing. Hold this for a while until I figure it out.

haesbaert avatar Sep 26 '22 10:09 haesbaert

Is there an upstream bug report for this somewhere (e.g. on https://github.com/axboe/liburing)?

talex5 avatar Sep 26 '22 13:09 talex5

Is there an upstream bug report for this somewhere (e.g. on https://github.com/axboe/liburing)?

I have to check deeper, I found a bunch of stuff related to bugs with pipes, also the commit for adding splice support for nonblocking in pipes is from June this year, it's all very recent (https://lore.kernel.org/all/[email protected]/)

Basically setting the pipes to nonblocking fix the first case, but the second case makes the splice call return EAGAIN, and restarting/handling is not enough. it ends up hanging at some point.

If I make both pipes nonblocking and disable splicing, everything works.

I need to isolate things better before opening something upstream, but it all seems a bit dicey and immature, I'm writing C equivalents to make sure we're not missing something.

haesbaert avatar Sep 26 '22 13:09 haesbaert

As noted in https://github.com/ocaml-multicore/eio/issues/319#issuecomment-1258087688, I can reproduce the behaviour in that C program.

haesbaert avatar Sep 26 '22 14:09 haesbaert

I believe we should make all our FDs non blocking as a precaution.

After reading through this PR and comments, I did this for my application, which solved hangs I was seeing under eio_linux for eio-ssl

anmonteiro avatar Sep 26 '22 17:09 anmonteiro

Probably outside of the scope of this PR, but shouldn't we make the sockets non blocking by default too - if they are not already? I was under the impression that the uring and luv backends both set sockets fd to be non-blocking by default. Is this not the case?

bikallem avatar Sep 27 '22 10:09 bikallem

So I've started to poke uring with perf(1) and test pipe in blocking and non blocking, can confirm that the blocking is using the worker thread while non blocking is not:

-N is nonblocking

sam:tmp: sudo perf stat -a -e io_uring:io_uring_poll_arm,io_uring:io_uring_queue_async_work -- ./uring_tests
pipe_bug blocking=1
read(A) cqe->res = 6 (buf=foobar)
read(B) cqe->res = 0

 Performance counter stats for 'system wide':

                 0      io_uring:io_uring_poll_arm
                 3      io_uring:io_uring_queue_async_work

       0.001237004 seconds time elapsed

sam:tmp: sudo perf stat -a -e io_uring:io_uring_poll_arm,io_uring:io_uring_queue_async_work -- ./uring_tests -N
pipe_bug blocking=0
read(A) cqe->res = 6 (buf=foobar)
read(B) cqe->res = 0

 Performance counter stats for 'system wide':

                 0      io_uring:io_uring_poll_arm
                 0      io_uring:io_uring_queue_async_work

       0.001059103 seconds time elapsed

haesbaert avatar Sep 27 '22 12:09 haesbaert

I've ran some more tests today and this is where we are:

  • The fix has been commited to mainline linux: https://github.com/torvalds/linux/commit/46a525e199e4037516f7e498c18f065b09df32ac
  • We can't expect users to be on kernel 6.1.
  • The fix we can do now is use pipes as non-blocking and disable splice, I couldn't reproduce the bug with this.
  • The kernel fix also fixes the splice case, so it's the same bug, so in the future we can use it.
  • There are more things that need to be changed in u-ring but I wouldn't address that now, for now I'm concerned with just providing something that won't hang. So I think we should set pipes to non-block and disable splice.
  • Even with the kernel fix, splice can return EAGAIN and we have to handle.

haesbaert avatar Oct 19 '22 18:10 haesbaert

Is this ready to merge, or does it need splice to be disabled first?

talex5 avatar Oct 20 '22 14:10 talex5

Is this ready to merge, or does it need splice to be disabled first?

Apologies, I should have been more clear, it's not ready, it was more like checking if this would be ok. I've also wanted to hunt the kernel versions where the patch has been applied and maybe test conditionally for disabling it, but I didn't get to it.

haesbaert avatar Oct 20 '22 15:10 haesbaert

This should be good to go, I updated the commit message as well.

haesbaert avatar Oct 20 '22 20:10 haesbaert