liburing
Handling non-blocking file descriptors consistently.
This is more of a discussion than an actual issue at the moment, but I hope to provide more concrete details as we explore the use cases.
We have an interesting problem regarding IORING_OP_READV/IORING_OP_WRITEV. In a general programming language environment, we don't always know if the file descriptor is in non-blocking mode or not.
But the behaviour of IORING_OP_READV/IORING_OP_WRITEV depends on it. In non-blocking mode, we can get a result of EAGAIN.
However, this is not very useful in the case of event-driven concurrency, where we want operations like this to be "asynchronously blocking" - i.e. the CQE should not be generated until the read/write can make progress.
In general, my understanding is that we'd need 3-4 system calls to work around this, something like:
int current_state = get_nonblocking_state(fd);
if (current_state != nonblocking) set_nonblocking_state(fd);
io_uring_read(..., fd, ...)
if (nonblocking_state_was_changed) restore_nonblocking_state(fd, current_state);
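Concretely, with fcntl(2) that workaround looks something like this (a rough sketch; sqe, iovecs, nr_vecs and offset as usual, error handling omitted):
int flags = fcntl(fd, F_GETFL);                      /* 1: query the current state */
if (flags & O_NONBLOCK)
    fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);         /* 2: temporarily clear O_NONBLOCK */
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
/* ... submit, wait for the CQE ... */
if (flags & O_NONBLOCK)
    fcntl(fd, F_SETFL, flags);                       /* 3: restore the original state */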
We can cache this to reduce the round trips, but it feels like this would be something better solved by a flag on IORING_OP_READV/IORING_OP_WRITEV. In fact, I'm having a hard time imagining any valid use case for non-blocking operation in the uring.
I'll try to test out specific cases where this is a problem and report back.
The whole point of io_uring is that it's asynchronous by design! If a call would block, io_uring either handles it asynchronously or punts it to a kernel worker thread for that request.
By using O_NONBLOCK, what you are saying is: let me choose what to do next when blocking happens, rather than letting io_uring deal with it.
There already is a flags option for this:
fd = open('path/to/file', O_RDONLY) # normal blocking read only open.
...
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset)
sqe.rw_flags |= RWF_NOWAIT # now read is non-blocking
...
# catch `-EAGAIN` in `cqe.res`, ...
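In liburing C terms that's roughly (a sketch; ring, fd and iovecs already set up, RWF_NOWAIT comes from <sys/uio.h> with _GNU_SOURCE or from <linux/fs.h>):
struct io_uring_cqe *cqe;
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
sqe->rw_flags = RWF_NOWAIT;            /* per-request non-blocking */
io_uring_submit(&ring);

io_uring_wait_cqe(&ring, &cqe);
if (cqe->res == -EAGAIN) {
    /* the read would have blocked */
}
io_uring_cqe_seen(&ring, cqe);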
Is there a flag for the opposite of RWF_NOWAIT?
Not that I know of. If you want a blocking read, just use a normal read(); don't use io_uring_read*.
We have file descriptors that might already have O_NONBLOCK set. But we still want to use them with io_uring without getting EAGAIN.
Hmm, that's a tricky situation. Monkey-patching the O_NONBLOCK flag to not actually be non-blocking might lead to other problems/bugs.
If you have access to the place where the original flags are being set
original_flags = O_CREAT | O_RDWR | O_NONBLOCK
You could add a check like
if original_flags & O_NONBLOCK:
# raise error saying `O_NONBLOCK` is not supported, invalid or something!
I am not really proud of that solution.
Why not just catch -EAGAIN and poll it? io_uring_prep_poll_add(sqe, fd, POLLIN)
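i.e. something like this rough sketch (ring, iovecs etc. assumed, error handling omitted):
if (cqe->res == -EAGAIN) {
    /* the fd wasn't ready: arm a poll, wait for POLLIN, then retry the read */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, fd, POLLIN);
    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);      /* completes once the fd is readable */
    io_uring_cqe_seen(&ring, cqe);

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
    io_uring_submit(&ring);
}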
Yes it’s one option but we wanted to use the direct read/write operations if possible.
I suppose you can also remove the flag by doing:
original_flags = O_CREAT | O_RDWR | O_NONBLOCK
new_flags = original_flags & ~O_NONBLOCK
It would be a pretty trivial change to arm poll rather than return -EAGAIN to the application, if O_NONBLOCK is set on the file. The main issue here is how to inform io_uring of that desired behavior. OP_READV/OP_WRITEV flags are really readv2/writev2 flags, so it'd then be a generic change to allow this. Which arguably would make sense, as the justification would be that you could do blocking IO without having to query/set file flags.
This applies to more than OP_READV/OP_WRITEV, though. It also applies to SEND/RECV and friends, so it's actually more of a generic change. Which means that an IOSQE_ flag would make more sense. The only potential issue there is that the flag space is pretty small, we have 8 bits and 6 are already used. Not a huge issue, we have two bits left, and the last bit could always be turned into a "look at extended bits" or something like that.
In any case, let me know if you want to test a patch for this. Should be easy to cook up.
I am happy to test and help with this.
Just thinking through what you described, do you think it would make sense for us to propose a flag to readv2/writev2 to be more specific about this?
That way, this doesn't become an io_uring specific feature, but works at the kernel level in a generic way.
e.g.
RWF_NONBLOCK
Perform the operation as if the descriptor was in `O_NONBLOCK` mode.
RWF_BLOCK
Perform the operation as if the descriptor was not in `O_NONBLOCK` mode.
Maybe it's a crazy proposal, but it seems logical to me, at least from my POV.
As mentioned, that only then works for the read/write variants, not for any other opcode. Hence I'd be leaning towards making it an sqe flag instead. If it was just applicable to OP_READV and friends, then the RWF approach would be preferable.
Do other opcodes also take flags? e.g. accept, open, close, etc.
Each opcode has a specific set of flags, like the RWF_* for read/write, but there's also a generic set of IOSQE_* flags that apply to any opcode. That handles things like links, etc. Things that sit above the specific request type.
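Roughly, in sqe terms (a sketch; fd, out_fd and iovecs assumed):
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);     /* generic IOSQE_* flag: link this sqe to the next one */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_writev(sqe, out_fd, iovecs, nr_vecs, offset);
sqe->rw_flags = RWF_DSYNC;                      /* opcode-specific flag (readv2/writev2 flags) */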
Oh that makes total sense. Okay, I'm keen to test this when you are ready.
cc @ciconia
I’ll spin a patch tomorrow morning.
It would be a pretty trivial change to arm poll rather than return -EAGAIN to the application, if O_NONBLOCK is set on the file.
@axboe O.o something like IOSQE_IO_BLOCKING_POLL*? If you can make this into a feature I would totally welcome it as well. Currently I have lines of code just checking -EAGAIN and polling it to account for both blocking/non-blocking flags. This feature would save many unnecessary round-trips.
Rather than just accounting for EAGAIN, it would be nice if it accounted for "blocking" states like EAGAIN, EALREADY, EINPROGRESS, EWOULDBLOCK, ...
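e.g. something like this little helper (a sketch; note that on Linux EWOULDBLOCK is the same value as EAGAIN):
#include <errno.h>
#include <stdbool.h>

/* treat these cqe->res values as "the operation would have blocked" */
static bool cqe_would_block(int res)
{
    return res == -EAGAIN || res == -EWOULDBLOCK ||
           res == -EALREADY || res == -EINPROGRESS;
}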
I agree; the vast majority of user-space code doesn't care about EAGAIN, it's like the halfway house of concurrency.
That's a great idea, adding such a flag will be very useful! Keep in mind though that if you want to support older Linux kernel versions (once this feature lands in a kernel release), you'll still have to provide some workaround, either resetting O_NONBLOCK or handling EAGAIN.
Probably something like IOSQE_IGNORE_NONBLOCK or whatever. It'll just be a way to force the normal poll path for pollable IO even if the fd is marked non-blocking.
Here's a first attempt. Let me know if you'd prefer a patch somewhere, not sure what this does to formatting... If it destroys whitespace, probably just use patch -l to ignore that part.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 46a25a7cb70a..25f72f35902d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -108,7 +108,7 @@
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \
IOSQE_IO_HARDLINK | IOSQE_ASYNC | \
- IOSQE_BUFFER_SELECT)
+ IOSQE_BUFFER_SELECT | IOSQE_IGNORE_NONBLOCK)
#define IO_TCTX_REFS_CACHE_NR (1U << 10)
@@ -704,6 +704,7 @@ enum {
REQ_F_HARDLINK_BIT = IOSQE_IO_HARDLINK_BIT,
REQ_F_FORCE_ASYNC_BIT = IOSQE_ASYNC_BIT,
REQ_F_BUFFER_SELECT_BIT = IOSQE_BUFFER_SELECT_BIT,
+ REQ_F_IGNORE_NONBLOCK_BIT = IOSQE_IGNORE_NONBLOCK_BIT,
/* first byte is taken by user flags, shift it to not overlap */
REQ_F_FAIL_BIT = 8,
@@ -740,6 +741,8 @@ enum {
REQ_F_FORCE_ASYNC = BIT(REQ_F_FORCE_ASYNC_BIT),
/* IOSQE_BUFFER_SELECT */
REQ_F_BUFFER_SELECT = BIT(REQ_F_BUFFER_SELECT_BIT),
+ /* IOSQE_IGNORE_NONBLOCK */
+ REQ_F_IGNORE_NONBLOCK = BIT(REQ_F_IGNORE_NONBLOCK_BIT),
/* fail rest of links */
REQ_F_FAIL = BIT(REQ_F_FAIL_BIT),
@@ -2444,6 +2447,15 @@ static bool io_resubmit_prep(struct io_kiocb *req)
return true;
}
+/*
+ * Honor REQ_F_NOWAIT only if REQ_F_IGNORE_NONBLOCK isn't set
+ */
+static bool io_req_nowait(struct io_kiocb *req)
+{
+ return (req->flags & (REQ_F_NOWAIT | REQ_F_IGNORE_NONBLOCK)) ==
+ REQ_F_NOWAIT;
+}
+
static bool io_rw_should_reissue(struct io_kiocb *req)
{
umode_t mode = file_inode(req->file)->i_mode;
@@ -2451,7 +2463,7 @@ static bool io_rw_should_reissue(struct io_kiocb *req)
if (!S_ISBLK(mode) && !S_ISREG(mode))
return false;
- if ((req->flags & REQ_F_NOWAIT) || (io_wq_current_is_worker() &&
+ if (io_req_nowait(req) || (io_wq_current_is_worker() &&
!(ctx->flags & IORING_SETUP_IOPOLL)))
return false;
/*
@@ -3223,7 +3235,7 @@ static bool io_rw_should_retry(struct io_kiocb *req)
struct kiocb *kiocb = &req->rw.kiocb;
/* never retry for NOWAIT, we just complete with -EAGAIN */
- if (req->flags & REQ_F_NOWAIT)
+ if (io_req_nowait(req))
return false;
/* Only for buffered IO */
@@ -3303,7 +3315,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
if (!force_nonblock && !(req->ctx->flags & IORING_SETUP_IOPOLL))
goto done;
/* no retry on NONBLOCK nor RWF_NOWAIT */
- if (req->flags & REQ_F_NOWAIT)
+ if (io_req_nowait(req))
goto done;
/* some cases will consume bytes even on error returns */
iov_iter_revert(iter, io_size - iov_iter_count(iter));
@@ -3311,7 +3323,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
} else if (ret == -EIOCBQUEUED) {
goto out_free;
} else if (ret <= 0 || ret == io_size || !force_nonblock ||
- (req->flags & REQ_F_NOWAIT) || !(req->flags & REQ_F_ISREG)) {
+ io_req_nowait(req) || !(req->flags & REQ_F_ISREG)) {
/* read all, failed, already did sync or don't want to retry */
goto done;
}
@@ -3434,7 +3446,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (ret2 == -EOPNOTSUPP && (kiocb->ki_flags & IOCB_NOWAIT))
ret2 = -EAGAIN;
/* no retry on NONBLOCK nor RWF_NOWAIT */
- if (ret2 == -EAGAIN && (req->flags & REQ_F_NOWAIT))
+ if (ret2 == -EAGAIN && io_req_nowait(req))
goto done;
if (!force_nonblock || ret2 != -EAGAIN) {
/* IOPOLL retry should happen for io-wq threads */
@@ -6455,7 +6467,7 @@ static void __io_queue_sqe(struct io_kiocb *req)
} else {
io_put_req(req);
}
- } else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
+ } else if (ret == -EAGAIN && !io_req_nowait(req)) {
if (!io_arm_poll_handler(req)) {
/*
* Queued up for async execution, worker will release
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f1f9ac114b51..582eee6e898b 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -70,6 +70,7 @@ enum {
IOSQE_IO_HARDLINK_BIT,
IOSQE_ASYNC_BIT,
IOSQE_BUFFER_SELECT_BIT,
+ IOSQE_IGNORE_NONBLOCK_BIT,
};
/*
@@ -87,6 +88,8 @@ enum {
#define IOSQE_ASYNC (1U << IOSQE_ASYNC_BIT)
/* select buffer from sqe->buf_group */
#define IOSQE_BUFFER_SELECT (1U << IOSQE_BUFFER_SELECT_BIT)
+/* ignore if file has O_NONBLOCK set */
+#define IOSQE_IGNORE_NONBLOCK (1U << IOSQE_IGNORE_NONBLOCK_BIT)
/*
* io_uring_setup() flags
I will try this out over the weekend.
I'll run some quick testing on this later, I'll post an updated version here if needed.
That's neat!
Does it account for cases where O_NONBLOCK is not set but the MSG_DONTWAIT flag is passed directly to the socket *recv call?
Can this be in the next Linux 5.13-rc* patch or is it more of a 5.14 thing?
I'm hoping we can transparently support this by using the poll fallback - i.e. assume it's going to work, but if we get EAGAIN, go to a fallback path.
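One way to do that fallback without an extra round trip is to link a poll in front of the retried read (a sketch; it relies on IOSQE_IO_LINK, and both cqes need checking since a failed poll cancels the linked read):
/* after the original read came back with -EAGAIN: */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_poll_add(sqe, fd, POLLIN);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);   /* only start the next sqe once the poll completes */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);

io_uring_submit(&ring);                       /* one submit, two linked sqes */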
Ok, seems I need to ask unpopular questions...
Disregarding generality for other io_uring requests, I have a problem with that design. We have a behaviour, blocking or not, and it is specified by rw flags (and also by file flags). So, it's blocking by default and RWF_NOWAIT overrides the behaviour to non-blocking. This one will be overriding the overridden behaviour, and that doesn't sound good from the design perspective.
To give an idea of the concern, the reductio ad absurdum here would be to also add IOSQE_IGNORE_IGNORE_NONBLOCK.
So, the question here is: does it add something that can't currently be achieved (e.g. by clearing/not setting non-blocking flags)? Finer blocking control, or just yet another BLOCK/NONBLOCK flag doing the same thing? If there are inconsistencies in how io_uring handles nonblock (and I remember there were at some point), we need to figure out and document how it should be treated.
Also curious, assuming that the library takes fds from the user but doesn't take over ownership, i.e. the user may use them in any other way they wish in parallel:
- How come you force 'blocking' behaviour for userspace passing in a non-blocking file, when that can be the desired outcome?
- Why can't the requirement of passing non-blocking files be forced on the user?
Can this be in the next Linux 5.13-rc* patch or is it more of a 5.14 thing?
@YoSTEALTH, 5.13 is feature-closed and will be released pretty soon, so definitely not 5.13.
Pavel, outside of the patch being a quick'n'dirty one, the intent is only to ignore O_NONBLOCK if it's set. We should not be fiddling with RWF_NOWAIT or ditto send/recvmsg flags. It's just for per-file state, not per-request.
@axboe My initial thought was: if cqe.res == -EAGAIN and the new IOSQE_* flag is set, wait/poll for the task to finish in kernel space, thus preventing a round trip to user space and back. This way no flags get messed with; everything is checked at the cqe end of the data. Is that even close to what's happening?
What you described, but only if O_NONBLOCK is set on the file. If you are setting request-specific flags, like RWF_NOWAIT, then the -EAGAIN is returned as before. At that point you are specifically asking for non-blocking checking; the new sqe flag should only apply to state that you don't necessarily control (e.g. the file flags).
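So, with a patched kernel the usage would look roughly like this (a sketch; IOSQE_IGNORE_NONBLOCK is only the flag proposed in this thread, it is not in any released kernel or liburing header):
/* fd may or may not have O_NONBLOCK set; either way we don't want -EAGAIN */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
io_uring_sqe_set_flags(sqe, IOSQE_IGNORE_NONBLOCK);   /* ignore O_NONBLOCK on the file */

/* RWF_NOWAIT keeps its old meaning: explicit per-request non-blocking */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
sqe->rw_flags = RWF_NOWAIT;                           /* this one can still return -EAGAIN */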
@axboe OK, thank you for clarifying. So it can only be used for the file's *open* flags (O_NONBLOCK) atm, which is such a small use case! Wouldn't it have been better for the developer to just remove the flag with new_flags = original_flags & ~O_NONBLOCK?
As much as I would like to see a new feature added to io_uring, I also wouldn't want to waste resources that could be better used!