liburing
Handling non-blocking file descriptors consistently.
This is more of a discussion than an actual issue at the moment, but I hope to provide more concrete details as we explore the use cases.
We have an interesting problem regarding IORING_OP_READV/IORING_OP_WRITEV. In a general programming language environment, we don't always know if the file descriptor is in non-blocking mode or not.
But the behaviour of IORING_OP_READV/IORING_OP_WRITEV depends on it. In non-blocking mode, we can get a result of EAGAIN.
However, this is not very useful in the case of event-driven concurrency, where we want operations like this to be "asynchronously blocking" - i.e. the CQE should not be generated until the read/write can make progress.
In general, my understanding is that we'd need 3-4 system calls to work around this, something like:
int current_state = get_nonblocking_state(fd);
if (current_state != nonblocking) set_nonblocking_state(fd);
io_uring_read(..., fd, ...)
if (nonblocking_state_was_changed) restore_nonblocking_state(fd, current_state);
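Concretely, with fcntl(2) that workaround looks something like this (a rough sketch; sqe, iovecs, nr_vecs and offset as usual, error handling omitted):
int flags = fcntl(fd, F_GETFL);                      /* 1: query the current state */
if (flags & O_NONBLOCK)
    fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);         /* 2: temporarily clear O_NONBLOCK */
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
/* ... submit, wait for the CQE ... */
if (flags & O_NONBLOCK)
    fcntl(fd, F_SETFL, flags);                       /* 3: restore the original state */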
We can cache this to reduce the round trips, but it feels like this would be something better solved by a flag on IORING_OP_READV/IORING_OP_WRITEV. In fact, I'm having a hard time imagining any valid use case for non-blocking operation in the uring.
I'll try to test out specific cases where this is a problem and report back.
The whole point of io_uring is that it's asynchronous by design! If a call would block, io_uring either handles it asynchronously or punts it to a kernel worker thread for that request.
By using O_NONBLOCK, what you are saying is: let me choose what to do next when blocking happens, rather than letting io_uring deal with it.
There already is a flags option for this:
fd = open('path/to/file', O_RDONLY) # normal blocking read only open.
...
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset)
sqe.rw_flags |= RWF_NOWAIT # now read is non-blocking
...
# catch `-EAGAIN` in `cqe.res`, ...
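In liburing C terms that's roughly (a sketch; ring, fd and iovecs already set up, RWF_NOWAIT comes from <sys/uio.h> with _GNU_SOURCE or from <linux/fs.h>):
struct io_uring_cqe *cqe;
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
sqe->rw_flags = RWF_NOWAIT;            /* per-request non-blocking */
io_uring_submit(&ring);

io_uring_wait_cqe(&ring, &cqe);
if (cqe->res == -EAGAIN) {
    /* the read would have blocked */
}
io_uring_cqe_seen(&ring, cqe);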
Is there a flag for the opposite of RWF_NOWAIT?
Not that I know of. If you want a blocking read, just use a normal read(); don't use io_uring_read*.
We have file descriptors that might already have O_NONBLOCK set. But we still want to use them with io_uring without getting EAGAIN.
Hmm, that's a tricky situation. Monkey-patching the O_NONBLOCK flag to not actually be non-blocking might lead to other problems/bugs.
If you have access to the place where the original flags are being set
original_flags = O_CREAT | O_RDWR | O_NONBLOCK
You could add a check like
if original_flags & O_NONBLOCK:
# raise error saying `O_NONBLOCK` is not supported, invalid or something!
I am not really proud of that solution.
Why not just catch -EAGAIN and poll it? io_uring_prep_poll_add(sqe, fd, POLLIN)
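i.e. something like this rough sketch (ring, iovecs etc. assumed, error handling omitted):
if (cqe->res == -EAGAIN) {
    /* the fd wasn't ready: arm a poll, wait for POLLIN, then retry the read */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, fd, POLLIN);
    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);      /* completes once the fd is readable */
    io_uring_cqe_seen(&ring, cqe);

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
    io_uring_submit(&ring);
}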
Yes it’s one option but we wanted to use the direct read/write operations if possible.
I suppose you can also remove the flag by doing:
original_flags = O_CREAT | O_RDWR | O_NONBLOCK
new_flags = original_flags & ~O_NONBLOCK
It would be a pretty trivial change to arm poll rather than return -EAGAIN to the application, if O_NONBLOCK is set on the file. The main issue here is how to inform io_uring of that desired behavior. OP_READV/OP_WRITEV flags are really readv2/writev2 flags, so it'd then be a generic change to allow this. Which arguably would make sense, as the justification would be that you could do blocking IO without having to query/set file flags.
This applies to more than OP_READV/OP_WRITEV, though. It also applies to SEND/RECV and friends, so it's actually more of a generic change. Which means that an IOSQE_ flag would make more sense. The only potential issue there is that the flag space is pretty small, we have 8 bits and 6 are already used. Not a huge issue, we have two bits left, and the last bit could always be turned into a "look at extended bits" or something like that.
In any case, let me know if you want to test a patch for this. Should be easy to cook up.
I am happy to test and help with this.
Just thinking through what you described, do you think it would make sense for us to propose a flag to readv2/writev2 to be more specific about this?
That way, this doesn't become an io_uring specific feature, but works at the kernel level in a generic way.
e.g.
RWF_NONBLOCK
Perform the operation as if the descriptor was in `O_NONBLOCK` mode.
RWF_BLOCK
Perform the operation as if the descriptor was not in `O_NONBLOCK` mode.
Maybe it's a crazy proposal, but it seems logical to me, at least from my POV.
As mentioned, that only then works for the read/write variants, not for any other opcode. Hence I'd be leaning towards making it an sqe flag instead. If it was just applicable to OP_READV and friends, then the RWF approach would be preferable.
Do other opcodes also take flags? e.g. accept, open, close, etc.
Each opcode has a specific set of flags, like the RWF_* for read/write, but there's also a generic set of IOSQE_* flags that apply to any opcode. That handles things like links, etc. Things that sit above the specific request type.
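Roughly, in sqe terms (a sketch; fd, out_fd and iovecs assumed):
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);     /* generic IOSQE_* flag: link this sqe to the next one */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_writev(sqe, out_fd, iovecs, nr_vecs, offset);
sqe->rw_flags = RWF_DSYNC;                      /* opcode-specific flag (readv2/writev2 flags) */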
Oh that makes total sense. Okay, I'm keen to test this when you are ready.
cc @ciconia
I’ll spin a patch tomorrow morning.
It would be a pretty trivial change to arm poll rather than return -EAGAIN to the application, if O_NONBLOCK is set on the file.
@axboe O.o something like IOSQE_IO_BLOCKING_POLL*? If you can make this into a feature I would totally welcome it as well. Currently I have lines of code just checking -EAGAIN and polling it to account for both blocking/non-blocking flags. This feature would save many unnecessary round-trips.
Rather than just accounting for EAGAIN, it would be nice if it accounted for "blocking" states like EAGAIN, EALREADY, EINPROGRESS, EWOULDBLOCK, ...
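e.g. something like this little helper (a sketch; note that on Linux EWOULDBLOCK is the same value as EAGAIN):
#include <errno.h>
#include <stdbool.h>

/* treat these cqe->res values as "the operation would have blocked" */
static bool cqe_would_block(int res)
{
    return res == -EAGAIN || res == -EWOULDBLOCK ||
           res == -EALREADY || res == -EINPROGRESS;
}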
I agree; the vast majority of user-space code doesn't care about EAGAIN, it's like the halfway house of concurrency.
That's a great idea, adding such a flag will be very useful! Keep in mind though that if you want to support older Linux kernel versions (once this feature lands in a kernel release), you'll still have to provide some workaround, either resetting O_NONBLOCK or handling EAGAIN.
Probably something like IOSQE_IGNORE_NONBLOCK or whatever. It'll just be a way to force the normal poll path for pollable IO even if the fd is marked non-blocking.
Here's a first attempt. Let me know if you'd prefer a patch somewhere, not sure what this does to formatting... If it destroys whitespace, probably just use patch -l to ignore that part.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 46a25a7cb70a..25f72f35902d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -108,7 +108,7 @@
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \
IOSQE_IO_HARDLINK | IOSQE_ASYNC | \
- IOSQE_BUFFER_SELECT)
+ IOSQE_BUFFER_SELECT | IOSQE_IGNORE_NONBLOCK)
#define IO_TCTX_REFS_CACHE_NR (1U << 10)
@@ -704,6 +704,7 @@ enum {
REQ_F_HARDLINK_BIT = IOSQE_IO_HARDLINK_BIT,
REQ_F_FORCE_ASYNC_BIT = IOSQE_ASYNC_BIT,
REQ_F_BUFFER_SELECT_BIT = IOSQE_BUFFER_SELECT_BIT,
+ REQ_F_IGNORE_NONBLOCK_BIT = IOSQE_IGNORE_NONBLOCK_BIT,
/* first byte is taken by user flags, shift it to not overlap */
REQ_F_FAIL_BIT = 8,
@@ -740,6 +741,8 @@ enum {
REQ_F_FORCE_ASYNC = BIT(REQ_F_FORCE_ASYNC_BIT),
/* IOSQE_BUFFER_SELECT */
REQ_F_BUFFER_SELECT = BIT(REQ_F_BUFFER_SELECT_BIT),
+ /* IOSQE_IGNORE_NONBLOCK */
+ REQ_F_IGNORE_NONBLOCK = BIT(REQ_F_IGNORE_NONBLOCK_BIT),
/* fail rest of links */
REQ_F_FAIL = BIT(REQ_F_FAIL_BIT),
@@ -2444,6 +2447,15 @@ static bool io_resubmit_prep(struct io_kiocb *req)
return true;
}
+/*
+ * Honor REQ_F_NOWAIT only if REQ_F_IGNORE_NONBLOCK isn't set
+ */
+static bool io_req_nowait(struct io_kiocb *req)
+{
+ return (req->flags & (REQ_F_NOWAIT | REQ_F_IGNORE_NONBLOCK)) ==
+ REQ_F_NOWAIT;
+}
+
static bool io_rw_should_reissue(struct io_kiocb *req)
{
umode_t mode = file_inode(req->file)->i_mode;
@@ -2451,7 +2463,7 @@ static bool io_rw_should_reissue(struct io_kiocb *req)
if (!S_ISBLK(mode) && !S_ISREG(mode))
return false;
- if ((req->flags & REQ_F_NOWAIT) || (io_wq_current_is_worker() &&
+ if (io_req_nowait(req) || (io_wq_current_is_worker() &&
!(ctx->flags & IORING_SETUP_IOPOLL)))
return false;
/*
@@ -3223,7 +3235,7 @@ static bool io_rw_should_retry(struct io_kiocb *req)
struct kiocb *kiocb = &req->rw.kiocb;
/* never retry for NOWAIT, we just complete with -EAGAIN */
- if (req->flags & REQ_F_NOWAIT)
+ if (io_req_nowait(req))
return false;
/* Only for buffered IO */
@@ -3303,7 +3315,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
if (!force_nonblock && !(req->ctx->flags & IORING_SETUP_IOPOLL))
goto done;
/* no retry on NONBLOCK nor RWF_NOWAIT */
- if (req->flags & REQ_F_NOWAIT)
+ if (io_req_nowait(req))
goto done;
/* some cases will consume bytes even on error returns */
iov_iter_revert(iter, io_size - iov_iter_count(iter));
@@ -3311,7 +3323,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
} else if (ret == -EIOCBQUEUED) {
goto out_free;
} else if (ret <= 0 || ret == io_size || !force_nonblock ||
- (req->flags & REQ_F_NOWAIT) || !(req->flags & REQ_F_ISREG)) {
+ io_req_nowait(req) || !(req->flags & REQ_F_ISREG)) {
/* read all, failed, already did sync or don't want to retry */
goto done;
}
@@ -3434,7 +3446,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (ret2 == -EOPNOTSUPP && (kiocb->ki_flags & IOCB_NOWAIT))
ret2 = -EAGAIN;
/* no retry on NONBLOCK nor RWF_NOWAIT */
- if (ret2 == -EAGAIN && (req->flags & REQ_F_NOWAIT))
+ if (ret2 == -EAGAIN && io_req_nowait(req))
goto done;
if (!force_nonblock || ret2 != -EAGAIN) {
/* IOPOLL retry should happen for io-wq threads */
@@ -6455,7 +6467,7 @@ static void __io_queue_sqe(struct io_kiocb *req)
} else {
io_put_req(req);
}
- } else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
+ } else if (ret == -EAGAIN && !io_req_nowait(req)) {
if (!io_arm_poll_handler(req)) {
/*
* Queued up for async execution, worker will release
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f1f9ac114b51..582eee6e898b 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -70,6 +70,7 @@ enum {
IOSQE_IO_HARDLINK_BIT,
IOSQE_ASYNC_BIT,
IOSQE_BUFFER_SELECT_BIT,
+ IOSQE_IGNORE_NONBLOCK_BIT,
};
/*
@@ -87,6 +88,8 @@ enum {
#define IOSQE_ASYNC (1U << IOSQE_ASYNC_BIT)
/* select buffer from sqe->buf_group */
#define IOSQE_BUFFER_SELECT (1U << IOSQE_BUFFER_SELECT_BIT)
+/* ignore if file has O_NONBLOCK set */
+#define IOSQE_IGNORE_NONBLOCK (1U << IOSQE_IGNORE_NONBLOCK_BIT)
/*
* io_uring_setup() flags
I will try this out over the weekend.
I'll run some quick testing on this later, I'll post an updated version here if needed.
That's neat!
Does it account for cases where O_NONBLOCK is not set but the MSG_DONTWAIT flag is passed directly to the socket *recv call?
Can this be in the next Linux 5.13-rc* patch or is it more of a 5.14 thing?
I'm hoping we can transparently support this by using the poll fallback - i.e. assume it's going to work, but if we get EAGAIN, go to a fallback path.
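One way to do that fallback without an extra round trip is to link a poll in front of the retried read (a sketch; it relies on IOSQE_IO_LINK, and both cqes need checking since a failed poll cancels the linked read):
/* after the original read came back with -EAGAIN: */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_poll_add(sqe, fd, POLLIN);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);   /* only start the next sqe once the poll completes */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);

io_uring_submit(&ring);                       /* one submit, two linked sqes */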
Ok, seems I need to ask unpopular questions...
Disregarding generality for other io_uring requests, I have a problem with that design. We have a behaviour, blocking or not, and it is specified by rw flags (and also by file flags). So, it's blocking by default and RWF_NOWAIT overrides the behaviour to non-blocking. This one will be overriding the overridden behaviour, and that doesn't sound good from the design perspective.
To give an idea of the concern, the reductio ad absurdum here would be to also add IOSQE_IGNORE_IGNORE_NONBLOCK.
So, the question here is: does it add something that can't currently be achieved (e.g. by clearing/not setting non-blocking flags)? Finer blocking control, or just yet another BLOCK/NONBLOCK flag doing the same thing? If there are inconsistencies in how io_uring handles nonblock (and I remember there were at some point), we need to figure out and document how it should be treated.
Also curious, assuming that the library takes fds from the user but doesn't take over ownership, i.e. the user may use them in any other way they wish in parallel:
- How come you force 'blocking' behaviour for userspace passing in a non-blocking file, when that can be the desired outcome?
- Why can't the requirement of passing non-blocking files be forced on the user?
Can this be in the next Linux 5.13-rc* patch or is it more of a 5.14 thing?
@YoSTEALTH, 5.13 is feature-closed and will be released pretty soon, so definitely not 5.13.
Pavel, outside of the patch being a quick'n'dirty one, the intent is only to ignore O_NONBLOCK if it's set. We should not be fiddling with RWF_NOWAIT or ditto send/recvmsg flags. It's just for per-file state, not per-request.
@axboe My initial thought was: if cqe.res == -EAGAIN and the new IOSQE_* flag is set, wait/poll for the task to finish in kernel space, thus preventing a round trip to user space and back. This way no flags get messed with; everything is checked at the cqe end of the data. Is that even close to what's happening?
What you described, but only if O_NONBLOCK is set on the file. If you are setting request-specific flags, like RWF_NOWAIT, then the -EAGAIN is returned as before. At that point you are specifically asking for non-blocking checking; the new sqe flag should only apply to state that you don't necessarily control (e.g. the file flags).
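So, with a patched kernel the usage would look roughly like this (a sketch; IOSQE_IGNORE_NONBLOCK is only the flag proposed in this thread, it is not in any released kernel or liburing header):
/* fd may or may not have O_NONBLOCK set; either way we don't want -EAGAIN */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
io_uring_sqe_set_flags(sqe, IOSQE_IGNORE_NONBLOCK);   /* ignore O_NONBLOCK on the file */

/* RWF_NOWAIT keeps its old meaning: explicit per-request non-blocking */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
sqe->rw_flags = RWF_NOWAIT;                           /* this one can still return -EAGAIN */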
@axboe OK, thank you for clarifying. So it can only be used for the file's *open* flags (O_NONBLOCK) atm, which is such a small use case! Wouldn't it have been better for the developer to just remove the flag with new_flags = original_flags & ~O_NONBLOCK?
As much as I would like to see a new feature added to io_uring, I also wouldn't want to waste resources that could be better used!