
Poor performance on file operations

Open SUPERCILEX opened this issue 2 years ago • 66 comments

I'm hoping to take advantage of io_uring to minimize syscall count, but I can't get anywhere close to non-io_uring performance. My application is a fast version of rm: every directory is independently deleted on its own thread.

Here is my io_uring patch: https://github.com/SUPERCILEX/fuc/commit/a01a22bdc599d53fa8f62387ef95cdf4fd6e8aae?w=1. The gist is that I queue up a bunch of unlinks and then submit + wait for them to complete. I have to periodically drain the ring because I'm using getdents64 and passing the unlink path pointers from that directory buffer (so the unlinks must complete before I can make the next getdents call).
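
For reference, the same pattern in C with liburing would look roughly like the sketch below (the real tool is written in Rust, so names, buffer sizes, and error handling here are illustrative only): the unlinkat SQEs point into the getdents64 buffer, so everything queued must complete before the buffer can be refilled.

#define _GNU_SOURCE
#include <liburing.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Matches the kernel's getdents64 record layout. */
struct linux_dirent64 {
	unsigned long long d_ino;
	long long          d_off;
	unsigned short     d_reclen;
	unsigned char      d_type;
	char               d_name[];
};

/* Submit everything queued and wait for it, since the unlink paths point
 * into the dirent buffer. A real tool would also check each cqe->res. */
static int drain(struct io_uring *ring, unsigned inflight)
{
	int ret = io_uring_submit_and_wait(ring, inflight);
	if (ret < 0)
		return ret;
	io_uring_cq_advance(ring, inflight);
	return 0;
}

int delete_dir_contents(struct io_uring *ring, int dirfd)
{
	char buf[16384];
	unsigned inflight = 0;

	for (;;) {
		long n = syscall(SYS_getdents64, dirfd, buf, sizeof(buf));
		if (n <= 0)
			return (int)n;

		for (long off = 0; off < n;) {
			struct linux_dirent64 *d = (void *)(buf + off);
			off += d->d_reclen;
			if (!strcmp(d->d_name, ".") || !strcmp(d->d_name, ".."))
				continue;

			struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
			if (!sqe) {
				/* SQ full: reap before queueing more. */
				if (drain(ring, inflight) < 0)
					return -1;
				inflight = 0;
				sqe = io_uring_get_sqe(ring);
			}
			/* d_name points into buf, so these must complete
			 * before the next getdents64 call reuses it. */
			io_uring_prep_unlinkat(sqe, dirfd, d->d_name, 0);
			inflight++;
		}
		if (drain(ring, inflight) < 0)
			return -1;
		inflight = 0;
	}
}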

I've experimented with a bunch of options:

  • IORING_SETUP_COOP_TASKRUN and IORING_SETUP_SINGLE_ISSUER don't seem to make a difference.
  • IORING_SETUP_DEFER_TASKRUN is ~10% worse which is surprising.
  • 128 entries seems optimal. More and things get a little slower. Very surprising is that a low entry count is extremely slow: with only 1 entry we're talking a 3x slowdown, which I don't understand. I'd expect on-par performance with non-io_uring since the syscall count should be equivalent.
  • Using either iowq_max_workers = [1, 0] or IO_LINK prevents the performance cliff caused by every file in a directory being deleted on its own worker thread (see the sketch after this list).
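
For reference, capping the io-wq pool like that is a single registration call in liburing; a minimal sketch (the [1, 0] values are the ones from the bullet above):

#include <liburing.h>

/* values[0] caps the "bounded" workers (regular file / fs work such as
 * unlink), values[1] the "unbounded" ones; 0 leaves that limit unchanged
 * and just reports the current value back. */
int cap_iowq_workers(struct io_uring *ring)
{
	unsigned int values[2] = { 1, 0 };
	int ret = io_uring_register_iowq_max_workers(ring, values);
	/* On success, values[] now holds the previous limits. */
	return ret;
}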

Ideally, I'd like to be able to tell io_uring to execute the submission queue on the calling thread since that's what will be most efficient. I would have hoped that IORING_SETUP_DEFER_TASKRUN did that, but it does not appear to be the case.


Linux 6.2.6-76060206-generic

Benchmark with hyperfine --warmup 3 -N --prepare "ftzz g -n 100K /tmp/ftzz" "./old /tmp/ftzz" "./new /tmp/ftzz" where old is master and new is the branched binary, both built with cargo b --release (copy from target/release/rmz).

Benchmark 1: ./old /tmp/ftzz
  Time (mean ± σ):      23.9 ms ±   1.6 ms    [User: 10.2 ms, System: 285.4 ms]
  Range (min … max):    21.0 ms …  28.4 ms    59 runs
 
Benchmark 2: ./new /tmp/ftzz
  Time (mean ± σ):      31.8 ms ±   1.8 ms    [User: 4.4 ms, System: 361.6 ms]
  Range (min … max):    28.3 ms …  36.1 ms    49 runs
 
Summary
  './old /tmp/ftzz' ran
    1.33 ± 0.12 times faster than './new /tmp/ftzz'

SUPERCILEX avatar Mar 22 '23 21:03 SUPERCILEX

You use IORING_SETUP_SINGLE_ISSUER, but share the io_uring across threads. Try creating an io_uring in each thread.

redbaron avatar Mar 23 '23 16:03 redbaron

Oh yep, I forgot to mention that I already have a uring per thread. IORING_SETUP_ATTACH_WQ didn't seem to make a difference.

SUPERCILEX avatar Mar 23 '23 16:03 SUPERCILEX

Some opcodes are just never going to be super fast, until they get converted to being able to do nonblocking issue. UNLINK is one of them - it'll always return -EAGAIN on inline issue, which means it gets punted to the internal io-wq pool for handling. This is why it isn't always faster than a sync unlink(), as even if we could perform it without blocking, we're still offloading it to a thread.

The situation might change if you do:

echo 3 > /proc/sys/vm/drop_caches

before the run, as it'd be cache cold at that point. But depending on what you unlink, you would probably still only have a few of the sync unlinks blocking before the metadata is cached.

axboe avatar Mar 23 '23 23:03 axboe

Is there documentation that explains the io_uring architecture? I don't know what it would mean for a syscall to be non-blocking (though I'm guessing it means start the work and set up some interrupt to be notified when it's done).

If anything, I think I want blocking behavior because that should remove pointless overhead waiting on queues. It'd be neat if there was a "do it inline anyway" flag or perhaps that can be inferred from an iowq_max_workers = [1, 0] + submit_and_wait(queue_size) call (though a flag is probably clearer).

SUPERCILEX avatar Mar 23 '23 23:03 SUPERCILEX

It'd be neat if there was a "do it inline anyway" flag or perhaps that can be inferred from an iowq_max_workers = [1, 0] + submit_and_wait(queue_size) call (though a flag is probably clearer).

It can't be inferred from that, but yes I agree, I have considered that. It'd be trivial to add as an IORING_ENTER_INLINE flag or similar.

axboe avatar Mar 23 '23 23:03 axboe

I can do a quick hack of that if you want to test it...

axboe avatar Mar 23 '23 23:03 axboe

I can do a quick hack of that if you want to test it...

If you'll let me know how to test it :sweat_smile:, then sure!

SUPERCILEX avatar Mar 23 '23 23:03 SUPERCILEX

Something like the below, totally untested...

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1d59c816a5b8..539b1e3ac695 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -442,6 +442,7 @@ struct io_cqring_offsets {
 #define IORING_ENTER_SQ_WAIT		(1U << 2)
 #define IORING_ENTER_EXT_ARG		(1U << 3)
 #define IORING_ENTER_REGISTERED_RING	(1U << 4)
+#define IORING_ENTER_INLINE		(1U << 5)
 
 /*
  * Passed in for io_uring_setup(2). Copied back with updated info on success
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 24be4992821b..dfb52497e49e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2078,12 +2078,11 @@ static void io_queue_async(struct io_kiocb *req, int ret)
 		io_queue_linked_timeout(linked_timeout);
 }
 
-static inline void io_queue_sqe(struct io_kiocb *req)
-	__must_hold(&req->ctx->uring_lock)
+static inline void __io_queue_sqe(struct io_kiocb *req, unsigned issue_flags)
 {
 	int ret;
 
-	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
+	ret = io_issue_sqe(req, issue_flags);
 
 	/*
 	 * We async punt it if the file wasn't marked NOWAIT, or if the file
@@ -2095,6 +2094,12 @@ static inline void io_queue_sqe(struct io_kiocb *req)
 		io_queue_async(req, ret);
 }
 
+static inline void io_queue_sqe(struct io_kiocb *req)
+	__must_hold(&req->ctx->uring_lock)
+{
+	__io_queue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
+}
+
 static void io_queue_sqe_fallback(struct io_kiocb *req)
 	__must_hold(&req->ctx->uring_lock)
 {
@@ -2295,10 +2300,11 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
 }
 
 static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
-			 const struct io_uring_sqe *sqe)
+			 const struct io_uring_sqe *sqe, bool force_inline)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_submit_link *link = &ctx->submit_state.link;
+	unsigned issue_flags = IO_URING_F_COMPLETE_DEFER;
 	int ret;
 
 	ret = io_init_req(ctx, req, sqe);
@@ -2344,6 +2350,9 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		return 0;
 	}
 
+	if (!force_inline)
+		issue_flags |= IO_URING_F_NONBLOCK;
+	__io_queue_sqe(req, issue_flags);
 	io_queue_sqe(req);
 	return 0;
 }
@@ -2425,7 +2434,7 @@ static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
 	return false;
 }
 
-int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
+int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, bool force_inline)
 	__must_hold(&ctx->uring_lock)
 {
 	unsigned int entries = io_sqring_entries(ctx);
@@ -2454,7 +2463,7 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 		 * Continue submitting even for sqe failure if the
 		 * ring was setup with IORING_SETUP_SUBMIT_ALL
 		 */
-		if (unlikely(io_submit_sqe(ctx, req, sqe)) &&
+		if (unlikely(io_submit_sqe(ctx, req, sqe, force_inline)) &&
 		    !(ctx->flags & IORING_SETUP_SUBMIT_ALL)) {
 			left--;
 			break;
@@ -3465,7 +3474,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 
 	if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP |
 			       IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG |
-			       IORING_ENTER_REGISTERED_RING)))
+			       IORING_ENTER_REGISTERED_RING |
+			       IORING_ENTER_INLINE)))
 		return -EINVAL;
 
 	/*
@@ -3521,7 +3531,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			goto out;
 
 		mutex_lock(&ctx->uring_lock);
-		ret = io_submit_sqes(ctx, to_submit);
+		ret = io_submit_sqes(ctx, to_submit, flags & IORING_ENTER_INLINE);
 		if (ret != to_submit) {
 			mutex_unlock(&ctx->uring_lock);
 			goto out;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2711865f1e19..ff90ebb33850 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -62,7 +62,7 @@ int io_uring_alloc_task_context(struct task_struct *task,
 				struct io_ring_ctx *ctx);
 
 int io_poll_issue(struct io_kiocb *req, bool *locked);
-int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr);
+int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, bool force_inline);
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin);
 void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node);
 int io_req_prep_async(struct io_kiocb *req);
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 0119d3f1a556..fcf1f1753fde 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -190,7 +190,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 		 */
 		if (to_submit && likely(!percpu_ref_is_dying(&ctx->refs)) &&
 		    !(ctx->flags & IORING_SETUP_R_DISABLED))
-			ret = io_submit_sqes(ctx, to_submit);
+			ret = io_submit_sqes(ctx, to_submit, false);
 		mutex_unlock(&ctx->uring_lock);
 
 		if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))

axboe avatar Mar 24 '23 00:03 axboe

Easiest way to test would be to just enable it unconditionally in liburing and recompile the library and your test app. Something like this, again utterly untested:

diff --git a/src/queue.c b/src/queue.c
index b784b10c0996..c1525eb89ce1 100644
--- a/src/queue.c
+++ b/src/queue.c
@@ -69,7 +69,7 @@ static int _io_uring_get_cqe(struct io_uring *ring,
 
 	do {
 		bool need_enter = false;
-		unsigned flags = 0;
+		unsigned flags = (1U << 5);
 		unsigned nr_available;
 		int ret;
 
@@ -93,7 +93,7 @@ static int _io_uring_get_cqe(struct io_uring *ring,
 			need_enter = true;
 		}
 		if (data->wait_nr > nr_available || need_enter) {
-			flags = IORING_ENTER_GETEVENTS | data->get_flags;
+			flags |= IORING_ENTER_GETEVENTS | data->get_flags;
 			need_enter = true;
 		}
 		if (sq_ring_needs_enter(ring, data->submit, &flags))
@@ -378,6 +378,7 @@ static int __io_uring_submit(struct io_uring *ring, unsigned submitted,
 			flags |= IORING_ENTER_GETEVENTS;
 		if (ring->int_flags & INT_FLAG_REG_RING)
 			flags |= IORING_ENTER_REGISTERED_RING;
+		flags |= (1U << 5);
 
 		ret = __sys_io_uring_enter(ring->enter_ring_fd, submitted,
 					   wait_nr, flags, NULL);

We'd probably want this to be a specific io_uring_submit_foo() helper instead, but this will do for testing. Would be interesting to re-run your unlink test with that.

axboe avatar Mar 24 '23 00:03 axboe

For that first patch, is there a way to include it in a live kernel? Or do I need to build and boot my own kernel? Any pointers would be appreciated.

SUPERCILEX avatar Mar 24 '23 00:03 SUPERCILEX

You need to patch and build the kernel, I'm afraid. If you want, send me your test app and I can run it here and we can compare. That might be easier.

axboe avatar Mar 24 '23 00:03 axboe

If you want, send me your test app and I can run it here and we can compare. That might be easier.

Yes please :). If you'd like, I can build binaries for you provided a target architecture (I assume x86_64?). Otherwise:

# Install rust if necessary: https://doc.rust-lang.org/cargo/getting-started/installation.html
git clone git@github.com:SUPERCILEX/fuc.git
cd fuc
git checkout io_uring_rmz

cargo +nightly build --release --workspace
cp target/release/rmz force

git checkout a01a22
cargo +nightly build --release --workspace
cp target/release/rmz patchless

# Benchmark
cargo install hyperfine
hyperfine --warmup 3 -N --prepare "ftzz g -n 100K /tmp/ftzz" "./patchless /tmp/ftzz" "./force /tmp/ftzz"

Note that I disabled IOSQE_IO_LINK chaining since I assume it's not necessary if everything is run inline?

SUPERCILEX avatar Mar 24 '23 01:03 SUPERCILEX

axboe@r7525 ~/gi/fuc (io_uring_rmz)> hyperfine --warmup 3 -N --prepare "ftzz g -n 1000K /dev/shm/ftzz" "taskset -c 0 ./patchless /dev/shm/ftzz" "taskset -c 0 ./force /dev/shm/ftzz"
Benchmark 1: taskset -c 0 ./patchless /dev/shm/ftzz
  Time (mean ± σ):      2.078 s ±  0.038 s    [User: 0.008 s, System: 2.066 s]
  Range (min … max):    2.019 s …  2.127 s    10 runs
 
Benchmark 2: taskset -c 0 ./force /dev/shm/ftzz
  Time (mean ± σ):      1.808 s ±  0.033 s    [User: 0.012 s, System: 1.785 s]
  Range (min … max):    1.771 s …  1.882 s    10 runs
 
Summary
  'taskset -c 0 ./force /dev/shm/ftzz' ran
    1.15 ± 0.03 times faster than 'taskset -c 0 ./patchless /dev/shm/ftzz'

axboe avatar Mar 24 '23 13:03 axboe

Not sure how representative this is on tmpfs, as profiles show we're spending most of the time on grabbing a lock for eviction:

+   85.36%  force    [kernel.vmlinux]  [k] queued_spin_lock_slowpath

axboe avatar Mar 24 '23 13:03 axboe

Going to assume that's why you pinned the benches to a core.

Your patch does fix the slowdown! I ran the same benchmark but between uring and non-uring (with /tmp as a tmpfs) and found that uring was 15% slower than non-uring. Since your results show a 15% speedup with force, I'm pretty sure that puts uring at parity with normal syscalls.

Now whether or not uring would actually be faster for rm isn't clear. I think the cp case has a lot more potential since it has way more syscalls: open-stat-open-copy_file_range-close-close. With fixed files, the two closes can be deleted entirely, open-stat can be submitted in one go, and so can open-copy_file_range. That should give uring a pretty big efficiency advantage that I would hope translates to better performance.

$ hyperfine --warmup 3 -N --prepare "ftzz g -n 1M /tmp/ftzz" "taskset -c 0 ./normal /tmp/ftzz" "taskset -c 0 ./uring /tmp/ftzz" 
Benchmark 1: taskset -c 0 ./normal /tmp/ftzz
  Time (mean ± σ):      1.675 s ±  0.098 s    [User: 0.053 s, System: 1.521 s]
  Range (min … max):    1.606 s …  1.834 s    10 runs
 
Benchmark 2: taskset -c 0 ./uring /tmp/ftzz
  Time (mean ± σ):      1.928 s ±  0.012 s    [User: 0.010 s, System: 1.853 s]
  Range (min … max):    1.919 s …  1.959 s    10 runs
 
Summary
  'taskset -c 0 ./normal /tmp/ftzz' ran
    1.15 ± 0.07 times faster than 'taskset -c 0 ./uring /tmp/ftzz'
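
For the cp case, the "open-stat in one go" part might look roughly like the liburing sketch below (an untested guess on my part; it assumes a file table was registered with io_uring_register_files_sparse, and copy_file_range isn't shown since it has no io_uring opcode):

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <sys/stat.h>

/* Chain an open into a fixed-file slot with a statx on the same path.
 * No userspace fd is created, so no close() is needed later. The slot
 * number, flags, and statx mask here are illustrative. */
int queue_open_and_stat(struct io_uring *ring, const char *path,
			unsigned file_slot, struct statx *stx)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_openat_direct(sqe, AT_FDCWD, path, O_RDONLY, 0, file_slot);
	io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

	/* Linked statx: only runs if the open succeeded. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_statx(sqe, AT_FDCWD, path, 0, STATX_SIZE, stx);

	return io_uring_submit(ring);
}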

SUPERCILEX avatar Mar 24 '23 20:03 SUPERCILEX

Also note that my testing was done with all kinds of security mitigations turned off; obviously the batching would benefit greatly with mitigations turned ON, as the syscalls and switches become more expensive there. IOW, it's the worst case for the comparison.

I think the main questions to ponder here are:

  1. Should this be an io_uring_enter(2) flag, or should it be a ring setup flag? If the former, we probably want to use a FEAT flag for it too.
  2. What's a good name? INLINE kind of describes it, but only if you know the internals. Suggestions welcome!

I changed various things in the patch I tested; once we get a final design nailed down, I'll post them as well.

axboe avatar Mar 24 '23 20:03 axboe

I think the cp case has a lot more potential since it has way more syscalls: open-stat-open-copy_file_range-close-close. With fixed files, the two closes can be deleted entirely, open-stat can be submitted in one go, and so can open-copy_file_range. That should give uring a pretty big efficiency advantage that I would hope translates to better performance.

Definitely agree.

axboe avatar Mar 24 '23 20:03 axboe

Should this be an io_uring_enter(2) flag, or should it be a ring setup flag? If the former, we probably want to use a FEAT flag for it too.

Is it possible to put the flag in io_uring_sqe entries? So each entry can choose where it wants to go? Otherwise, io_uring_enter seems more flexible since you could decide to change it up without having to recreate the uring.

What's a good name? INLINE kind of describes it, but only if you know the internals. Suggestions welcome!

Maybe NO_ASYNC? Since we already have ASYNC, it kinda makes sense to have a don't-do-that option. But I'm not sure if it's an accurate name.

SUPERCILEX avatar Mar 24 '23 22:03 SUPERCILEX

NO_ASYNC isn't great, as it may very well be async. E.g. if you can queue it up for normal async execution, then you'd still do that. The point of the flag is just to do those opcodes that would otherwise always punt to io-wq inline, instead of punting them to io-wq. NO_OFFLOAD? NO_IOWQ? NO_IO_WORKER?

axboe avatar Mar 25 '23 03:03 axboe

Gotcha, INLINE, NO_OFFLOAD, and NO_IOWQ all seem pretty reasonable. No real preference unfortunately. :)

SUPERCILEX avatar Mar 25 '23 04:03 SUPERCILEX

It could go in the sqe, but there's only one flag left for use. And I don't think people will generally mix and match these - either whatever is using the ring is fine with potential submission stalls due to blocking, or it's not. Hence I do think a setup flag would be just fine, but the enter flag is a bit more flexible. Per-sqe flag probably going too far.

I'm partial to NO_IOWQ so I'll spin it like that.

axboe avatar Mar 25 '23 14:03 axboe

Makes sense about the sqes. I still think this flag would be more useful with enter (for example maybe the copy_file_ranges would be faster if executed in parallel), but I'm not sure how this feature would interact with IORING_SETUP_SQPOLL as they seem incompatible.

SUPERCILEX avatar Mar 25 '23 18:03 SUPERCILEX

Agree, enter is more useful for sure. But yes, for SQPOLL, it'd have to be a setup flag.

axboe avatar Mar 25 '23 23:03 axboe

But yes, for SQPOLL, it'd have to be a setup flag.

As in putting the NO_IOWQ flag on enter is a problem? If not, I don't see why you'd use SQPOLL with NO_IOWQ, so putting NO_IOWQ on enter should be fine.

SUPERCILEX avatar Mar 25 '23 23:03 SUPERCILEX

Not sure that SQPOLL and NO_IOWQ would be mutually exclusive, though.

axboe avatar Mar 25 '23 23:03 axboe

Hmmm, this is kinda gross, but could enter be stateful for SQPOLL? As in it says "everything processed after now will be NO_IOWQ." The semantics of when "now" is seem unpleasant to pin down. That's where making the flag part of each entry simplifies things, but using up the last flag does seem a little excessive. So I guess being part of setup is simplest, but forcing the uring to stay NO_IOWQ is a bit of a bummer.

SUPERCILEX avatar Mar 26 '23 02:03 SUPERCILEX

You can't really synchronize with SQPOLL like that. By the time you've filled in the SQEs you want to submit, SQPOLL could already be processing them. Or vice versa, you set it and it's still processing entries from previously. I think that'd be a mess to try and coordinate, and you really can't unless you have the thread idle on either side of that sequence. Or use one of the SQ ring flags for it, which would still be racy. For SQPOLL, if we don't have SQE flags, then it'd have to be a per-instance kind of thing. In practice, a ring setup flag would be fine. If you have a need for both kinds of behavior, you could always set up two rings and have them share the SQPOLL thread. Then while processing ring A it could be in NO_IOWQ mode, and while processing ring B it would be in the normal mode. Without an SQE flag, SQPOLL would have to be set up a bit differently.
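
A rough sketch of that two-ring setup sharing one SQPOLL thread (untested; ring sizes and idle time are just placeholders):

#include <liburing.h>
#include <string.h>

int setup_shared_sqpoll(struct io_uring *ring_a, struct io_uring *ring_b)
{
	struct io_uring_params p;
	int ret;

	/* First ring owns the SQPOLL thread. */
	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQPOLL;
	p.sq_thread_idle = 1000;	/* ms before the poller sleeps */
	ret = io_uring_queue_init_params(128, ring_a, &p);
	if (ret < 0)
		return ret;

	/* Second ring attaches to the first and is serviced by the same
	 * thread; this is where a per-ring NO_IOWQ-style setup flag could
	 * differ between the two. */
	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_ATTACH_WQ;
	p.wq_fd = ring_a->ring_fd;
	return io_uring_queue_init_params(128, ring_b, &p);
}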

For the non-SQPOLL case, I don't like the ring setup flag at all, I do think the enter flag is a lot cleaner.

axboe avatar Mar 26 '23 12:03 axboe

The reasons why some would want this per enter call and others per ring seem clear enough. The reason for a name like NO_IOWQ also seemed clear: it causes the operations found in the submission queue to run in an immediate mode, without punting to the worker queue if the operation has to wait.

But the comment

if you can queue it up for normal async execution, then you'd still do that. The point of the flag is just to do those opcodes that would otherwise always punt to io-wq inline, instead of punting them to io-wq.

left me wondering what was meant. The man pages generally don't talk about which operations can be queued up for normal async execution, do they? When this feature lands, will it be easy to describe which operations it is useful for?

I think here, as a first use case, the emphasis has been on how to make unlink more efficient when many unlinks can be batched. What other operations or operation types might one consider this new feature for? Answers can wait for the man page(s). Maybe how the new feature affects things like linked operations will also be clear then.

Thank you. -- Interested Bystander

FrankReh avatar Mar 26 '23 14:03 FrankReh

@axboe so are you suggesting we have the same flag in both setup and enter? If so, that seems pretty reasonable to me.

@FrankReh I like the idea of documenting which ops are pollable, that'd be helpful.

SUPERCILEX avatar Mar 26 '23 17:03 SUPERCILEX

I don't think we need to document which ops are pollable, because it's mostly just a file thing. If the file is pollable, then we can poll for it. It's that simple. So we roughly have these classes:

  1. Ops that can either poll or not, fully dependent on what file is being used. These are things like READ/WRITE.
  2. Ops that always access pollable files, like SEND/RECV.
  3. Ops that only access non-pollable files, or where IO is not necessarily being done deliberately. Some of these are fine async, some of them are not. It's this last category that would be helped by NO_IOWQ. Mostly fs operations, examples are (of course) unlink, fadvise, madvise, etc.

There's also the question of whether this NO_IOWQ flag would impact only the category 3 operations, or whether it applies generally. That is more tricky than it might seem, unless we still retain nonblocking issue and go blocking if we otherwise would've punted to io-wq. That would be a bit wasteful, but should always work. We can't just do a blocking issue to begin with. Think of file types like sockets or pipes, where we could be stalled forever. Do we even want to include those in NO_IOWQ?

The patch is conceptually simple, but that's because it's just a bit of a hack to deal with requests that complete in guaranteed bounded time. Unlink will always complete. It may take a bit since it may need to do IO, but it will complete. That's not the case for every request, or for every file type a request may operate on.

axboe avatar Mar 26 '23 21:03 axboe