Question: IORING_ENTER_GETEVENTS vs dumb busy-wait major performance diff
Hello! This could be a dumb question but I wanted to know if this performance difference is expected.
My program is written in Zig (the stdlib io_uring library is based on liburing), so the issue could be there. If this performance difference isn't expected I'll try to reproduce it using C...
Big picture: I'm running one process with two threads (server and client) that open a TCP socket and send a "PING" and respond with a "PONG" for 100,000 pings. Each thread has its own io_uring. Besides the socket setup and teardown, they are just queueing `send` and `recv` repeatedly.
What I'm seeing:
- When I busy loop `io_uring_submit` and `io_uring_peek_batch_cqe`, my ping/pong server is getting ~100k req/s.
  - `io_uring_enter` syscall count: `4_057_711` (measured with `perf`)
  - CPU: 100%
- When I do the above, but add a `while (io_uring_cq_ready() == 0) {}` busy loop before the peek, I still get ~100k req/s, but my syscall count drops a bunch since I'm mostly busy waiting on the userspace CQ buffer.
  - `io_uring_enter` syscall count: `200_008`
  - CPU: 100%
- When I change to using `io_uring_submit_and_wait(1)` and `io_uring_peek_batch_cqe` (no cq_ready loop), I drop to ~20k req/s, but keep the low syscall count.
  - `io_uring_enter` syscall count: `200_008`
  - CPU: ~30%
Kernel version: 6.1.2 aarch64
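For reference, here is roughly what each variant's completion loop looks like expressed against the liburing C API (my actual program is Zig, so treat this as an approximate sketch rather than the real code; ring setup, the send/recv SQE prep, and error handling are omitted):

```c
#include <liburing.h>

#define BATCH 16

/* Variant 1: submit, peek, and just loop again if nothing is ready yet. */
static unsigned reap_spin_on_enter(struct io_uring *ring, struct io_uring_cqe **cqes)
{
    unsigned got = 0;
    while (got == 0) {
        io_uring_submit(ring);                    /* flush any queued SQEs */
        got = io_uring_peek_batch_cqe(ring, cqes, BATCH);
    }
    return got;
}

/* Variant 2: submit once, then spin on the userspace CQ ring before peeking. */
static unsigned reap_spin_on_cq(struct io_uring *ring, struct io_uring_cqe **cqes)
{
    io_uring_submit(ring);
    while (io_uring_cq_ready(ring) == 0)
        ;                                         /* burns CPU, but no syscalls */
    return io_uring_peek_batch_cqe(ring, cqes, BATCH);
}

/* Variant 3: submit and block in the kernel until at least one CQE arrives
 * (io_uring_enter() with IORING_ENTER_GETEVENTS under the hood). */
static unsigned reap_wait(struct io_uring *ring, struct io_uring_cqe **cqes)
{
    io_uring_submit_and_wait(ring, 1);
    return io_uring_peek_batch_cqe(ring, cqes, BATCH);
}
```

In every variant the caller processes the returned CQEs and then marks them consumed with `io_uring_cq_advance(ring, got)`.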
I guess my surprise is that in scenario 1 above I make 20x more syscalls than when I use `submit_and_wait`, and yet my performance stays pretty much the same. I would've thought `submit_and_wait` would perform much better than it does.
Of course, scenarios 1 and 2 above both use 100% CPU while scenario 3 uses significantly less (~30%). But still, I would've thought the `submit_and_wait` syscall would be triggered much more quickly, and that, paired with 20x fewer syscalls, would add up to a noticeable difference.
(I am submitting roughly 200,000 SQEs, so `200_008` is probably about as close to the ideal syscall count as I can get without SQ polling.)
I feel like I'm probably just fundamentally misunderstanding something, or my understanding of the syscall's performance characteristics is way off, but I wanted to double-check, since right now it seems like the choice is either 100% CPU at 5x the throughput, or ~30% CPU at 1/5th the throughput, and I'm not sure if that's expected.
Thank you! `io_uring` is awesome.
it's difficult to know for sure, but likely this is the overhead of the "wait" part: a lot of scheduler and context-switch overhead, since the task has to go to sleep and wake up for each event.
if you are curious you could use perf to see where the cycles are spent
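for example, something along these lines (exact events and output vary with the setup, and `./ping-pongs` here just stands in for the actual binary):

```
perf stat -e context-switches -- ./ping-pongs    # how often the threads are scheduled out
perf record -g -- ./ping-pongs                   # cycle profile with call graphs
perf report
```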
I'm quite an amateur at `perf`, but I can see that the version using `submit_and_wait` context switches exactly 200,008 more times. Looking at a call graph, it's hard to tell where the time is being spent otherwise.
Recording `sched:sched_switch` events for the slow version, I can see it is mostly switching to the `swapper` task. The fast versions have almost zero context switches, so that recording isn't useful there.
So perhaps this is just context-switch overhead along with kernel timer overhead? It's still surprisingly high to me... but if that's the case, I just need to recalibrate my intuitions.
# Children Self Samples Trace output
# ........ ........ ............ ..........................................................................................................................
#
11.07% 11.07% 109842 prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120
9.12% 9.12% 90483 prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120
9.08% 9.08% 90074 prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
8.89% 8.89% 88241 prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
8.66% 8.66% 85909 prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
8.59% 8.59% 85196 prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=ping-pongs next_pid=205677 next_prio=120
8.50% 8.50% 84298 prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
8.48% 8.48% 84108 prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=ping-pongs next_pid=205679 next_prio=120
7.60% 7.60% 75371 prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
7.54% 7.54% 74844 prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
4.93% 4.93% 48868 prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
4.78% 4.78% 47402 prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
0.83% 0.83% 8202 prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=.perf-wrapped next_pid=205676 next_prio=120
0.77% 0.77% 7647 prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=.perf-wrapped next_pid=205676 next_prio=120
0.42% 0.42% 4159 prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=R+ ==> next_comm=ping-pongs next_pid=205679 next_prio=120
0.32% 0.32% 3131 prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=R+ ==> next_comm=ping-pongs next_pid=205677 next_prio=120
... snipped ...
Yeah, I'm pretty confident you're right that this is just timer and context-switch overhead from the wait. If I increase the number of threads beyond the CPU count, then `submit_and_wait` performs better than the busy loop (probably because more threads get switched in and the server thread can process more work per batch). If the number of threads is less than the CPU count (and there isn't other critical work that needs to run on the system), then busy waiting gives much, much better performance, but at the cost of CPU.
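A middle ground between the two extremes would be to spin on the CQ for a bounded number of iterations and only fall back to the blocking wait if nothing shows up. A rough liburing C sketch (untested, and not what my program currently does; `SPIN_LIMIT` is an arbitrary knob):

```c
#include <liburing.h>

#define SPIN_LIMIT 10000   /* arbitrary; tune for latency vs. CPU burn */

static int wait_cqe_hybrid(struct io_uring *ring, struct io_uring_cqe **cqe)
{
    io_uring_submit(ring);

    /* Spin briefly on the userspace CQ ring: no syscalls while spinning. */
    for (unsigned spins = 0; spins < SPIN_LIMIT; spins++) {
        if (io_uring_cq_ready(ring) > 0)
            return io_uring_peek_cqe(ring, cqe);
    }

    /* Nothing showed up while spinning: block in the kernel instead. */
    return io_uring_wait_cqe(ring, cqe);
}
```

After processing, the CQE is still marked consumed with `io_uring_cqe_seen(ring, *cqe)` as usual.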
I'm not surprised that the number of context switches increased; it was probably burning CPU busy waiting in userspace before, and now it spends that time doing side jobs. I'm a bit surprised by the 100K -> 20K drop, though.
Can you attach the code for 1. and 3.? It would also be interesting to see profiles for those two:
perf record -- ./your_program
perf report > report.txt
@mitchellh, Any progress on this? Is there a reproducer we can try?