
Question: IORING_ENTER_GETEVENTS vs dumb busy-wait major performance diff

Open mitchellh opened this issue 2 years ago • 5 comments

Hello! This could be a dumb question but I wanted to know if this performance difference is expected.

My program is written in Zig (the stdlib io_uring library is based on liburing), so the issue could be there. If this performance difference isn't expected I'll try to reproduce it using C...

Big picture: I'm running one process with two threads (server and client) that just open a TCP socket, then send a "PING" and respond with a "PONG" for 100,000 pings. Each thread has its own io_uring. Besides the socket setup and teardown, they are just queueing send and recv repeatedly.

What I'm seeing:

  1. When I busy-loop io_uring_submit and io_uring_peek_batch_cqe, my ping/pong server gets ~100k req/s.
  • io_uring_enter syscall count: 4,057,711 (measured with perf)
  • CPU: 100%
  2. When I do the above but add a while (io_uring_cq_ready() == 0) {} busy loop before the peek, I still get ~100k req/s, but my syscall count drops sharply since I'm mostly busy-waiting on the userspace CQ buffer.
  • io_uring_enter syscall count: 200,008
  • CPU: 100%
  3. When I switch to io_uring_submit_and_wait(1) and io_uring_peek_batch_cqe (no cq_ready loop), I drop to ~20k req/s but keep the low syscall count.
  • io_uring_enter syscall count: 200,008
  • CPU: ~30%

Kernel version: 6.1.2 aarch64

I guess my surprise is that in scenario 1 I make 20x more syscalls and yet my performance stays pretty much the same, while switching to submit_and_wait makes it much worse. I would've thought submit_and_wait would've performed much better than it does.

Of course, scenarios 1 and 2 both use 100% CPU while scenario 3 uses significantly less (30% or so). But, still, I would've thought the submit_and_wait syscall would return much more quickly. That, paired with 20x fewer syscalls, I thought would add up to some noticeable difference.

(I am submitting roughly 200,000 SQEs, so 200,008 is probably as close to the ideal syscall count as I can get w/o SQ polling.)

I feel like I'm probably just fundamentally misunderstanding something, or my mental model of the syscall's performance characteristics is way off. But I wanted to double-check, since right now the choice seems to be either 100% CPU and 5x the performance, or ~30% CPU and 1/5th the performance, and I'm not sure if that's expected.

Thank you! io_uring is awesome.

mitchellh avatar Jan 10 '23 06:01 mitchellh

It's difficult to know for sure, but this is likely the overhead of the "wait" part: a lot of scheduler and context-switching overhead, since the thread has to go to sleep and wake up for each event.

If you're curious, you could use perf to see where the cycles are spent.

DylanZA avatar Jan 10 '23 08:01 DylanZA

I'm quite an amateur with perf, but I can see that the version using submit_and_wait context-switches exactly 200,008 more times. Looking at a call graph, it's hard to tell where the time is being spent otherwise.

Recording sched:sched_switch events for the slow version I can see it is mostly switching into the swapper task. The fast versions have almost zero context switches so that recording there is not useful.

So perhaps this is just context-switch overhead along with kernel timer overhead? It's still surprisingly high to me... but if that's the case, I just need to recalibrate my intuitions.

# Children      Self       Samples  Trace output                                                                                                              
# ........  ........  ............  ..........................................................................................................................
#
    11.07%    11.07%        109842  prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120
     9.12%     9.12%         90483  prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120
     9.08%     9.08%         90074  prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
     8.89%     8.89%         88241  prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
     8.66%     8.66%         85909  prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
     8.59%     8.59%         85196  prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=ping-pongs next_pid=205677 next_prio=120
     8.50%     8.50%         84298  prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
     8.48%     8.48%         84108  prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=ping-pongs next_pid=205679 next_prio=120
     7.60%     7.60%         75371  prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
     7.54%     7.54%         74844  prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
     4.93%     4.93%         48868  prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
     4.78%     4.78%         47402  prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
     0.83%     0.83%          8202  prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=S ==> next_comm=.perf-wrapped next_pid=205676 next_prio=120
     0.77%     0.77%          7647  prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=S ==> next_comm=.perf-wrapped next_pid=205676 next_prio=120
     0.42%     0.42%          4159  prev_comm=ping-pongs prev_pid=205677 prev_prio=120 prev_state=R+ ==> next_comm=ping-pongs next_pid=205679 next_prio=120
     0.32%     0.32%          3131  prev_comm=ping-pongs prev_pid=205679 prev_prio=120 prev_state=R+ ==> next_comm=ping-pongs next_pid=205677 next_prio=120

... snipped ...

mitchellh avatar Jan 10 '23 16:01 mitchellh

Yeah, I'm pretty confident you're right that this is just timer and context-switch overhead from the wait. If I increase the number of threads beyond the CPU count, then submit_and_wait performs better than the busy loop (probably because more threads get switched to and the server thread can process more work per batch). If the number of threads is less than the CPU count (and there isn't other critical work to run on the system), then busy-waiting gives much, much better performance, but at the cost of CPU.

mitchellh avatar Jan 10 '23 17:01 mitchellh

I'm not surprised that the number of context switches increased, it was probably burning CPU busy waiting in userspace prior to that and now it spends this time doing side jobs. I'm a bit surprised by that 100K -> 20K drop though.

Can you attach the code for 1. and 3.? Would be also interesting to see profiles for those two.

perf record -- ./your_program
perf report > report.txt

isilence avatar Jan 11 '23 18:01 isilence

@mitchellh, any progress on this? Is there a reproducer we can try?

isilence avatar Apr 11 '23 12:04 isilence