liburing

io_uring is slower than epoll

Open ghost opened this issue 5 years ago • 155 comments

EDIT: I have made available a detailed benchmark against epoll that shows this in a reliable way: https://github.com/alexhultman/io_uring_epoll_benchmark


Don't get me wrong: since I heard about io_uring I've been all about trying to scientifically reproduce any claim that it improves performance (theoretically it should, since you can have fewer syscalls, and I like that idea). I would love to add support for io_uring in my projects, but I won't touch it until someone can scientifically show me proof of it outperforming epoll (significantly).

I've tested three different examples claiming to be faster than epoll, tested on Linux 5.7 without Spectre mitigations, tested on Linux 5.8 with Spectre mitigations. Tested Clear Linux, tested Fedora. All tests point towards epoll being faster for my kind of test.

What I have is a simple TCP echo server echoing small chunks of data for 40-400 clients. In all tests epoll is performing better by a measurable amount. In no single test have I seen io_uring beat epoll in this kind of test.

Some io_uring examples are horribly slow, while others are quite close to epoll performance. The closest I have seen it get is this one: https://github.com/frevib/io_uring-echo-server compared to this one: https://github.com/frevib/epoll-echo-server

I have tested on two separate and different machines, both with the same outcome: epoll wins.

Can someone enlighten me on how I can get io_uring to outperform epoll in this case?

ghost avatar Aug 30 '20 10:08 ghost

https://github.com/frevib/io_uring-echo-server/issues/8

ghost avatar Aug 30 '20 14:08 ghost

That's definitely a great thing to do. I don't know whether anybody has time for that at the moment, though. io_uring-echo-server looks much bulkier than the last time I saw it.

isilence avatar Sep 02 '20 15:09 isilence

Maybe your benchmark is full of errors

InternalHigh avatar Sep 13 '20 11:09 InternalHigh

Maybe your benchmark is full of errors

It's not my benchmark and the benchmark I reference is the one @axboe himself has referenced on Twitter, showing that he interprets it as true. That's why I would like @axboe himself to write a benchmark without issues so that we can actually prove this thing works better.

ghost avatar Sep 13 '20 11:09 ghost

For what it’s worth there is a whole lot more to it than trivial one to one comparison.

  • you save many syscalls that would be required to modify the epoll kernel state (adding, removing monitored FDs, updating per FD state etc). This becomes extremely important as the number of managed FDs increases - exponentially so.
  • you can register FDs (eg listener FDs and long lived connections FDs) which means you don't have to incur the cost of the kernel looking up the file struct and checking for access every time.
  • and, obviously, you don’t get to monitor just network sockets. You can do so much more.
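To make the first bullet concrete (this sketch is not from the thread): with epoll, every add or re-arm of a file descriptor is its own epoll_ctl call. The Python snippet below, assuming Linux and using pipes as stand-ins for sockets, counts those calls for a loop that uses EPOLLONESHOT and re-arms each descriptor after every event. The function name and the one-shot policy are illustrative assumptions, not something any commenter posted.

```python
import os
import select


def count_epoll_ctl_calls(num_pipes: int, rounds: int) -> int:
    """Count epoll_ctl-level calls (register/modify) in a one-shot epoll loop.

    Hypothetical helper for illustration; Linux only (select.epoll).
    """
    ep = select.epoll()
    pipes = [os.pipe() for _ in range(num_pipes)]
    ctl_calls = 0
    flags = select.EPOLLIN | select.EPOLLONESHOT
    try:
        # One EPOLL_CTL_ADD per monitored descriptor.
        for read_fd, _write_fd in pipes:
            ep.register(read_fd, flags)
            ctl_calls += 1

        for _ in range(rounds):
            # Make every pipe readable.
            for _read_fd, write_fd in pipes:
                os.write(write_fd, b"x")

            handled = 0
            while handled < num_pipes:
                for fd, _events in ep.poll(timeout=1.0):
                    os.read(fd, 1)
                    # One-shot mode disables the fd after each event,
                    # so re-arming costs one EPOLL_CTL_MOD per event.
                    ep.modify(fd, flags)
                    ctl_calls += 1
                    handled += 1
    finally:
        ep.close()
        for read_fd, write_fd in pipes:
            os.close(read_fd)
            os.close(write_fd)
    return ctl_calls


if __name__ == "__main__":
    print(count_epoll_ctl_calls(num_pipes=4, rounds=3))
```

The count grows as num_pipes * (rounds + 1) in this setup. A level-triggered loop that never re-arms would avoid the per-event calls, so how much io_uring saves here depends on how the epoll side is written.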

@markpapadakis



markpapadakis avatar Sep 13 '20 12:09 markpapadakis

@markpapadakis Have you taken a look at the benchmarks that are quoted above? They use many connections at once, thus many FDs at once. uring still consistently performs worse, in conditions that mimic production usage well enough without actually being production.

Part of the scientific process is to be able to reproduce claims like those being made (60%+ increase in performance over epoll, apparently that was 99% at one point but bugs were found). These are metrics that @axboe has not refuted and has seemingly even confirmed, especially through promoting it on twitter and urging others to buy in (e.g. Netty).

Even if this were a 5% increase, I'd be for it. However, I simply do not understand where these results are coming from, and everyone on twitter seems to be more interested in patting themselves on the back rather than addressing criticisms of things (sorry if that sounds harsh).

Outrageous claims require outrageous evidence, especially when it comes to claims that could completely renovate the space.

Qix- avatar Sep 30 '20 02:09 Qix-

@markpapadakis We all know the theory. It is not that hard to understand, quite basic actually.

But theory means nothing if actual reality mismatches with theorized conclusions (any scientist ever).

I would be happy and excited if io_uring could improve performance, but so far no benchmark can show this.

All I'm asking for is scientific proof. I am a scientist, not a believer.

ghost avatar Sep 30 '20 06:09 ghost

I think it is time to take this to the Linux kernel mailing list since @axboe has ignored this criticism entirely.

ghost avatar Sep 30 '20 06:09 ghost

@alexhultman Check netty/netty#10622 - I won't have any bandwidth for the next day or two, but maybe that will pique your interest.

Qix- avatar Sep 30 '20 11:09 Qix-

I think that @axboe knows that uring is slower. Otherwise he would answer.

InternalHigh avatar Sep 30 '20 20:09 InternalHigh

I have not ignored any of this, but I've been way too busy on other items. And frankly, the way the tone has shifted in here, it's not really providing much impetus to engage. I did run the echo server benchmarks back then, and it did reproduce for me. Haven't looked at it since, it's not like I daily run echo server benchmarks.

So in short, I'm of course interested in cases where io_uring isn't performing to its full potential, it most certainly should. I have a feeling that #215 might explain some of these. Once I get myself out from under the pressure I'm at now, I'll re-run the benchmarks myself.

axboe avatar Sep 30 '20 20:09 axboe

I can testify that io_uring is much faster than epoll. Please use kernel 5.7.15 for your benchmarks. This server https://github.com/romange/gaia/tree/master/examples/pingserver reaches 3M qps on a single instance for redis-benchmark (ping_inline API) on c5n ec2 instances.

romange avatar Sep 30 '20 21:09 romange

I have made available a new, detailed benchmark that shows io_uring is reliably slower than epoll:

https://github.com/alexhultman/io_uring_epoll_benchmark

I can testify that io_uring is much faster than epoll. Please use kernel 5.7.15 for your benchmarks. This server https://github.com/romange/gaia/tree/master/examples/pingserver reaches 3M qps on a single instance for redis-benchmark (ping_inline API) on c5n ec2 instances.

And where is your 1-to-1 comparison with epoll? Or you just go by feeling of "high numbers"? See my benchmark for a 1-to-1 comparison.

ghost avatar Dec 07 '20 11:12 ghost

For what it’s worth there is a whole lot more to it than trivial one to one comparison.

  • you save many syscalls that would be required to modify the epoll kernel state (adding, removing monitored FDs, updating per FD state etc). This becomes extremely important as the number of managed FDs increases - exponentially so.
  • you can register FDs (eg listener FDs and long lived connections FDs) which means you don't have to incur the cost of the kernel looking up the file struct and checking for access every time.
  • and, obviously, you don’t get to monitor just network sockets. You can do so much more.

And where is your benchmark for this claim? I know the theory very well, but you're just assuming theory is correct here because it must be. See my posted benchmark - it shows the complete opposite of what you claim.

The io_uring benchmark I have posted performs ZERO syscalls and does ZERO copies, yet epoll wins reliably despite doing millions of syscalls and performing copies in every syscall.

ghost avatar Dec 07 '20 11:12 ghost

@axboe

So in short, I'm of course interested in cases where io_uring isn't performing to its full potential, it most certainly should.

Would you look at my new benchmark? I have eliminated everything but epoll and io_uring, and on both my machines epoll wins despite io_uring being SQ-polled with 0 syscalls and using pre-registered file descriptors and buffers. I'm not involving any networking at all.

strace shows the epoll case making millions of syscalls, while the io_uring case is entirely silent in the syscalling department.
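As a rough illustration of why the epoll side shows up so heavily in strace (this is a sketch, not the code from the linked benchmark): a readiness loop over pipes pays one read and one write per message, plus an epoll_wait per batch. The Python version below, assuming Linux, a 32-byte payload and level-triggered mode, tallies those calls; all names and parameters are hypothetical.

```python
import os
import select
import time


def run_epoll_pipe_loop(num_pipes: int, rounds: int):
    """Echo-style loop over pipes; returns (syscall tallies, elapsed seconds).

    Illustrative only; Linux only (select.epoll).
    """
    ep = select.epoll()
    pipes = [os.pipe() for _ in range(num_pipes)]
    counts = {"epoll_wait": 0, "read": 0, "write": 0}
    payload = b"x" * 32
    try:
        # Level-triggered registration, done once up front.
        for read_fd, _write_fd in pipes:
            ep.register(read_fd, select.EPOLLIN)

        start = time.perf_counter()
        for _ in range(rounds):
            # One write syscall per pipe per round.
            for _read_fd, write_fd in pipes:
                os.write(write_fd, payload)
                counts["write"] += 1

            pending = num_pipes
            while pending:
                events = ep.poll()
                counts["epoll_wait"] += 1
                # One read syscall per ready pipe.
                for fd, _mask in events:
                    os.read(fd, len(payload))
                    counts["read"] += 1
                    pending -= 1
        elapsed = time.perf_counter() - start
    finally:
        ep.close()
        for read_fd, write_fd in pipes:
            os.close(read_fd)
            os.close(write_fd)
    return counts, elapsed


if __name__ == "__main__":
    tallies, seconds = run_epoll_pipe_loop(num_pipes=100, rounds=1000)
    print(tallies, f"{seconds:.3f}s")
```

With 1000 pipes and a few thousand rounds the read and write tallies alone reach the millions, which matches the kind of strace output described above. The timing from a Python loop is dominated by interpreter overhead and says nothing about epoll versus io_uring.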

What am I doing wrong / why is io_uring not performing?

ghost avatar Dec 07 '20 11:12 ghost

@alexhultman Check netty/netty#10622 - I won't have any bandwidth for the next day or two, but maybe that will pique your interest.

The fact that a virtualized garbage collected JIT-stuttery Java project can see performance improvements when swapping from epoll to io_uring is not a viable proof of io_uring itself (as a kernel feature) being more performant than epoll itself (as a kernel feature). It really just proves that writing systems in non-systems programming languages is going to get you poor results.

io_uring does more things in the kernel, meaning a swap from epoll to io_uring leads to fewer things happening in Java. As a general rule of thumb: the less you do in high level garbage collected virtualized code, the better.

ghost avatar Dec 07 '20 11:12 ghost

@alexhultman which kernel version did you use?

romange avatar Dec 07 '20 12:12 romange

It is clearly stated in the posted text. 5.9.9

ghost avatar Dec 07 '20 12:12 ghost

@alexhultman By running your tests locally with a 5.10-rc5 version it seems I'm seeing io_uring behave better, or am I reading it wrong?:

$  ./epoll 1000
Pipes: 1000
Time: 16.059609

$ sudo ./io_uring 1000
Pipes: 1000
Time: 10.984051

$ ./epoll 1500
Pipes: 1500
Time: 24.501726

$ sudo ./io_uring 1500
Pipes: 1500
Time: 18.112729

$ ./epoll 2000
Pipes: 2000
Time: 37.705230

$ sudo ./io_uring 2000
Pipes: 2000
Time: 26.174995

santigimeno avatar Dec 07 '20 12:12 santigimeno

First of all, there is a bug that hurts io_uring performance. Its fix will be merged into 5.11.

Secondly, I've looked at your benchmark code and you test something that is not necessarily relevant or optimal for the networking use case.

  1. You test IORING_SETUP_SQPOLL mode - I did not manage to get any performance gain there with sockets. In fact it was consistently worse than using non-polling mode.

  2. You test a single epoll/io_uring loop, which does not trigger contention edge cases inside the kernel. When you have N cores running N epoll loops doing reads/writes via sockets and you put 100% load on your machine, you will see how io_uring performs better.

Finally, I've never tried using pipes in my tests. io_uring essentially delegates requests to their corresponding APIs. So if, for example, the pipes kernel code takes most of the CPU, you may not see much difference between io_uring and epoll.

romange avatar Dec 07 '20 12:12 romange

Also, just to add (note that I was a skeptic above as well): Hypervisors tend to level out epoll vs io_uring performance benchmarks in my findings - if you're within a VM, expect io_uring to have less than ideal performance increases against epoll. Testing on bare metal seems to make quite a bit of difference depending on how you're using io_uring.

First of all, there is a bug that hurts io_uring performance. Its fix will be merged into 5.11.

@romange Is that already on the io_uring branch of mainline? Just curious.

Qix- avatar Dec 07 '20 13:12 Qix-

Here are the results for @alexhultman 's benchmark on my machine (MyTuxedo laptop, Intel i7-7700HQ (8) @ 3.800GHz, 32GB RAM, Ubuntu 20.10 x86_64, Kernel: 5.10.0-051000rc6-generic)

 make runs
rm -f epoll_runs
rm -f io_uring_runs
for i in `seq 100 100 1000`; do ./io_uring $i; done
Pipes: 100
Time: 0.908056
Pipes: 200
Time: 2.063551
Pipes: 300
Time: 3.183146
Pipes: 400
Time: 4.810344
Pipes: 500
Time: 5.609743
Pipes: 600
Time: 8.197645
Pipes: 700
Time: 10.275732
Pipes: 800
Time: 11.889881
Pipes: 900
Time: 15.030963
Pipes: 1000
Time: 15.421023

for i in `seq 100 100 1000`; do ./epoll $i; done
Pipes: 100
Time: 1.575792
Pipes: 200
Time: 3.173769
Pipes: 300
Time: 5.173567
Pipes: 400
Time: 7.255583
Pipes: 500
Time: 10.283918
Pipes: 600
Time: 12.986523
Pipes: 700
Time: 14.560208
Pipes: 800
Time: 17.426127
Pipes: 900
Time: 19.796715
Pipes: 1000
Time: 23.262279

io_uring gives better results!
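For a quick read of the numbers above, the per-size ratio of epoll time to io_uring time can be computed directly. The snippet below is only arithmetic on the timings pasted in this comment; the dictionary and function names are made up for the illustration, and a single run per size on one laptop is not a statistically sound comparison.

```python
# Timings in seconds, copied from the run above (kernel 5.10.0-rc6).
io_uring_times = {
    100: 0.908056, 200: 2.063551, 300: 3.183146, 400: 4.810344,
    500: 5.609743, 600: 8.197645, 700: 10.275732, 800: 11.889881,
    900: 15.030963, 1000: 15.421023,
}
epoll_times = {
    100: 1.575792, 200: 3.173769, 300: 5.173567, 400: 7.255583,
    500: 10.283918, 600: 12.986523, 700: 14.560208, 800: 17.426127,
    900: 19.796715, 1000: 23.262279,
}


def speedups(epoll: dict, uring: dict) -> dict:
    """Return epoll_time / io_uring_time for each pipe count (>1 means io_uring was faster)."""
    return {pipes: epoll[pipes] / uring[pipes] for pipes in sorted(epoll)}


ratios = speedups(epoll_times, io_uring_times)
for pipes, ratio in ratios.items():
    print(f"{pipes:5d} pipes: epoll/io_uring = {ratio:.2f}x")
```

Every ratio in this run comes out above 1, which is consistent with the conclusion stated in the comment.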

martin-g avatar Dec 07 '20 13:12 martin-g

Also, just to add (note that I was a skeptic above as well): Hypervisors tend to level out epoll vs io_uring performance benchmarks in my findings - if you're within a VM, expect io_uring to have less than ideal performance increases against epoll. Testing on bare metal seems to make quite a bit of difference depending on how you're using io_uring.

First of all, there is a bug that hurts io_uring performance. Its fix will be merged into 5.11.

@romange Is that already on the io_uring branch of mainline? Just curious.

I do not think it's on the io_uring branch because the fix does not reside in io_uring code. Here is the relevant article: https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.11-Task-Work-Opt

romange avatar Dec 07 '20 13:12 romange

$ make
gcc -O3 epoll.c -o epoll
gcc -O3 io_uring.c /usr/lib/liburing.a -o io_uring
gcc: error: /usr/lib/liburing.a: No such file or directory
make: *** [Makefile:3: default] Error 1

YoSTEALTH avatar Dec 07 '20 13:12 YoSTEALTH

$ make
gcc -O3 epoll.c -o epoll
gcc -O3 io_uring.c /usr/lib/liburing.a -o io_uring
gcc: error: /usr/lib/liburing.a: No such file or directory
make: *** [Makefile:3: default] Error 1
git submodule update --init --recursive 
cd liburing/
./configure && make
sudo make install

romange avatar Dec 07 '20 13:12 romange

@martin-g @santigimeno That's very interesting - thanks for reporting! I will do some more testing on newer kernels and see if I can finally get to see this supposed io_uring wonder myself.

ghost avatar Dec 07 '20 15:12 ghost

@romange You do a lot of confident talking but you still refuse to follow up with any actual testing of your claims.

  1. I have tested all modes. The mode without SQ polling had insignificant differences in performance, and because it caused syscalls to appear, I wanted to use SQ polling, since that is what all the fuss is about regarding io_uring.

  2. You haven't tested this, you just assume. I did an actual test of this and got results significantly opposing your confident assumption.

Please do testing before you make up assumptions about everything. This entire thread is about this exact behavior - show with actual numbers like @santigimeno and @martin-g did.

ghost avatar Dec 07 '20 15:12 ghost

@alexhultman Please tone down the snark. FWIW, @romange has done plenty of testing in the past, and was instrumental in finding the task work related signal slowdown for threads. Questioning the validity of a test is perfectly valid, in fact it's the very first thing that should be done before potentially wasting any time on examining results from said test. Nobody is in a position to be making any demands in here, in my experience you get a lot further by being welcoming and courteous.

axboe avatar Dec 07 '20 16:12 axboe

When you have N cores running N epoll loops doing reads/writes via sockets and you put 100% load on your machine, you will see how io_uring performs better.

The above is not a question, it is a confident statement without any backing proof other than "it will be the case". This is the issue here: blindly making claims without any backing proof other than "listen to my assumption".

ghost avatar Dec 07 '20 16:12 ghost

Nobody is in a position to be making any demands in here

Nobody is demanding anything here.

ghost avatar Dec 07 '20 16:12 ghost