
Steep decline in send performance over increasing sizes

uNetworkingAB opened this issue 2 years ago • 27 comments

This is just a behavior I have noticed and can't get past.

For ping/pong between two TCP processes below 4kb, I get significantly better perf with io_uring - from 30% better for tiny sends (64 bytes) down to 8% better at the 4kb send. It gradually loses competitiveness as the size increases.

At 8kb I'm equal to or below epoll. I can't explain this; my send buffers are densely packed on a huge page, so everything is tight in memory. I can't get it to perform well for larger sends, and zc send cuts the numbers in half (bad).

At 16kb I get 43% better perf with epoll; for me io_uring just goes completely off a cliff.

I have confirmed that in all cases, I do get the full 16kb chunk received in one go, on both ends. So both epoll and io_uring are just handing entire chunks back and forth in the same manner.

Are we expecting any optimizations here? It doesn't matter if I register the send buffers or not.

uNetworkingAB avatar May 14 '23 06:05 uNetworkingAB

Okay I kind of understand the problem now:

The benefit of send syscall is that you can have your message in some shared buffer and copy it from there into kernel space very efficiently.

But since prep_send is async, you can't really reuse the same shared buffer; you need to hold many buffers at different addresses, even if they contain the same data.

Simply by providing different addresses to prep_send instead of the same address, throughput goes from 212k messages per second down to 150k, while epoll is at 197k.

So epoll is fast because you build your message in one spot and give that one spot to the kernel via copies, while prep_send needs one address per socket, which tanks performance in this case.

Knowing in advance whether you are going to send the same content to all sockets is not really possible, so the send syscall has a big benefit here unless there is some way to make multiple addresses faster. Registering the whole big send buffer (all of them) maybe makes it slightly faster, but it's still significantly below the send syscall copying from the same address.
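To make that concrete, here is a minimal sketch of the two patterns being compared (build_response and the buffer names are made up for illustration, not actual uSockets code):

#include <liburing.h>
#include <sys/socket.h>

/* Illustrative only: build_response() and the buffer layout are assumptions. */
extern size_t build_response(char *dst, size_t cap, int conn_idx);

/* epoll-style hot path: one shared staging buffer, reused for every socket;
 * send() copies the data into the kernel right away. */
static void send_all_epoll_style(int *fds, int nsockets)
{
    static char shared[16384];
    for (int i = 0; i < nsockets; i++) {
        size_t len = build_response(shared, sizeof shared, i);
        send(fds[i], shared, len, MSG_DONTWAIT);
    }
}

/* io_uring-style: each buffer must stay untouched until its CQE arrives,
 * so every socket gets its own copy at a different address. */
static void send_all_uring_style(struct io_uring *ring, int *fds, int nsockets,
                                 char (*bufs)[16384])
{
    for (int i = 0; i < nsockets; i++) {
        size_t len = build_response(bufs[i], 16384, i);
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_send(sqe, fds[i], bufs[i], len, 0);
        io_uring_sqe_set_data(sqe, bufs[i]); /* so the buffer can be reclaimed on completion */
    }
    io_uring_submit(ring);
}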

uNetworkingAB avatar May 14 '23 09:05 uNetworkingAB

Yep. Confirmed on a different computer with a whole different distro. Passing the same address to prep_send is way faster than passing one address per socket; it makes a big difference, and I make sure to pre-fault all involved buffers by filling them with constant data.

Registering buffers and prep_send_zc_fixed and all those advanced features just make it slower. The fastest by far is just providing the very same address to all prep_send calls, exactly as with the send syscall. But you can't really do this with prep_send, as mentioned above, so this limits performance for this case of 16kb sends.
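For reference, the registered-buffer zero-copy path being referred to looks roughly like this (one 16kb buffer per connection; the layout and sizes are illustrative, not how uSockets actually does it):

#include <liburing.h>
#include <sys/uio.h>

/* Register nbufs contiguous 16kb slots so buf_index can select one later. */
static int setup_fixed_send(struct io_uring *ring, char *base, int nbufs)
{
    struct iovec iov[nbufs];
    for (int i = 0; i < nbufs; i++) {
        iov[i].iov_base = base + (size_t)i * 16384;
        iov[i].iov_len = 16384;
    }
    return io_uring_register_buffers(ring, iov, nbufs);
}

/* Queue a zero-copy send from registered buffer buf_index; note that zc sends
 * post a second CQE (the notification) once the kernel is done with the pages. */
static void queue_zc_fixed_send(struct io_uring *ring, int fd, char *base,
                                int buf_index, size_t len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_send_zc_fixed(sqe, fd, base + (size_t)buf_index * 16384, len,
                                0 /* msg flags */, 0 /* zc flags */, buf_index);
}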

uNetworkingAB avatar May 14 '23 10:05 uNetworkingAB

For 32kb there is almost a 70% performance win for epoll, so io_uring isn't really viable for anything other than tiny sends. I cannot get it to perform for anything above 8kb.

uNetworkingAB avatar May 14 '23 12:05 uNetworkingAB

Can this be because, with the send syscall, my source buffer is most likely already in CPU cache, while with io_uring it most definitely is not? With send I have one single 32kb buffer that is written to, copied from, written to, copied from, whereas with io_uring I have many scattered source buffers that are written to (which definitely will not fit in CPU cache) and then read from (which definitely needs a fetch).

uNetworkingAB avatar May 14 '23 12:05 uNetworkingAB

You are most likely running into TLB pressure. Have you tried putting your IO buffers in a huge page?
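E.g. one common way to get such a buffer is an anonymous MAP_HUGETLB mapping (2MB huge page size assumed here):

#define _GNU_SOURCE
#include <sys/mman.h>

/* Map one 2MB huge page to hold the IO buffers. */
void *buf = mmap(NULL, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (buf == MAP_FAILED) {
    /* no huge pages reserved in the pool, or insufficient permissions */
}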

axboe avatar May 15 '23 01:05 axboe

Yep that was the first thing I did. Everything is aligned and densely packed on a huge page

uNetworkingAB avatar May 15 '23 03:05 uNetworkingAB

I've looked at the zero-copy send benchmark in examples. It sends from one single 2MB huge page source many times - I have the opposite problem: I have 100 connections that receive and send data all the time, so my source addresses change constantly, since I need 100 send buffers in order to submit to io_uring.

I can get a bit better performance by calling io_uring_submit_and_wait(ring, 0) every 16th send and then reusing those buffers (reducing the range of my source buffers), but it's still not near send syscall performance.

I tried mixing the send syscall for sending with io_uring for receiving, but as far as I could see that was not as good as epoll alone. For this I had to disable direct descriptors, though, so maybe I need to go back to that and only use a regular FD for the send syscall.

uNetworkingAB avatar May 15 '23 10:05 uNetworkingAB

Can this be because, with the send syscall, my source buffer is most likely already in CPU cache, while with io_uring it most definitely is not? With send I have one single 32kb buffer that is written to, copied from, written to, copied from, whereas with io_uring I have many scattered source buffers that are written to (which definitely will not fit in CPU cache) and then read from (which definitely needs a fetch).

And to be fair here, it's pretty common that you don't assemble it in place but get already prepared data to send, though the opposite also happens, e.g. if you have in-userspace encryption or any extra copy. And once you've constructed a buffer but cannot send it (i.e. you get -EAGAIN), you're not going to throw it away, so epoll and send NOWAIT will still need some sort of caching. Another note is that you're going to have the same problem with sendmmsg.

That aside, there is nothing in the kernel optimising for single-buffer sends like yours, so it sounds like the problem is indeed in the caches. Can you pin them to separate cores and run perf stat -dddd <program> for one task, for both the io_uring and epoll versions? E.g.

./pingpong --server
perf stat -dddd ./pingpong --client

From the description it also sounds like your numbers are driven by latency rather than throughput. What numbers do you get when you pin the processes 1) both to the same core and 2) each to its own core? And what is the CPU utilisation of the cores you use in both cases?
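For pinning, something like taskset works (core numbers here are arbitrary):

taskset -c 2 ./pingpong --server
taskset -c 2 perf stat -dddd ./pingpong --client    # same core
taskset -c 3 perf stat -dddd ./pingpong --client    # separate core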

upd: I missed that you have 100 connections, it's probably not latency then.

isilence avatar May 15 '23 12:05 isilence

As for zerocopy, it's probably because of extra wake ups to deliver zc notification CQEs, but a slight difference in how it goes through the network stack might affect it as well. I'd need profiles for your case to know, but the former gets better at scale, e.g. when you have many connections or streaming instead of ping pong, though the example doesn't work well with either case. How can we reproduce your results?

I played with zc streaming before but never sent it publicly. The nice thing about zc notifications is that you can easily count in userspace how much is still in flight, i.e. the sum of buffer sizes for all expected but not yet delivered notifications, and send when there is space in the queue. If you know the queue sizes, that replaces the readiness model and is even finer-grained. We're also looking into giving userspace better control over what it is waiting for.
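In code the accounting could look roughly like this (a sketch, names are illustrative):

#include <liburing.h>

/* Bytes of zero-copy sends whose pages the kernel still holds. */
static size_t zc_inflight_bytes;

static void queue_zc_send(struct io_uring *ring, int fd, const void *buf, size_t len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_send_zc(sqe, fd, buf, len, 0, 0);
    io_uring_sqe_set_data64(sqe, len);   /* remember the size for the notification */
    zc_inflight_bytes += len;
}

static void handle_cqe(struct io_uring_cqe *cqe)
{
    if (cqe->flags & IORING_CQE_F_NOTIF)
        zc_inflight_bytes -= io_uring_cqe_get_data64(cqe); /* buffer is reusable now */
    /* else: the send-result CQE; check cqe->res as usual */
}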

isilence avatar May 15 '23 12:05 isilence

How can we reproduce your results?

I can prepare some triggering examples that can be compiled both for epoll and for io_uring.

it's pretty common that you don't assemble it in place but get already prepared data to send, though the opposite also happens, e.g. if you have in-userspace encryption or any extra copy. And once you've constructed a buffer but cannot send it (i.e. you get -EAGAIN), you're not going to throw it away, so epoll and send NOWAIT will still need some sort of caching

The winning design with epoll (as far as I can figure out) is to have one shared buffer where you assemble a response and then just call the send syscall on it, then move on to the next socket, assemble another response in the same buffer, call send again, and so on. This works in 90% of cases as the hot path; EAGAIN almost never happens and is a secondary slow path.

The thing with io_uring vs. epoll is that the above hot path does not work, as you must spread your assembly buffers out, which makes you fall out of CPU cache and drop in perf.

uNetworkingAB avatar May 15 '23 13:05 uNetworkingAB

Send syscall that can take direct descriptors would be killer, I think

uNetworkingAB avatar May 15 '23 13:05 uNetworkingAB

How can we reproduce your results?

I can prepare some triggering examples that can be compiled both for epoll and for io_uring.

Would be great, thanks

isilence avatar May 16 '23 15:05 isilence

Is there some variant of io_uring_prep_send that queues up a SYNC variant of send? Something like io_uring_prep_sync_send that functions like the send syscall: it either succeeds in copying everything to kernel space, or it completes with fewer bytes than specified, or fails entirely with EAGAIN.

If such an op-code would exist, we could use the same hot path without blowing the CPU-cache limit like so:

  1. Hold a limited set of "hot path buffers", let's say 16 buffers of size 32kb. Assuming they all fit in CPU-cache.
  2. Perform 16 io_uring_prep_sync_send calls, each pointing to its respective buffer
  3. Call io_uring_submit (_and_wait?).
  4. Directly reclaim all 16 buffers by walking over the 16 send completions (regardless of whether they failed or not)
  5. Goto 1, doing it again.

This would allow batching, yet limit it to some small number of buffers that fits in CPU cache. IMO it makes no sense to use async send calls for small 16kb sends where you're almost 100% certain to have space available in the kernel socket buffer, and only in the very rare slow path would you care to do something else.
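Sketched against today's API, that loop would look roughly like this (using plain io_uring_prep_send, which only matches the proposed semantics when every send completes inline):

#include <liburing.h>

#define NBUFS 16
#define BUFSZ 32768

/* Batch of 16 hot-path buffers; relies on all sends completing inline
 * (which the proposed io_uring_prep_sync_send would guarantee, and plain
 * io_uring_prep_send does not). */
static void send_batch(struct io_uring *ring, int *fds,
                       char bufs[NBUFS][BUFSZ], size_t *lens)
{
    for (int i = 0; i < NBUFS; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_send(sqe, fds[i], bufs[i], lens[i], 0);
        io_uring_sqe_set_data64(sqe, i); /* which buffer this completion belongs to */
    }
    io_uring_submit_and_wait(ring, NBUFS);

    struct io_uring_cqe *cqe;
    unsigned head, seen = 0;
    io_uring_for_each_cqe(ring, head, cqe) {
        /* cqe->res < 0 (e.g. -EAGAIN) or a short send means this buffer needs
         * the slow path; either way the buffer itself can be reused now */
        seen++;
    }
    io_uring_cq_advance(ring, seen);
}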

Async send is more suited for large sends IMO

uNetworkingAB avatar May 17 '23 22:05 uNetworkingAB

Ah, but in the case where it immediately succeeds (which is the hot path), io_uring_prep_send would be the same as io_uring_prep_sync_send. I'm a moron.

uNetworkingAB avatar May 17 '23 22:05 uNetworkingAB

The difference between io_uring_prep_sync_send and io_uring_prep_send would be that with io_uring_prep_sync_send you could do io_uring_submit_and_wait(ring, 16) without ever getting stuck, as those io_uring_prep_sync_send would immediately have a completion.

uNetworkingAB avatar May 17 '23 22:05 uNetworkingAB

Imagine IOSQE_ASYNC but IOSQE_NON_BLOCKING_ONLY

uNetworkingAB avatar May 17 '23 22:05 uNetworkingAB

In other news, I tried the "for-next" branch and for short messages I now have 42% better perf with io_uring. Going from 6.1 to "for-next" was 420k -> 467k per second for me. I just need to figure out this sending part.

uNetworkingAB avatar May 20 '23 16:05 uNetworkingAB

Okay I have another, much easier observation:

  • For small messages io_uring is way faster, but over 8kb, and especially at 16kb, there is a significant slowdown with io_uring compared to epoll, even when passing one single shared address to send.

To trigger, you can do this:

  1. Make sure you have huge pages in your pool (see the note right after this list if the pool might be empty)
  2. git clone https://github.com/uNetworking/uSockets.git
  3. cd uSockets
  4. git checkout 16kb
  5. make examples
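If the huge page pool turns out to be empty, one way to reserve some 2MB pages first (the count here is arbitrary):

sudo sysctl vm.nr_hugepages=128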

Now you have the epoll examples. Run ./tcp_server in one terminal and ./tcp_load_test 100 localhost 3000 64 in another. The 64 argument is the number of bytes to send and can be increased up to 16kb but no more (lol, buffer overflow).

Note down the performance at increasing sizes, say 64 bytes, 512 bytes, 1024 bytes, 8kb, 16kb.

Now run WITH_IO_URING=1 make examples to build the same apps with io_uring.

Do the same size varying test.

The idea is that io_uring will perform WAY better for small messages (less than 8kb for me) but will start to slow down dramatically at 16kb and above (but again, you can't test over 16kb without modifying examples/tcp_server.c and examples/tcp_load_test.c to hold more than 16kb - you'll see that char req[16kb] thing).

For me, I get 192k msg/sec with epoll at 16kb but only 160k with io_uring, and this is not even with different send buffers, just one single common one. If I pass a different send buffer for every socket it drops like a stone (but let's take up that issue later).

The fact that io_uring is slower at 16kb with the same send buffer shouldn't be a CPU cache issue, since no per-socket send buffers are involved - everything is static.

uNetworkingAB avatar May 20 '23 16:05 uNetworkingAB

Managed to get this to run on two hosts once; now it doesn't seem to want to run again for me, regardless of whether I'm using epoll or io_uring. I just see a lot of:

Client connected
Client connected
[...]

on startup on the server side, and the load side does nothing:

sendbufs: 0x7f1391ecb000
Running benchmark now...
Req/sec: 0.000000
Req/sec: 0.000000

When it did run, I didn't see any drop-offs between io_uring and epoll for 4k and 16k, though. But if you can figure out why it's not working reliably for me, I'd be happy to take a closer look and run the numbers.

I can run the localhost version too, but I generally prefer using two hosts for any network testing as it's more relevant. The two hosts are connected over a 10g link.

axboe avatar May 20 '23 20:05 axboe

Oh looks like I was indeed missing huge pages on one side... But now the sender segfaults using 16k:

axboe@r7525 ~/gi/uSockets (16kb)> ./tcp_load_test 100 intel 3000 16384         2.009s
sendbufs: 0x7fd1e97ff000
fish: Job 1, './tcp_load_test 100 intel 3000 …' terminated by signal SIGSEGV (Address boundary error)

axboe avatar May 20 '23 20:05 axboe

Yeah I also do Ethernet cable testing when I want better numbers, but so far in this development I've only done localhost. The thing is currently about as stable as a house of cards on a moving donkey. I have a 10 year old mid range Intel CPU with 6mb L3 cache.

uNetworkingAB avatar May 20 '23 21:05 uNetworkingAB

You can compile like so:

WITH_ASAN=1 WITH_IO_URING=1 make examples

if you want some hint of what blows up

uNetworkingAB avatar May 20 '23 21:05 uNetworkingAB

axboe@r7525 ~/gi/uSockets (16kb)> ./tcp_load_test 100 intel 3000 16384         3.589s
sendbufs: 0x7f13fbdff000
Running benchmark now...
AddressSanitizer:DEADLYSIGNAL
=================================================================
==9868==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x5619490df29e bp 0x604000000b90 sp 0x7ffc2f2fea30 T0)
==9868==The signal is caused by a WRITE memory access.
==9868==Hint: address points to the zero page.
    #0 0x5619490df29e in io_uring_prep_rw /usr/include/liburing.h:382
    #1 0x5619490df29e in io_uring_prep_send /usr/include/liburing.h:751
    #2 0x5619490df29e in us_socket_write_ref_counted src/io_uring/io_socket.c:110
    #3 0x5619490df29e in on_http_socket_data examples/tcp_load_test.c:43
    #4 0x5619490e097e in us_loop_run src/io_uring/io_loop.c:215
    #5 0x5619490e097e in main examples/tcp_load_test.c:133
    #6 0x7f140ea46189 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #7 0x7f140ea46244 in __libc_start_main_impl ../csu/libc-start.c:381
    #8 0x5619490e2220 in _start (/home/axboe/git/uSockets/tcp_load_test+0x5220)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /usr/include/liburing.h:382 in io_uring_prep_rw
==9868==ABORTING

axboe avatar May 21 '23 01:05 axboe

That has to be io_uring_get_sqe returning null. But how is that even possible if you have 100 connections? On localhost it sends the entire thing in one go, so you must be getting fragments and sending more and more in a chain reaction. The test doesn't actually wait for the full 16kb, just for some data - but that works fine on localhost.

I can make it more strict by only sending once the full message has been received.
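In the meantime, a defensive pattern for the SQ-full case would be something like:

#include <liburing.h>

/* If the SQ ring is full, flush what we have and try again instead of
 * handing a NULL sqe to io_uring_prep_send(). */
static struct io_uring_sqe *get_sqe_or_flush(struct io_uring *ring)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) {
        io_uring_submit(ring);
        sqe = io_uring_get_sqe(ring);
    }
    return sqe;
}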

uNetworkingAB avatar May 21 '23 07:05 uNetworkingAB

The difference between io_uring_prep_sync_send and io_uring_prep_send would be that with io_uring_prep_sync_send you could do io_uring_submit_and_wait(ring, 16) without ever getting stuck, as those io_uring_prep_sync_send would immediately have a completion.

IMHO, that's what nowait should have been there for: either you get -EAGAIN or it completes within the syscall. Sadly, nowait has never been well defined for io_uring.

isilence avatar May 21 '23 23:05 isilence

That has to be io_uring_get_sqe returning null. But how is that even possible if you have 100 connections? On localhost it sends the entire thing in one go, so you must be getting fragments and sending more and more in a chain reaction. The test doesn't actually wait for the full 16kb, just for some data - but that works fine on localhost.

I can make it more strict by only sending once the full message has been received.

Not sure it helps, but if you set MSG_WAITALL, io_uring will do recv/send retries for TCP in case of a short IO. I think it's still possible to get a short IO, but that's unlikely.
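It's just the msg flags argument, e.g.:

/* ask io_uring to retry short TCP receives internally */
io_uring_prep_recv(sqe, fd, buf, len, MSG_WAITALL);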

isilence avatar May 21 '23 23:05 isilence

My observation of the slowdown is on localhost, and on localhost the provided benchmark runs stably. It is stable because on localhost send always succeeds: with only 100 sockets involved there is always room for 16kb of data to be written immediately to the kernel socket buffers. There are only ever 100 16kb messages in transit at any given time, so on localhost the benchmark is more of an IPC benchmark.

It still has merit in showing the CPU overhead of epoll vs. io_uring, and for me it still very clearly shows a huge win for io_uring at small sizes. But at 16kb io_uring is way slower for me, and that is with both epoll and io_uring performing the exact same task: sending and receiving whole 16kb messages between two local processes. Messages are never split up on localhost for me - every message is delivered in one whole chunk - so the test is entirely 1:1, achieving the exact same thing with epoll and io_uring.

uNetworkingAB avatar May 21 '23 23:05 uNetworkingAB