liburing
`io_uring` Benchmark
Could a proper (file, socket, features) benchmark be written so others can compare io_uring speed with their software, and so on? Maybe this could also double as a test to see whether a new feature, or one being compared, yields a benefit or not.
Everyone writes their own benchmark, and I feel like they always miss something or do it wrong!
Thanks
Good idea. One standard bench is fio/t/io_uring, but there are other interesting cases like sockets. The only thing left is to find a volunteer :)
Agree it's a good idea; the problem is usually finding either the time or someone to do it... Disk I/O is simpler and already has some options, but networking is a pretty broad topic.
I have a repo which compares networking performance between io_uring and epoll with various operations:
https://github.com/CarterLi/io_uring-echo-server
I'd like to contribute if someone is interested in it.
@CarterLi, would be awesome! Does it require a 3rd-party client?
It does require a 3rd party client. Would be nice if we had a client as well for this kind of purpose.
It should be OK if it's a well-known protocol/tool, e.g. netperf. So I'm curious whether that's the case or it's yet another custom thing. From a quick look it appears to be the latter.
It's a well known client, most people end up using it.
It does require Rust, though...
It's a well known client, most people end up using it.
I can't find an AUR package, though. Not as popular as I'd wish. Or am I missing it?
I was planning to implement a client with io_uring & epoll too, but I found that async programming is rather complex.
I'd like to use some coroutine library like what I did in the example, but not ucontext, because it's slow and not available on some libc implementations.
Is it OK to use a 3rd-party library such as https://github.com/hnes/libaco?
It's a well known client, most people end up using it.
I can't find an AUR package, though. Not as popular as I'd wish. Or am I missing it?
I don't think it's a popular tool.
My io_uring-echo-server is just an echo server; the easy way to test it is to use telnet. But for benchmarking, you need some special tools.
Thanks everyone, and @CarterLi for taking on this task. To give some more context:
- This benchmark should be simple enough that anyone can use/run it with their favorite tools (we could recommend what to use)
- Code must be reproducible, so others can write something equivalent in their language/software to compare against. If you write something where only you know what's going on, it's too complex.
- The benchmark should only use functions implemented in liburing (if available); for example, if liburing has io_uring_prep_openat, it should use that and not open(). (A minimal sketch of this rule follows below.)
Let's start with a few very basic file/socket benchmarks and then go from there?
The goal is to provide a standardized benchmark! Jens posts I/O benchmarks and there is no way for anyone else to test them or run them on a different system to see if they produce similar results. By comparing a standard benchmark across different systems/setups you can improve your code, find bugs, ...
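To make the io_uring_prep_openat point concrete, here is a minimal sketch of that rule, assuming a kernel new enough to support the openat opcode (5.6+); "data.bin" is just a placeholder path and error handling is trimmed:

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int fd;

        io_uring_queue_init(8, &ring, 0);

        sqe = io_uring_get_sqe(&ring);
        /* open via io_uring instead of calling open() directly */
        io_uring_prep_openat(sqe, AT_FDCWD, "data.bin", O_RDONLY, 0);

        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);
        fd = cqe->res;  /* opened fd on success, -errno on failure */
        io_uring_cqe_seen(&ring, cqe);

        printf("openat via io_uring returned %d\n", fd);
        io_uring_queue_exit(&ring);
        return 0;
}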
Jens posts I/O benchmarks and there is no way for anyone else to test them or run them on a different system to see if they produce similar results
That's not true, what I run is publicly available, and I also tell people exactly what I run and on exactly what hardware. It's the nature of benchmarks that they will produce different results depending on your setup; I don't think that is that important. What is important is showing people what they can expect from their system, and best practices in getting the best performance out of it.
Here's the one I use as a basic echo server. It's based on the one that freevib posted a while back. Not claiming it's the best thing out there, but it's a starting point. Then I use the same client that was posted above, and run it ala:
cargo run --release -- --address "127.0.0.1:1234" --number 20 --duration 10 --length 64
which yields ~430K/sec on my test box.
#include <liburing.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <errno.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/poll.h>
#define MAX_CONNECTIONS 1024
#define BACKLOG 128
#define MAX_MESSAGE_LEN 2048
enum {
ACCEPT,
POLL_LISTEN,
POLL_NEW_CONNECTION,
READ,
WRITE,
};
struct conn_info {
unsigned fd;
unsigned type;
};
static struct conn_info conns[MAX_CONNECTIONS];
static char bufs[MAX_CONNECTIONS][MAX_MESSAGE_LEN];
static struct sockaddr_in client_addr;
static socklen_t client_len = sizeof(client_addr);
static int sock_listen_fd;
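/* queue an accept SQE on the listening socket; the completion's cqe->res carries the new connection's fd */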
static void add_accept(struct io_uring *ring, int fd, unsigned flags)
{
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
struct conn_info *conn_i = &conns[fd];
io_uring_prep_accept(sqe, fd, (struct sockaddr *) &client_addr,
&client_len, 0);
io_uring_sqe_set_flags(sqe, flags);
conn_i->fd = fd;
conn_i->type = ACCEPT;
io_uring_sqe_set_data(sqe, conn_i);
}
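/* queue a recv SQE that reads up to 'size' bytes into this connection's buffer */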
static void add_socket_read(struct io_uring *ring, int fd, size_t size,
unsigned flags)
{
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
struct conn_info *conn_i = &conns[fd];
io_uring_prep_recv(sqe, fd, &bufs[fd], size, 0);
io_uring_sqe_set_flags(sqe, flags);
conn_i->fd = fd;
conn_i->type = READ;
io_uring_sqe_set_data(sqe, conn_i);
}
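/* queue a send SQE that echoes 'size' bytes back from this connection's buffer */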
static void add_socket_write(struct io_uring *ring, int fd, size_t size,
unsigned flags)
{
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
struct conn_info *conn_i = &conns[fd];
io_uring_prep_send(sqe, fd, &bufs[fd], size, 0);
io_uring_sqe_set_flags(sqe, flags);
conn_i->fd = fd;
conn_i->type = WRITE;
io_uring_sqe_set_data(sqe, conn_i);
}
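/* process a batch of completions: accepts arm a read on the new socket, completed reads are echoed back, completed writes re-arm the read */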
static void handle_cqes(struct io_uring *ring, struct io_uring_cqe **cqes,
int cqe_count)
{
int i;
for (i = 0; i < cqe_count; ++i) {
struct io_uring_cqe *cqe = cqes[i];
struct conn_info *user_data;
user_data = (struct conn_info *) io_uring_cqe_get_data(cqe);
switch (user_data->type) {
case ACCEPT: {
int sock_conn_fd = cqe->res;
io_uring_cqe_seen(ring, cqe);
add_socket_read(ring, sock_conn_fd, MAX_MESSAGE_LEN, 0);
add_accept(ring, sock_listen_fd, 0);
break;
}
case READ: {
int bytes_read = cqe->res;
if (bytes_read <= 0) {
/*
* no bytes available on socket, client must be
* disconnected
*/
io_uring_cqe_seen(ring, cqe);
shutdown(user_data->fd, SHUT_RDWR);
} else {
/*
* bytes have been read into bufs, now add
* write to socket sqe
*/
io_uring_cqe_seen(ring, cqe);
add_socket_write(ring, user_data->fd,
bytes_read, 0);
}
break;
}
case WRITE:
// write to socket completed, re-add socket read
io_uring_cqe_seen(ring, cqe);
add_socket_read(ring, user_data->fd,
MAX_MESSAGE_LEN, 0);
break;
}
}
}
int main(int argc, char *argv[])
{
struct sockaddr_in serv_addr;
struct io_uring_params params = { };
struct io_uring ring;
const int val = 1;
int portno;
if (argc < 2) {
fprintf(stderr, "Please give a port number: %s [port]\n", argv[0]);
exit(0);
}
portno = strtol(argv[1], NULL, 10);
sock_listen_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
if (sock_listen_fd < 0) {
perror("socket");
return 1;
}
setsockopt(sock_listen_fd, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));
memset(&serv_addr, 0, sizeof(serv_addr));
serv_addr.sin_family = AF_INET;
serv_addr.sin_port = htons(portno);
serv_addr.sin_addr.s_addr = INADDR_ANY;
if (bind(sock_listen_fd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0) {
perror("Error binding socket..\n");
exit(1);
}
if (listen(sock_listen_fd, BACKLOG) < 0) {
perror("Error listening..\n");
exit(1);
}
printf("%d: Listening for connections on port: %d\n", getpid(), portno);
if (io_uring_queue_init_params(MAX_CONNECTIONS, &ring, &params) < 0) {
perror("io_uring_init_failed...\n");
exit(1);
}
if (!(params.features & IORING_FEAT_FAST_POLL)) {
fprintf(stderr, "IORING_FEAT_FAST_POLL not available\n");
exit(0);
}
/* add first accept sqe, to monitor for new incoming connections */
add_accept(&ring, sock_listen_fd, 0);
while (1) {
struct io_uring_cqe *cqes[BACKLOG];
struct io_uring_cqe *cqe;
int ret, cqe_count;
io_uring_submit(&ring);
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret != 0) {
perror("Error io_uring_wait_cqe\n");
exit(1);
}
/*
* check how many cqe's are on the cqe ring, and put these
* cqe's in an array
*/
cqe_count = io_uring_peek_batch_cqe(&ring, cqes,
sizeof(cqes) / sizeof(cqes[0]));
handle_cqes(&ring, cqes, cqe_count);
}
return 0;
}
Here's an example using 500 clients; I'm running this:
cargo run --release -- --address "127.0.0.1:1234" --number 500 --duration 10 --length 64
and using the echo server included above:
Benchmarking: 127.0.0.1:1234
500 clients, running 64 bytes, 10 sec.
Speed: 199191 request/sec, 199190 response/sec
Requests: 1991911
Responses: 1991909
and for an equivalent epoll-based server:
Benchmarking: 127.0.0.1:1234
500 clients, running 64 bytes, 10 sec.
Speed: 184895 request/sec, 184895 response/sec
Requests: 1848956
Responses: 1848953
This is a kernel with none of the mitigations enabled.
That's not true, what I run is publicly available,
Oh, OK, sorry Jens, I wasn't trying to blame you; I didn't know they were available or where they were. I meant that if you had such tools in liburing/benchmark, people could easily access and use them.
Here's the one I use as a basic echo server. It's based on the one that freevib posted a while back. Not claiming it's the best thing out there, but it's a starting point. Then I use the same client that was posted above, and run it ala:
Thanks Jens, this is what I meant by everyone having their own benchmark! If something like this could be added in liburing/benchmark, everyone could use it instead of creating their own, and it would help improve your code as well. It's a win-win thing.
Oh, OK, sorry Jens, I wasn't trying to blame you; I didn't know they were available or where they were. I meant that if you had such tools in liburing/benchmark, people could easily access and use them.
It's just important to note that it's not the case that I just post random things and nobody knows where they're from. For the more "official" posts, I reference it directly. And for a lot of others, I always mention that it's the usual test. More often than not people will then ask, and I reply. The benchmark may not be included with liburing, but it is included with fio. It doesn't use liburing as it so happens; I wrote that tool before liburing was really a thing. Hence it uses the kernel interface directly, and you can find it here:
https://git.kernel.dk/cgit/fio/tree/t/io_uring.c
Might be interesting to turn it into a liburing API instead. Would potentially also help with making sure that liburing is as efficient as it can be. Only reason I haven't done that is that I don't want to maintain two different copies of it. If it was converted and we're happy with it, then the raw API one could be dropped.
The code used by Axboe can be optimised more:
- io_uring_submit & io_uring_wait_cqe can be combined into io_uring_submit_and_wait to avoid an extra syscall
- io_uring_for_each_cqe can be used instead of io_uring_peek_batch_cqe to avoid an extra data copy
- multiple io_uring_cqe_seen calls can be combined into one io_uring_cq_advance
I believe that the author has updated the code in the original repo. You may use it for better results.
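For reference, a rough sketch of the echo server's main loop with those three changes applied; handle_cqe() here is a hypothetical per-CQE handler (essentially the switch body of handle_cqes() above, minus the io_uring_cqe_seen() calls):

while (1) {
        struct io_uring_cqe *cqe;
        unsigned head, count = 0;

        /* one syscall: submit pending SQEs and wait for at least one CQE */
        io_uring_submit_and_wait(&ring, 1);

        /* walk the CQ ring in place instead of copying CQEs into an array */
        io_uring_for_each_cqe(&ring, head, cqe) {
                handle_cqe(&ring, cqe);
                count++;
        }

        /* retire all processed CQEs in one step */
        io_uring_cq_advance(&ring, count);
}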
I personally am interested in splice for zero-copy. But in my old quick test with epoll it performed even worse than recv/send. Let's see if splice with io_uring will be better.
EDIT: the man page says that SPLICE_F_MOVE is not implemented and is simply ignored by the kernel. Therefore splice does copy between the file and the pipe buffer.
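For anyone curious, a hedged sketch of what echoing one chunk via splice might look like with io_uring; the ring, a connected socket sock_fd, and a pipe pipe_fds[2] are assumed to be set up elsewhere, and real code would size the second splice from the first one's completion instead of reusing MAX_MESSAGE_LEN:

struct io_uring_sqe *sqe;

/* socket -> pipe; link so the next splice only runs after this one completes */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_splice(sqe, sock_fd, -1, pipe_fds[1], -1, MAX_MESSAGE_LEN, 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

/* pipe -> socket */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_splice(sqe, pipe_fds[0], -1, sock_fd, -1, MAX_MESSAGE_LEN, 0);

io_uring_submit(&ring);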
It'd be great if someone would take on that project of providing a reference implementation; I'm not saying what I posted is it, it's just what I've run in the past. I just don't have time to take on more on that front. And I do question the validity of a simple echo server, but at the same time it's a nice and simple demo. I'm going to leave that to the folks that would potentially do this work.
Samba uses io_uring with splice and has posted some impressive numbers, so perhaps that'd be interesting.
Got some numbers: https://github.com/CarterLi/liburing4cpp/tree/master#benchmark
io_uring has overall better performance than epoll and Linux AIO thanks to syscall batching, except:
- io_uring's IORING_OP_POLL_ADD is slightly slower than Linux AIO's IOCB_CMD_POLL under high load (500 connections).
- IORING_OP_SPLICE is very slow, much slower than splice(2).
There are no major performance benefits to using splice over recv/send. I think that's because SPLICE_F_MOVE is not implemented in the kernel.
I will benchmark IORING_SETUP_SQPOLL, fixed files, and fixed buffers later.
Good work. A couple of notes, though:
- poll: it has never been of much importance, but it looks like there are enough parties now interested in making it performant, so hopefully we'll get there.
- splice is slow most probably because it's punted to io-wq; there are internal reasons for that. We can try to revisit it if it's interesting.
- splice is slow most probably because it's punted to io-wq; there are internal reasons for that. We can try to revisit it if it's interesting.
It seems that splice will be punted to io-wq even if it could be completed inline (i.e. prefixing splice with a poll won't help). It definitely needs revisiting.
I wrote a performance testbed for QUIC servers several months ago, to compare the performance of all the open source QUIC implementations, but also to compare io_uring vs syscalls. I haven't touched it in a few months. At the moment the only test it implements is a maximum throughput test, where recvmmsg and sendmmsg always outperform io_uring sending and recving single messages at a time (I forget the differential).
But this traffic pattern is totally alien to that of my application server, where it's... a timeout fires and one PING is sent... or the server finishes processing a client message and sends back a single GSO packet. So I'm certain that on the send path io_uring will be significantly more efficient. The recv path I'm unsure of, and I would have to test on recorded production traffic, because it depends on how many packets are waiting on the socket to be received in between application work being done. The entire server is based on C++20 coroutines, so we yield on every database or 3rd-party API wait... so likely io_uring will still be the better choice.
If we had the ability to send and recv multiple messages at once with io_uring, then it would surely always be faster than syscalls.
I share this anecdote to illustrate how often traffic and application patterns totally dictate the best interface choice.
Anyway, here's the link: https://github.com/victorstewart/quicperf
If you try to build it and it breaks, just let me know and I'll fix it.
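On the point about multiple messages at once: a hedged sketch of the closest thing available today, queueing one recvmsg SQE per expected datagram and submitting the whole batch with a single syscall. The ring, sock_fd, and the batch size are assumptions, and the ring must have at least BATCH entries:

#define BATCH 32

struct msghdr msgs[BATCH];
struct iovec iovs[BATCH];
char bufs[BATCH][2048];

for (int i = 0; i < BATCH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len = sizeof(bufs[i]);
        memset(&msgs[i], 0, sizeof(msgs[i]));
        msgs[i].msg_iov = &iovs[i];
        msgs[i].msg_iovlen = 1;

        /* one SQE per datagram; completions arrive independently */
        io_uring_prep_recvmsg(sqe, sock_fd, &msgs[i], 0);
        io_uring_sqe_set_data(sqe, &msgs[i]);
}

/* a single syscall submits all BATCH receives */
io_uring_submit(&ring);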
I found that SETUP_SQPOLL has a negative impact on performance. Does anyone have benchmark results for SETUP_SQPOLL too?
I got similar results with and without FIXED_FILE, while the code using FIXED_FILE was much more complex than the code without it. I think FIXED_FILE should only be used with SETUP_IOPOLL. I didn't bench SETUP_IOPOLL, though.
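For anyone trying to reproduce this, a hedged sketch of enabling the two features being compared; the queue depth, idle value, and registering just the listening socket are illustrative assumptions, not recommendations (note also that older kernels require registered files, and extra privileges, when SQPOLL is enabled):

struct io_uring_params params = { };
struct io_uring ring;

params.flags = IORING_SETUP_SQPOLL;     /* kernel thread polls the SQ ring */
params.sq_thread_idle = 2000;           /* ms of idle before that thread sleeps */
io_uring_queue_init_params(256, &ring, &params);

/* fixed files: register fds once, then reference them by index from SQEs
 * that have IOSQE_FIXED_FILE set */
int fds[1] = { sock_listen_fd };
io_uring_register_files(&ring, fds, 1);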
- splice is slow most probably because it's punted to io-wq; there are internal reasons for that. We can try to revisit it if it's interesting.
It seems that splice will be punted to io-wq even if it could be completed inline (i.e. prefixing splice with a poll won't help).
Right
I found that SETUP_SQPOLL has a negative impact on performance. Does anyone have benchmark results for SETUP_SQPOLL too?
It easily can; it doesn't always guarantee a performance win.
I got similar results with and without FIXED_FILE, while the code using FIXED_FILE was much more complex than the code without it. I think FIXED_FILE should only be used with SETUP_IOPOLL.
It's definitely not IOPOLL-only, but it indeed depends on your workload. If it spends a lot of time anywhere but in io_uring, e.g. in the overhead of the network layer, the win from fixed files may just be negligible. But there are enough non-IOPOLL cases where it does matter.
I wrote a performance testbed for QUIC servers several months ago, to compare the performance of all the open source QUIC implementations, but also to compare io_uring vs syscalls. I haven't touched it in a few months. At the moment the only test it implements is a maximum throughput test, where recvmmsg and sendmmsg always outperform io_uring sending and recving single messages at a time (I forget the differential).
@victorstewart, I'm curious: is it "only 1 request in flight at a time" or just submitting one at a time?
If the former, it won't beat the sync version (unless there are some features that can't be used without io_uring). The latter is expensive as well.
@isilence Submitting hundreds at once. I forget how many, but I did rough experiments to isolate the optimal batch size for throughput.
@isilence Submitting hundreds at once. I forget how many, but I did rough experiments to isolate the optimal batch size for throughput.
OK, we need to try it out then and see where all the overhead is.