
Buffer size and `io_uring` ring queue depth performance boost

Open • rvineet02 opened this issue 2 years ago • 11 comments

Hi, I'm hoping to take advantage of io_uring to improve throughput for a networking I/O application. I am running the application on Ubuntu, kernel version 5.15.0-60-generic.

A gist with my client and server implementations can be found here.

When running with 128 threads, I am able to saturate the network and maximize throughput. However, I would like to maximize throughput with fewer threads, enabling me to run the application on less beefy machines.

In order to achieve this, I reasoned that increasing the buffer size and/or the entries value in io_uring_queue_init should increase the number of bytes being sent across the network. But this is not the case: the throughput stays the same when varying either the buffer size or the ring's queue depth (doubling each up to 64k).
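For context, here is a minimal sketch (placeholder names, not the actual gist code) of the two knobs being varied: the entries value passed to io_uring_queue_init and the size of the buffer handed to each request.

	#include <liburing.h>
	#include <stdio.h>
	#include <string.h>

	#define QUEUE_DEPTH 2048   /* "entries" passed to io_uring_queue_init() */
	#define BUFFER_SIZE 16384  /* bytes handed to each send/recv request */

	/* Sketch: a larger queue depth only bounds how many requests can be
	 * in flight at once; the bytes moved per request come from the
	 * buffer size used when preparing each SQE. */
	static int setup_ring(struct io_uring *ring)
	{
		int ret = io_uring_queue_init(QUEUE_DEPTH, ring, 0);

		if (ret < 0)
			fprintf(stderr, "queue_init: %s\n", strerror(-ret));
		return ret;
	}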

I was wondering if there is some issue in my client/server io_uring implementation.

Please let me know if you would like the profiling output from perf.

rvineet02 • Jul 17 '23 20:07

> In order to achieve this, I reasoned that increasing the buffer size and/or the entries value in io_uring_queue_init should increase the number of bytes being sent across the network.

I think I spotted a mistake in the recv part:

	#define BUFFER_SIZE 16384

	// create a buffer
	char buffer[BUFFER_SIZE];
	struct iovec iov = {
			.iov_base = buffer,
			.iov_len = sizeof(buffer)};

	// prepare the readv operation
	io_uring_prep_recv(sqe, client_sock, &iov, 1, 0);

No matter how big your BUFFER_SIZE is, it will only read 1 byte from the socket. recv() is not readv(). The fourth argument here is the number of bytes you're willing to read from the socket, not the number of iovecs in the array.

Your comment indicates that you want to use readv(), but the code actually uses recv().

If you use io_uring_prep_recv(), it should look like this (no struct iovec):

	#define BUFFER_SIZE 16384

	// create a buffer
	char buffer[BUFFER_SIZE];

	// prepare the recv operation
	io_uring_prep_recv(sqe, client_sock, buffer, BUFFER_SIZE, 0);

Also, note: you don't have to use readv() for reading from the socket, just use recv(). recv() performs better for socket operations in io_uring; it's specialized to handle sockets. The same goes for send() vs writev().

ammarfaizi2 • Jul 17 '23 22:07

Thanks for the catch 👍

EDIT: would it make sense to do the same for the client as well? Something like this:

io_uring_prep_writev(sqe, sock, &iov, some_val, 0);

versus what I have currently:

io_uring_prep_writev(sqe, sock, &iov, 1, 0);

In effect, does increasing the number of vecs improve performance?
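For illustration, a multi-iovec writev submission would look roughly like the sketch below (hypothetical NR_VECS and CHUNK values, not code from the gist); note that the fourth argument to io_uring_prep_writev() counts iovecs, unlike the byte count taken by recv/send.

	#include <liburing.h>
	#include <sys/uio.h>

	#define NR_VECS 4
	#define CHUNK   4096

	/* Sketch: one writev SQE carrying several buffers. The kernel treats
	 * the iovec array as a single scatter/gather write, so a single SQE
	 * moves NR_VECS * CHUNK bytes. */
	static void queue_writev(struct io_uring *ring, int sock,
				 char bufs[NR_VECS][CHUNK])
	{
		/* static so the array stays valid until submission; this
		 * sketch assumes one writev is prepared per submit */
		static struct iovec iov[NR_VECS];
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		int i;

		if (!sqe)
			return; /* submission queue full; submit first */

		for (i = 0; i < NR_VECS; i++) {
			iov[i].iov_base = bufs[i];
			iov[i].iov_len = CHUNK;
		}
		io_uring_prep_writev(sqe, sock, iov, NR_VECS, 0);
	}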

rvineet02 • Jul 18 '23 11:07

sorry for the close/re-open - clicked it by accident

rvineet02 • Jul 18 '23 11:07

> Thanks for the catch +1
>
> EDIT: would it make sense to do the same for the client as well? Something like this:
>
> io_uring_prep_writev(sqe, sock, &iov, BUFFER_SIZE, 0);
>
> versus what I have currently:
>
> io_uring_prep_writev(sqe, sock, &iov, 1, 0);

What you have currently with writev() is ok. But I would suggest using send(), so it will be like this:

	io_uring_prep_send(sqe, sock, buffer, BUFFER_SIZE, 0);

You can remove your struct iovec that way.
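Putting that together, a minimal client-side send path using io_uring_prep_send() might look like the sketch below (illustrative only, not the gist code, with error handling trimmed to the essentials).

	#include <liburing.h>

	#define BUFFER_SIZE 16384
	#define BATCH       64   /* requests queued per io_uring_submit() call */

	/* Sketch: queue a batch of sends, submit them with one syscall, then
	 * reap one CQE per submitted request. Returns total bytes sent. */
	static long send_batch(struct io_uring *ring, int sock, const char *buf)
	{
		struct io_uring_cqe *cqe;
		long sent = 0;
		int i, ret;

		for (i = 0; i < BATCH; i++) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

			if (!sqe)
				break; /* submission queue is full */
			io_uring_prep_send(sqe, sock, buf, BUFFER_SIZE, 0);
		}

		ret = io_uring_submit(ring);
		if (ret < 0)
			return ret;

		for (; ret > 0; ret--) {
			if (io_uring_wait_cqe(ring, &cqe) < 0)
				break;
			if (cqe->res > 0)
				sent += cqe->res;
			io_uring_cqe_seen(ring, cqe);
		}
		return sent;
	}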

ammarfaizi2 • Jul 18 '23 11:07

When I modify the server to use the BUFFER_SIZE parameter instead, I get a segfault.

I updated the code in the gist. I am getting the segfault when attempting to wait for a completion event.

I'm getting a null-pointer dereference at this line:

		ret = io_uring_wait_cqe(&ring, &cqe);

The same happens when I try to use peek as well.

rvineet02 • Jul 18 '23 12:07

You probably have a bad install with a mix of distro liburing packages and headers and ones you installed from source yourself. Clean that out and stick with one version where the headers and library match.

axboe • Jul 18 '23 12:07

From a quick glance at your gist, you're also still using an iovec with recv. It takes the buffer and length, not an iovec. This is probably your crash as well, as you're going to be corrupting your stack when the receive happens.
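For reference, a corrected server-side receive path might look like the sketch below (an illustration that assumes the ring and client socket are already set up): the buffer and a byte count go straight into io_uring_prep_recv(), and the return value of io_uring_wait_cqe() is checked before the CQE is dereferenced.

	#include <liburing.h>
	#include <errno.h>

	#define BUFFER_SIZE 16384

	/* Sketch: receive into a plain buffer (no struct iovec). Returns the
	 * number of bytes received, 0 on EOF, or a negative errno value. */
	static int recv_one(struct io_uring *ring, int client_sock, char *buf)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		struct io_uring_cqe *cqe;
		int ret;

		if (!sqe)
			return -EBUSY; /* submission queue is full */

		io_uring_prep_recv(sqe, client_sock, buf, BUFFER_SIZE, 0);

		ret = io_uring_submit(ring);
		if (ret < 0)
			return ret;

		/* Check the return value before touching the CQE. */
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;

		ret = cqe->res;
		io_uring_cqe_seen(ring, cqe);
		return ret;
	}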

axboe • Jul 18 '23 12:07

Yup, that was the issue, my bad. But even after making these changes on both the client and server, I am still seeing roughly the same throughput when increasing the buffer size.

At this point, could it be the network that is the bottleneck?

rvineet02 • Jul 18 '23 12:07

> At this point, could it be the network that is the bottleneck?

Your information severely lacks detail. It is evident that you failed to adhere to the given advice regarding the usage of send() and recv(). Surprisingly, you continue to utilize writev() despite the guidance provided. Additionally, your expectation remains unclear.

I insist that you promptly present us with concrete numerical data for the purpose of comparison. And also, provide a functional test code that actually works, once you have made the necessary fixes.

alviroiskandar • Jul 18 '23 23:07

Hi, sorry about the lack of details. I have updated the gist.

I am running all experiments with the following defaults: 1 thread, 2048 ring depth.

Running on Chameleon Cloud, I am able to see an increase in throughput when increasing the buffer size:

$ ./src/client2 -t 1 -q 2048 -b 1024
Total requests sent: 9435589
MB Sent: 9214
Throughput in MB/s: 614

$ ./src/client2 -t 1 -q 2048 -b 2048
Total requests sent: 6137781
MB Sent: 11986
Throughput in MB/s: 799

$ ./src/client2 -t 1 -q 2048 -b 4096
Total requests sent: 3456695
MB Sent: 13497
Throughput in MB/s: 899

$ ./src/client2 -t 1 -q 2048 -b 8192
Total requests sent: 1752411
MB Sent: 13678
Throughput in MB/s: 911

$ ./src/client2 -t 1 -q 2048 -b 16384
Total requests sent: 1008269
MB Sent: 15715
Throughput in MB/s: 1047

$ ./src/client2 -t 1 -q 2048 -b 32768
Total requests sent: 490733
MB Sent: 15256
Throughput in MB/s: 1017

$ ./src/client2 -t 1 -q 2048 -b 65536
Total requests sent: 244174
MB Sent: 15174
Throughput in MB/s: 1047

Running the same experiment on AWS, I get:


$ ./src/client2 -t 1 -q 2048 -b 1024
Total requests sent: 17545260
MB Sent: 17127
Throughput in MB/s: 570

$ ./src/client2 -t 1 -q 2048 -b 2048
Total requests sent: 8778865
MB Sent: 17128
Throughput in MB/s: 571

$ ./src/client2 -t 1 -q 2048 -b 4096
Total requests sent: 4389254
MB Sent: 17128
Throughput in MB/s: 571

$ ./src/client2 -t 1 -q 2048 -b 8192
Total requests sent: 2199115
MB Sent: 17128
Throughput in MB/s: 571

$ ./src/client2 -t 1 -q 2048 -b 16384
Total requests sent: 1102643
MB Sent: 17129
Throughput in MB/s: 571

$ ./src/client2 -t 1 -q 2048 -b 32768
Total requests sent: 551337
MB Sent: 17130
Throughput in MB/s: 571

$ ./src/client2 -t 1 -q 2048 -b 65536
Total requests sent: 275687
MB Sent: 17131
Throughput in MB/s: 571

What could be the reason that even the baseline throughput is much lower on AWS, and why does the buffer size not affect throughput in this case?

Using iperf3, the network bitrate is ~4.69 Gbps on AWS and ~1.81 Gbps on Chameleon Cloud.

rvineet02 • Jul 28 '23 16:07

Given the iperf numbers, it appears you've roughly saturated the network in the AWS case, using just one thread instead of 128. 571 MB/s = 4.56 Gbps.

ryanseipp • Aug 06 '23 17:08