
EAGAINs impacting the performance of io_uring

Open vikramv3 opened this issue 1 year ago • 8 comments

When testing an io_uring implementation using a file read test that performs random reads of size 1 MB each (on a multi-node cluster), I'm running into high performance variability due to a large number of requests failing with EAGAIN. The implementation retries the requests that fail with EAGAIN. The nodes on which the number of EAGAINs is high tend to have a lower read throughput. Sometimes the number of EAGAINs on a node is very large, and the resulting throughput is lower than an implementation that uses just synchronous read system calls. All the nodes have the same configuration.

I have ensured that the number of in-flight requests submitted doesn't exceed the submission queue depth, and also that the number of in-flight requests is less than the number of io_uring's async bounded workers. However, I'm still seeing a large percentage of requests (up to 50% on some nodes) failing with EAGAIN. Are there other sources of EAGAINs that I haven't considered? Is there a way to identify the root cause of the EAGAINs?

I am using liburing 2.4 (IOPOLL enabled, SQ depth = 128, CQ depth = 256). Each node has 48 CPUs, Linux 5.10, an ext4 file system, and 4 NVMe SSDs in RAID. The file reads are evenly distributed across the 4 disks of each node.
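For context, the submit/retry path looks roughly like the sketch below (simplified; names and structure are illustrative, not my exact code):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <liburing.h>

#define READ_SIZE (1UL << 20)           /* 1 MB per read, as in the test */

/* Per-request bookkeeping so a failed read can be retried. */
struct req {
    int    fd;
    void  *buf;
    off_t  off;
};

static int queue_read(struct io_uring *ring, struct req *r)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return -EBUSY;                  /* SQ full: reap completions first */
    io_uring_prep_read(sqe, r->fd, r->buf, READ_SIZE, r->off);
    io_uring_sqe_set_data(sqe, r);
    return 0;
}

/* Reap one completion; on -EAGAIN the request is simply queued again. */
static void reap_one(struct io_uring *ring, unsigned long *eagains)
{
    struct io_uring_cqe *cqe;

    if (io_uring_wait_cqe(ring, &cqe) < 0)
        return;
    struct req *r = io_uring_cqe_get_data(cqe);
    if (cqe->res == -EAGAIN) {
        (*eagains)++;
        queue_read(ring, r);            /* retry the same request */
        io_uring_submit(ring);
    } else if (cqe->res < 0) {
        fprintf(stderr, "read failed: %s\n", strerror(-cqe->res));
    }
    io_uring_cqe_seen(ring, cqe);
}

int main(void)
{
    struct io_uring ring;
    struct io_uring_params p = {
        .flags      = IORING_SETUP_IOPOLL | IORING_SETUP_CQSIZE,
        .cq_entries = 256,
    };

    if (io_uring_queue_init_params(128, &ring, &p) < 0)   /* SQ depth 128 */
        return 1;
    /* ... open files with O_DIRECT, queue reads, submit, reap ... */
    io_uring_queue_exit(&ring);
    return 0;
}
```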

@axboe, @isilence - I would really appreciate your guidance on this. Thanks a lot for your time.

vikramv3 avatar Jul 01 '24 08:07 vikramv3

Use a 6.x based kernel.

axboe avatar Jul 01 '24 12:07 axboe

Thanks @axboe! I'll test this out. Have some modifications gone into io_uring in 6.x with regard to EAGAINs?

vikramv3 avatar Jul 01 '24 16:07 vikramv3

Yes, otherwise I would not be suggesting that.

axboe avatar Jul 01 '24 16:07 axboe

Hi @axboe. Sorry about the delayed response.

I haven't been able to test on 6.x as I'm unable to migrate immediately, but I hope to soon. In the meantime, I tried the following, which helped reduce the EAGAINs.

I brought the SQ depth down to 16 and the CQ depth down to 32, and set IOSQE_ASYNC on all submissions. With this, the EAGAINs dropped to ~0.1% of the total requests. The max number of bounded workers and the max number of in-flight requests are both set to the CQ depth.
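In case it helps, the tuning described above maps roughly to the following sketch (simplified; names are illustrative, not my exact code):

```c
#include <sys/types.h>
#include <liburing.h>

/* Ring setup: SQ depth 16, CQ depth 32, bounded io-wq workers capped at 32.
 * A value of 0 in the second slot leaves the unbounded-worker limit as-is. */
static int setup_ring(struct io_uring *ring)
{
    struct io_uring_params p = {
        .flags      = IORING_SETUP_CQSIZE,
        .cq_entries = 32,
    };
    unsigned int max_workers[2] = { 32, 0 };
    int ret;

    ret = io_uring_queue_init_params(16, ring, &p);
    if (ret < 0)
        return ret;
    return io_uring_register_iowq_max_workers(ring, max_workers);
}

/* Every read is flagged IOSQE_ASYNC after being prepared. */
static void queue_async_read(struct io_uring *ring, int fd, void *buf,
                             unsigned len, off_t off)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return;                                 /* SQ full: reap first */
    io_uring_prep_read(sqe, fd, buf, len, off);
    io_uring_sqe_set_data(sqe, buf);
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);   /* force io-wq offload */
}
```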

However, with this approach I'm observing something strange with the async worker threads. My workload issues 1 MB reads. Some nodes in the cluster show very good read throughput, while others are much lower (~50% less). All nodes have the same configuration.

On digging deeper, I found that the number of iou-wq threads spawned on the good nodes stays near 32. On the bad nodes it reaches 32 initially, then immediately drops to roughly 5 to 15 threads and stays there for the remainder of the workload. I assumed that IOSQE_ASYNC always offloads the request to an async worker, so I'm not sure why the number of worker threads isn't near 32 on all nodes.

I'm also ensuring that the client itself is not the bottleneck and that the number of in-flight requests stays near 32. io_uring_sqe_set_flags(sqe, IOSQE_ASYNC) is invoked after the call to io_uring_prep_read. I'm using the default interrupt-driven mode of io_uring. The implementation uses two different threads to process submissions and completions, and the FDs are opened in direct mode.
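For reference, this is roughly how I'm counting the worker threads. The "iou-wrk" thread-name prefix is an assumption for kernels where the workers appear as threads of the submitting process; on older kernels they may show up as separately named kernel threads instead, so the prefix may need adjusting:

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Walk /proc/self/task and count threads whose comm starts with "iou-wrk".
 * The prefix is an assumption; check what your kernel actually reports. */
static int count_iou_workers(void)
{
    DIR *d = opendir("/proc/self/task");
    struct dirent *de;
    char path[256], comm[64];
    int count = 0;

    if (!d)
        return -1;
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "/proc/self/task/%s/comm", de->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(comm, sizeof(comm), f) && strncmp(comm, "iou-wrk", 7) == 0)
            count++;
        fclose(f);
    }
    closedir(d);
    return count;
}
```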

Your inputs on what might be going wrong here would be very helpful. Thanks.

vikramv3 avatar Jul 18 '24 23:07 vikramv3

@isilence / @axboe I realized that my comment above is too long. Summarizing my query...

I see that the number of iou-wq workers spawned is much lower than the number of in-flight requests, even though all submissions have IOSQE_ASYNC set. I observe this behavior on only some nodes (not all) of my cluster. The max number of bounded workers has been set to the max number of in-flight requests using io_uring_register_iowq_max_workers.

Is there any kernel parameter that might be influencing this? Are there any limitations with io_uring in 5.x with respect to these worker threads that might be causing this behavior? Please let me know. Thanks for your time.
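For verification, I also read the currently configured limits back on each node. If I understand the man page correctly, passing zeros leaves the values unchanged and returns the existing settings (a sketch):

```c
#include <stdio.h>
#include <string.h>
#include <liburing.h>

/* Passing { 0, 0 } changes nothing; the call writes the currently
 * configured bounded/unbounded maximums back into the array, which is
 * handy for checking the limit actually took effect on every node. */
static void print_iowq_limits(struct io_uring *ring)
{
    unsigned int vals[2] = { 0, 0 };
    int ret = io_uring_register_iowq_max_workers(ring, vals);

    if (ret == 0)
        printf("bounded=%u unbounded=%u\n", vals[0], vals[1]);
    else
        fprintf(stderr, "register_iowq_max_workers: %s\n", strerror(-ret));
}
```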

vikramv3 avatar Jul 24 '24 22:07 vikramv3

@vikramv3 there was a bug in IOSQE_ASYNC; not sure if it's also the cause of your issue. https://github.com/axboe/liburing/issues/1181#issuecomment-2227461679

YoSTEALTH avatar Jul 26 '24 19:07 YoSTEALTH

@vikramv3 there was a bug in IOSQE_ASYNC; not sure if it's also the cause of your issue. #1181 (comment)

There was a problem with getsockopt, not IOSQE_ASYNC.

isilence avatar Jul 26 '24 20:07 isilence

I haven't looked into the question about the number of threads spawned, but in general IOSQE_ASYNC is a bad idea. Instead of doing asynchronous IO, or dedicating just one CPU/core to IOPOLL'ing, it spawns a ton of io-wq threads, each consuming CPU. Check global CPU consumption; it has likely spiked with IOSQE_ASYNC.
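A crude way to eyeball that, if you don't already have monitoring in place (just a sketch that samples the aggregate "cpu" line of /proc/stat twice and computes the busy fraction in between):

```c
#include <stdio.h>
#include <unistd.h>

/* Read the aggregate "cpu" line of /proc/stat (values are in clock ticks). */
static void read_cpu(unsigned long long *busy, unsigned long long *total)
{
    unsigned long long v[10] = { 0 };
    FILE *f = fopen("/proc/stat", "r");

    *busy = *total = 0;
    if (!f)
        return;
    fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
           &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7], &v[8], &v[9]);
    fclose(f);

    for (int i = 0; i < 10; i++)
        *total += v[i];
    *busy = *total - v[3] - v[4];      /* exclude idle and iowait */
}

int main(void)
{
    unsigned long long b0, t0, b1, t1;

    read_cpu(&b0, &t0);
    sleep(5);                          /* sample window while the test runs */
    read_cpu(&b1, &t1);
    printf("busy: %.1f%%\n", 100.0 * (b1 - b0) / (double)(t1 - t0));
    return 0;
}
```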

Also, downsizing the SQ and CQ wouldn't affect performance in a good way (apart from extreme cases).

isilence avatar Jul 26 '24 20:07 isilence