liburing
My benchmark for multi-threaded writes
I've created a benchmark program to test my particular workflow: multi-threaded writes to generate a TIFF file. This is a CMake project that is easy to build and run.
The benchmark divides an image into M strips and uses a thread pool to do some work on each strip and then write the strip to libtiff. Each thread manages its own ring - the work queue is shared. For comparison, the benchmark also supports synchronous writes.
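For illustration, the per-thread write path looks roughly like the sketch below; the names are placeholders rather than the benchmark's actual classes, and error handling is trimmed.

```cpp
// Minimal sketch of the per-thread write path (placeholder names, not the
// benchmark's actual classes). Each worker owns its own ring, so no locking
// is needed around submission or completion.
#include <liburing.h>
#include <sys/types.h>
#include <cstdlib>

struct StripWriter {
    io_uring ring{};

    explicit StripWriter(unsigned depth) {
        if (io_uring_queue_init(depth, &ring, 0) < 0)
            std::abort();                       // sketch: no real error handling
    }
    ~StripWriter() { io_uring_queue_exit(&ring); }

    // Queue one strip at its file offset; the write completes asynchronously.
    bool queueStrip(int fd, const void* buf, unsigned len, off_t off) {
        io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        if (!sqe)
            return false;                       // SQ full: reap completions first
        io_uring_prep_write(sqe, fd, buf, len, off);
        return io_uring_submit(&ring) == 1;
    }

    // Reap one completion; returns bytes written, or -errno on failure.
    int waitOne() {
        io_uring_cqe* cqe = nullptr;
        io_uring_wait_cqe(&ring, &cqe);
        int res = cqe->res;
        io_uring_cqe_seen(&ring, cqe);
        return res;
    }
};
```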
Here are some outputs from /usr/bin/time for various IO configurations for writing a 55000x88000 TIFF file on the following system:
Fedora 36 desktop: 5.17 kernel, XFS file system, 48-thread Threadripper with PCIe 4.0 NVMe SSD, and latest liburing.
URING
uring write with direct IO (2 threads)
Run with concurrency = 2, store to disk = 1, direct = 1, use uring = 1
: 714.831964 ms
Command being timed: "./iobench -c 2 -d -f /run/media/xfs/io_out.tif"
User time (seconds): 0.91
System time (seconds): 0.60
Percent of CPU this job got: 194%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.78
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 741164
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 184445
Voluntary context switches: 3282
Involuntary context switches: 7
Swaps: 0
File system inputs: 25
File system outputs: 5500888
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
uring write with direct IO (48 threads)
Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 1
: 892.141297 ms
Command being timed: "./iobench -c 0 -d -f /run/media/xfs/out.tif"
User time (seconds): 1.92
System time (seconds): 6.98
Percent of CPU this job got: 915%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1367200
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 2
Minor (reclaiming a frame) page faults: 342316
Voluntary context switches: 184711
Involuntary context switches: 654
Swaps: 0
File system inputs: 176
File system outputs: 5500888
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
uring write with buffered IO (48 threads)
Run with concurrency = 48, store to disk = 1, direct = 0, use uring = 1
: 1779.148543 ms
Command being timed: "./iobench -c 0 -f /run/media/xfs/out.tif"
User time (seconds): 2.05
System time (seconds): 31.08
Percent of CPU this job got: 1855%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.78
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2401952
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 601281
Voluntary context switches: 47107
Involuntary context switches: 2326
Swaps: 0
File system inputs: 0
File system outputs: 5500880
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
SYNCHRONOUS
synchronous write with direct IO (2 threads)
Run with concurrency = 2, store to disk = 1, direct = 1, use uring = 0
: 996.828683 ms
Command being timed: "./iobench -c 2 -s -d -f /run/media/xfs/io_out.tif"
User time (seconds): 0.93
System time (seconds): 0.23
Percent of CPU this job got: 115%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.01
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 52896
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 12287
Voluntary context switches: 1617
Involuntary context switches: 162
Swaps: 0
File system inputs: 33
File system outputs: 5500888
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
synchronous write with direct IO (48 threads)
Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 0
: 945.703310 ms
Command being timed: "./iobench -c 0 -d -s -f /run/media/xfs/out.tif"
User time (seconds): 1.92
System time (seconds): 1.34
Percent of CPU this job got: 294%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.11
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 181516
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 44475
Voluntary context switches: 7786
Involuntary context switches: 216
Swaps: 0
File system inputs: 8
File system outputs: 5500888
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
synchronous write with buffered IO (48 threads)
Run with concurrency = 48, store to disk = 1, direct = 0, use uring = 0
: 1713.267466 ms
Command being timed: "./iobench -c 0 -s -f /run/media/xfs/out.tif"
User time (seconds): 1.61
System time (seconds): 30.54
Percent of CPU this job got: 1870%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.71
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 131544
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 33527
Voluntary context switches: 14663
Involuntary context switches: 75
Swaps: 0
File system inputs: 0
File system outputs: 5505520
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Conclusions
- For best performance, use direct IO.
- With direct IO and 2 threads, uring write is ~20% faster than synchronous write while using about the same number of CPU cycles.
related issue: https://github.com/axboe/liburing/issues/511
May I get more details:
- Is it a buffered or direct write?
- Do you observe the CPU usage when it goes up from 4 threads to 32?
- For the synchronous writes program, does the same problem exist? (The write itself may be the bottleneck.)
Thanks. This is for buffered writes. I haven't measured CPU usage but I will do this. I've posted the benchmark output above - at highest thread concurrency uring is faster than synchronous, but it doesn't scale as threads increase, which is what I expected.
@HowHsu I have added CPU usage measurements above.
How many files are being written in parallel? There's work in progress for supporting more efficient buffered writes, but in terms of the current situation, io_uring will punt the write to io-wq and io-wq will serialize writes to a single file. That reduces concurrency to the file, but it's a necessary evil because otherwise you just end up with lots of io-wq threads that block on the file lock trying to write to it. This is most likely what you are observing as concurrency being lower than you expect.
Haven't looked at your results yet, but thanks for making a benchmark for this. Will definitely ease analyzing it and figuring out if we can improve things.
@axboe thanks, good to know. My use case is for writes to a single file. When you mention the file lock, do you mean a lock that the kernel takes when writing to the file? Or do you mean my user-space lock? Because I realized there is a way of removing my lock if I have multiple rings.
I haven't looked at your benchmark yet, was referring to the file system in the kernel locking the file for write.
But in general, NEVER share a ring if you can avoid it. Rings are meant to be cheap enough to use per-thread, and if you do that, then you don't need any synchronization for submitting/completing IO.
For this case, you'd want to share the io-wq backend between rings though, which can be done with IORING_SETUP_ATTACH_WQ when setting up subsequent rings. That will still serialize writes across the rings for the same file, if you skipped that flag and had multiple rings, then you'd get higher concurrency but also higher locking contention on the file. Something you can experiment with.
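For reference, attaching a second ring to the first ring's io-wq with IORING_SETUP_ATTACH_WQ could look roughly like this minimal sketch (not taken from the benchmark; error handling omitted):

```cpp
#include <liburing.h>
#include <cstring>

// Sketch: create the first per-thread ring normally, then attach a second
// ring to its io-wq backend so they share the async worker pool.
int setup_rings(io_uring& first_ring, io_uring& second_ring, unsigned depth) {
    if (io_uring_queue_init(depth, &first_ring, 0) < 0)
        return -1;

    io_uring_params p;
    std::memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_ATTACH_WQ;
    p.wq_fd = first_ring.ring_fd;   // fd of the ring whose io-wq we attach to
    return io_uring_queue_init_params(depth, &second_ring, &p);
}
```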
BTW, and this might be obvious, but opening the file(s) for write with O_DIRECT will eliminate the kernel locking on eg XFS at least and be a lot more efficient. io_uring won't serialize those writes either. The downside is of course that you get no caching, and that there are requirements in terms of IO buffer alignment and file offset alignment.
Thanks! I will look into multiple rings. And I do think O_DIRECT works for my use case - going to do some experimentation.
Stefan has a set of patches for buffered file writes, would be great if we could test those too. I can probably do that, since you posted the test tool. We just need the patches posted publicly for review...
@axboe fantastic! let me know if I can help in any way re testing
@axboe does O_DIRECT also have similar efficiency improvements on btrfs as well as XFS? It is the Fedora default.
I think btrfs should be fine. In later kernels it uses the iomap infrastructure for O_DIRECT like xfs does.
great, thanks!
How do you invoke your test program? Is it required to pass parameters?
If no parameters are passed, it will iterate through all concurrency levels from 2 to std::thread::hardware_concurrency(); or you can pass in the concurrency you want.
I might be missing something, but if I understand the benchmark correctly, then:
- The run procedure in uring_mt_test creates several tasks.
- Each task is defined with a lambda that calls the function tiffFormat.encodePixels().
- The bulk of the work is done in the function tiffFormat.encodePixels().
- In the function encodePixels() we first take a mutex, the mutex is only released at the end of the function.
- Basically each task is processed synchronously
- There might be some I/O parallelism for each task, but we have to wait for the slowest I/O to complete for each task.
Is this assessment correct?
@shrfb that's basically correct, except that most of the work is done in the lambda method before calling encodePixels. And, with uring, the requests go into the ring and can be executed asynchronously.
The mutex is there because libtiff is not thread safe, so only one thread can "write" to libtiff at a time. The writes to libtiff are just advancing the library offsets, but the actual write is done outside of libtiff, using uring.
I will hopefully be removing this mutex and using multiple rings.
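For illustration, the intended split between locked libtiff bookkeeping and the unlocked uring write might look like the sketch below; reserveStripOffset and TiffState are hypothetical stand-ins, not the benchmark's actual code.

```cpp
#include <liburing.h>
#include <sys/types.h>
#include <mutex>

// Hypothetical stand-ins for the benchmark's own types/helpers:
struct TiffState;                                            // wraps the libtiff handle
off_t reserveStripOffset(TiffState&, unsigned strip, unsigned len);

std::mutex tiffMutex;   // libtiff is not thread safe

// Sketch: libtiff is only touched under the lock, and only to advance its
// internal offsets; the strip data itself is written via io_uring, unlocked.
void writeStrip(io_uring& ring, int fd, TiffState& tif,
                unsigned strip, const void* buf, unsigned len) {
    off_t offset;
    {
        std::lock_guard<std::mutex> lock(tiffMutex);
        offset = reserveStripOffset(tif, strip, len);        // bookkeeping only
    }
    io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    if (!sqe)
        return;                                              // SQ full (sketch only)
    io_uring_prep_write(sqe, fd, buf, len, offset);
    io_uring_submit(&ring);                                  // data write, no lock held
}
```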
I found a bug in my benchmark code: I've updated the timings above, this time for a 6 core laptop. With the bug fixed, the uring timing is equal to or greater than the synchronous timing.
With the bug fixed, the uring timing is equal to or greater than the synchronous timing.
Can you clarify what you mean here? I can interpret this in two conflicting ways :-)
No problem: for the 6-core laptop with 5.11 and ext4, uring takes twice as long as synchronous at full concurrency. For the 24-core desktop with 5.16, btrfs, and a PCIe 4.0 NVMe SSD, uring takes ~15% less time than synchronous at full concurrency.
That's a big gap on the laptop. Granted, buffered writes aren't as efficient as they could be yet (see mention of that patchset from @shrfb); that will fix that gap, but twice as long seems... wrong. I wonder if 5.16 will do better on the laptop, or if there's something else going on here.
@axboe whoops, I got the specs reversed: the laptop has 5.17 and btrfs; the desktop has 5.11 and ext4. But, yeah, twice as long seems too big a gap - I will do some double-checking on the code.
@axboe I've fixed some bugs in the benchmark, and updated the timings for desktop and laptop.
I also managed to remove the single locked ring - now there is one ring per thread.
Timings for both synch and asynch are now roughly the same.
Still working on an O_DIRECT benchmark. Is the following an accurate assessment of how to use O_DIRECT?
Direct transfer between the user buffer and the disk can only happen if all of the following conditions are true:
1. The number of bytes transferred is a multiple of 512 bytes.
2. The file offset is a multiple of 512 bytes.
3. The user memory buffer address is aligned on a 512-byte boundary.
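For example, a minimal sketch of opening with O_DIRECT and allocating a compatible buffer could look like this (assuming 4096-byte alignment, which also satisfies the 512-byte rules above):

```cpp
#include <fcntl.h>
#include <cstdlib>

// Sketch: open for O_DIRECT writes and allocate an aligned buffer. Aligning to
// 4096 bytes (a common logical block / page size) also satisfies the 512-byte
// rules above; the real minimum depends on the device and file system.
constexpr size_t kAlign   = 4096;
constexpr size_t kBufSize = 64 * 1024;     // matches the benchmark's 64K strips

int openDirect(const char* path) {
    return open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
}

void* allocAligned() {
    void* buf = nullptr;
    if (posix_memalign(&buf, kAlign, kBufSize) != 0)
        return nullptr;                    // caller releases with free()
    return buf;
}
```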
I have enabled work queue sharing across threads. ~This errors out on 5.11, but running a 5.17 VM on my desktop 48 thread machine,~ I can now almost consistently get asynch write around 10-15% faster than synchronous write for max concurrency. This looks pretty good to me, and once I get O_DIRECT enabled, it should improve the situation.
I have adjusted the benchmark to be a truer apples-to-apples comparison between synchronous writes and uring writes.
We now have uring significantly faster than synchronous on the 48-thread system. Most importantly, the time to schedule the writes with uring is a fraction of the time to write synchronously. Next step is to enable O_DIRECT.
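To make the schedule-versus-write distinction concrete, here is a rough sketch of the two paths (not the benchmark's exact code):

```cpp
#include <liburing.h>
#include <unistd.h>

// Synchronous path: pwrite() does not return until the kernel has accepted
// the whole write, so the worker thread is blocked for the duration.
ssize_t writeSync(int fd, const void* buf, size_t len, off_t off) {
    return pwrite(fd, buf, len, off);
}

// uring path: the worker thread only pays the submission cost; the write
// proceeds asynchronously and its completion is reaped from the CQ later.
int scheduleWrite(io_uring& ring, int fd, const void* buf,
                  unsigned len, off_t off) {
    io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    if (!sqe)
        return -1;                   // SQ full: reap completions, then retry
    io_uring_prep_write(sqe, fd, buf, len, off);
    return io_uring_submit(&ring);   // returns once the SQE is handed off
}
```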
This benchmark is basically complete. It now supports synchronous and asynchronous writes, using both buffered and direct IO. Testing on writing a 2.8 GB file shows that
- synchronous and asynchronous take around the same total time
- direct IO is slower than buffered
- using a shared work queue makes no difference to performance
Conclusion 2 was surprising. I'm interested to hear of any other ways I can improve performance.
A few more interesting data points for uring writes (from the Fedora 36 / kernel 5.17 system):
- direct IO uses ~20% more wall-clock time than buffered
- direct IO uses ~50% less CPU than buffered
This benchmark is basically complete. It now supports synchronous and asynchronous writes, using both buffered and direct IO. Testing on writing a 2.8 GB file shows that
1. synchronous and asynchronous take around the same total time
2. direct IO is **slower** than buffered
Do you flush it to the drive? If the file (or most of it) fits into RAM it'd compare writing to RAM vs actual disk I/O, which is more or less useless.
Also, let me ask what was the I/O size?
Thanks @isilence. Since I close the file descriptor before measuring the timing, there is a flush to disk.
In terms of I/O size, the buffers are aligned on 512-byte boundaries and their size is 64*1024 bytes. The final 64K buffer is usually not fully written, so that write would be buffered even with O_DIRECT, but the rest of the buffers will use direct IO.
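One possible way to handle that unaligned tail, sketched below, is to write the 512-byte-aligned prefix through the O_DIRECT descriptor and the remainder through a second, buffered descriptor on the same file; this is an illustration, not necessarily what the benchmark does.

```cpp
#include <unistd.h>

// Sketch: split the last strip so the 512-byte-aligned prefix goes through the
// O_DIRECT descriptor and the short remainder through a second, buffered
// descriptor on the same file. Assumes buf and off already meet the alignment
// rules above; error handling omitted.
void writeTail(int direct_fd, int buffered_fd,
               const char* buf, size_t len, off_t off) {
    size_t aligned = len & ~static_cast<size_t>(511);   // largest multiple of 512
    if (aligned)
        pwrite(direct_fd, buf, aligned, off);
    if (len > aligned)
        pwrite(buffered_fd, buf + aligned, len - aligned,
               off + static_cast<off_t>(aligned));
}
```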