
My benchmark for multi-threaded writes

Open boxerab opened this issue 3 years ago • 41 comments

I've created a benchmark program to test my particular workflow: multi-threaded writes to generate a TIFF file. It's a CMake project that is easy to build and run.

The benchmark divides an image into M strips and uses a thread pool to do some work on each strip and then write the strip to libtiff. Each thread manages its own ring - the work queue is shared. For comparison, the benchmark also supports synchronous writes.
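
Roughly, each worker thread does something like the following (a simplified sketch, not the benchmark's actual code; the queue depth and names are illustrative, and error handling is trimmed):

```cpp
#include <liburing.h>
#include <cstdio>
#include <cstring>

// Each worker thread owns one ring, created once, and queues its strip writes
// on it. Simplified to one write in flight at a time.
struct Worker {
    io_uring ring;

    Worker()  { io_uring_queue_init(64, &ring, 0); }
    ~Worker() { io_uring_queue_exit(&ring); }

    void writeStrip(int fd, const void *buf, size_t len, off_t off) {
        io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, len, off);
        io_uring_submit(&ring);

        io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);          // reap the completion
        if (cqe->res < 0)
            std::fprintf(stderr, "write failed: %s\n", std::strerror(-cqe->res));
        io_uring_cqe_seen(&ring, cqe);
    }
};
```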

Here are some outputs from /usr/bin/time for various IO configurations when writing a 55000x88000 TIFF file on the following system:

Fedora 36 desktop: 5.17 kernel, XFS file system, 48-thread Threadripper with a PCIe 4 NVMe SSD, and the latest liburing.

URING

  1. uring write with direct IO (2 threads)
Run with concurrency = 2, store to disk = 1, direct = 1, use uring = 1
 : 714.831964 ms
	Command being timed: "./iobench -c 2 -d -f /run/media/xfs/io_out.tif"
	User time (seconds): 0.91
	System time (seconds): 0.60
	Percent of CPU this job got: 194%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.78
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 741164
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 184445
	Voluntary context switches: 3282
	Involuntary context switches: 7
	Swaps: 0
	File system inputs: 25
	File system outputs: 5500888
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
  2. uring write with direct IO (48 threads)
Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 1
 : 892.141297 ms
	Command being timed: "./iobench -c 0 -d -f /run/media/xfs/out.tif"
	User time (seconds): 1.92
	System time (seconds): 6.98
	Percent of CPU this job got: 915%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1367200
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 2
	Minor (reclaiming a frame) page faults: 342316
	Voluntary context switches: 184711
	Involuntary context switches: 654
	Swaps: 0
	File system inputs: 176
	File system outputs: 5500888
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
  3. uring write with buffered IO (48 threads)
Run with concurrency = 48, store to disk = 1, direct = 0, use uring = 1
 : 1779.148543 ms
	Command being timed: "./iobench -c 0 -f /run/media/xfs/out.tif"
	User time (seconds): 2.05
	System time (seconds): 31.08
	Percent of CPU this job got: 1855%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.78
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2401952
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 601281
	Voluntary context switches: 47107
	Involuntary context switches: 2326
	Swaps: 0
	File system inputs: 0
	File system outputs: 5500880
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

SYNCHRONOUS

  1. synchronous write with direct IO (2 threads)
Run with concurrency = 2, store to disk = 1, direct = 1, use uring = 0
 : 996.828683 ms
	Command being timed: "./iobench -c 2 -s -d -f /run/media/xfs/io_out.tif"
	User time (seconds): 0.93
	System time (seconds): 0.23
	Percent of CPU this job got: 115%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.01
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 52896
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 12287
	Voluntary context switches: 1617
	Involuntary context switches: 162
	Swaps: 0
	File system inputs: 33
	File system outputs: 5500888
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
  2. synchronous write with direct IO (48 threads)
Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 0
 : 945.703310 ms
	Command being timed: "./iobench -c 0 -d -s -f /run/media/xfs/out.tif"
	User time (seconds): 1.92
	System time (seconds): 1.34
	Percent of CPU this job got: 294%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.11
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 181516
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 44475
	Voluntary context switches: 7786
	Involuntary context switches: 216
	Swaps: 0
	File system inputs: 8
	File system outputs: 5500888
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
  3. synchronous write with buffered IO (48 threads)
Run with concurrency = 48, store to disk = 1, direct = 0, use uring = 0
 : 1713.267466 ms
	Command being timed: "./iobench -c 0 -s -f /run/media/xfs/out.tif"
	User time (seconds): 1.61
	System time (seconds): 30.54
	Percent of CPU this job got: 1870%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.71
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 131544
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 33527
	Voluntary context switches: 14663
	Involuntary context switches: 75
	Swaps: 0
	File system inputs: 0
	File system outputs: 5505520
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Conclusions

  1. For best performance, use direct IO.
  2. With direct IO using 2 threads, uring write is ~20% faster than synchronous write while using about the same number of CPU cycles.

boxerab avatar Apr 20 '22 22:04 boxerab

related issue: https://github.com/axboe/liburing/issues/511

boxerab avatar Apr 20 '22 22:04 boxerab

May I get more details:

  1. Is it a buffered or direct write?
  2. Did you observe the CPU usage when the thread count goes up from 4 to 32?
  3. For the synchronous write program, does the same problem exist? (The write itself may be the bottleneck.)

HowHsu avatar Apr 22 '22 09:04 HowHsu

May I get more details:

  1. Is it a buffered or direct write?
  2. Did you observe the CPU usage when the thread count goes up from 4 to 32?
  3. For the synchronous write program, does the same problem exist? (The write itself may be the bottleneck.)

Thanks. This is for buffered writes. I haven't measured CPU usage, but I will do this. I've posted the benchmark output above: at the highest thread concurrency, uring is faster than synchronous, but it doesn't scale as the thread count increases, which is what I expected.

boxerab avatar Apr 22 '22 11:04 boxerab

@HowHsu I have added CPU usage measurements above.

boxerab avatar Apr 22 '22 12:04 boxerab

How many files are being written in parallel? There's work in progress for supporting more efficient buffered writes, but in terms of the current situation, io_uring will punt the write to io-wq and io-wq will serialize writes to a single file. That reduces concurrency to the file, but it's a necessary evil because otherwise you just end up with lots of io-wq threads that block on the file lock trying to write to it. This is most likely what you are observing as concurrency being lower than you expect.

Haven't looked at your results yet, but thanks for making a benchmark for this. Will definitely ease analyzing it and figuring out if we can improve things.

axboe avatar Apr 22 '22 12:04 axboe

@axboe thanks, good to know. My use case is writes to a single file. When you mention a file lock, do you mean a lock that the kernel takes when writing to the file, or do you mean my user-space lock? Because I realized there is a way of removing my lock if I have multiple rings.

boxerab avatar Apr 22 '22 12:04 boxerab

I haven't looked at your benchmark yet, was referring to the file system in the kernel locking the file for write.

But in general, NEVER share a ring if you can avoid it. Rings are meant to be cheap enough to use per-thread, and if you do that, then you don't need any synchronization for submitting/completing IO.

For this case, you'd want to share the io-wq backend between rings though, which can be done with IORING_SETUP_ATTACH_WQ when setting up subsequent rings. That will still serialize writes across the rings for the same file, if you skipped that flag and had multiple rings, then you'd get higher concurrency but also higher locking contention on the file. Something you can experiment with.

axboe avatar Apr 22 '22 12:04 axboe
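
For reference, attaching later per-thread rings to the first ring's io-wq backend would look roughly like this (a sketch assuming liburing's io_uring_queue_init_params(); the function name is illustrative):

```cpp
#include <liburing.h>
#include <cstring>

// The first ring owns the io-wq backend; each subsequent per-thread ring
// attaches to it via IORING_SETUP_ATTACH_WQ so they all share one worker pool.
int setupAttachedRing(io_uring *first, io_uring *ring, unsigned entries) {
    io_uring_params p;
    std::memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_ATTACH_WQ;
    p.wq_fd = first->ring_fd;        // fd of the already-created ring
    return io_uring_queue_init_params(entries, ring, &p);
}
```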

BTW, and this might be obvious, but opening the file(s) for write with O_DIRECT will eliminate the kernel locking on eg XFS at least and be a lot more efficient. io_uring won't serialize those writes either. The downside is of course that you get no caching, and that there are requirements in terms of IO buffer alignment and file offset alignment.

axboe avatar Apr 22 '22 12:04 axboe

Thanks! I will look into multiple rings. And I do think O_DIRECT works for my use case - going to do some experimentation.

boxerab avatar Apr 22 '22 13:04 boxerab

Stefan has a set of patches for buffered file writes, would be great if we could test those too. I can probably do that, since you posted the test tool. We just need the patches posted publicly for review...

axboe avatar Apr 22 '22 13:04 axboe

@axboe fantastic! let me know if I can help in any way re testing

boxerab avatar Apr 22 '22 13:04 boxerab

@axboe does O_DIRECT also have similar efficiency improvements on btrfs as well as XFS, since btrfs is the Fedora default?

boxerab avatar Apr 22 '22 14:04 boxerab

I think btrfs should be fine. In later kernels it uses the iomap infrastructure for O_DIRECT like xfs does.

axboe avatar Apr 22 '22 14:04 axboe

great, thanks!

boxerab avatar Apr 22 '22 15:04 boxerab

How do you invoke your test program? Is it required to pass parameters?

shrfb avatar Apr 22 '22 17:04 shrfb

How do you invoke your test program? Is it required to pass parameters?

If no parameters are given, it will iterate through all concurrency levels from 2 to std::thread::hardware_concurrency(); or you can pass in the concurrency you want.

boxerab avatar Apr 22 '22 18:04 boxerab

I might miss something, but if I understand the benchmark correctly, then:

  • the run procedure in uring_mt_test creates several tasks.
  • Each task is defined with a lambda that calls the function tiffFormat.encodePixels().
  • The bulk of the work is done in the function tiffFormat.encodePixels().
  • In the function encodePixels() we first take a mutex, which is only released at the end of the function.
    • Basically each task is processed synchronously
    • There might be some I/O parallelism for each task, but
    • we have to wait for the slowest I/O to complete for each task.

Is this assessment correct?

shrfb avatar Apr 22 '22 20:04 shrfb

@shrfb that's basically correct, except that most of the work is done in the lambda before calling encodePixels(). And, with uring, the requests go into the ring and can be executed asynchronously.

The mutex is there because libtiff is not thread safe, so only one thread can "write" to libtiff at a time. The writes to libtiff are just advancing the library offsets, but the actual write is done outside of libtiff, using uring.

I will hopefully be removing this mutex and using multiple rings.

boxerab avatar Apr 22 '22 21:04 boxerab
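
The pattern described above would look roughly like this (a hypothetical sketch, not the benchmark's actual code; reserveStripOffset() is an illustrative stand-in for the serialized libtiff bookkeeping):

```cpp
#include <liburing.h>
#include <sys/types.h>
#include <mutex>

std::mutex tiffMutex;  // libtiff is not thread safe, so its bookkeeping is serialized

// Illustrative stand-in for the libtiff bookkeeping: advance a running file
// offset for one strip and return where that strip should land.
off_t reserveStripOffset(size_t stripLen) {
    static off_t next = 0;
    off_t off = next;
    next += static_cast<off_t>(stripLen);
    return off;
}

void encodeAndWriteStrip(io_uring *ring, int fd, const void *strip, size_t len) {
    off_t off;
    {
        std::lock_guard<std::mutex> lock(tiffMutex);  // only the offset bookkeeping is locked
        off = reserveStripOffset(len);
    }
    // The actual write bypasses libtiff and goes through this thread's own ring.
    io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, strip, len, off);
    io_uring_submit(ring);
}
```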

I found a bug in my benchmark code: I've updated the timings above, this time for a 6 core laptop. With the bug fixed, the uring timing is equal to or greater than the synchronous timing.

boxerab avatar Apr 25 '22 02:04 boxerab

With the bug fixed, the uring timing is equal to or greater than the synchronous timing.

Can you clarify what you mean here? I can interpret this in two conflicting ways :-)

axboe avatar Apr 25 '22 21:04 axboe

With the bug fixed, the uring timing is equal to or greater than the synchronous timing.

Can you clarify what you mean here? I can interpret this in two conflicting ways :-)

No problem: for the 6-core laptop with 5.11 and ext4, uring takes twice as long as synchronous at full concurrency. For the 24-core desktop with 5.16, btrfs and a PCIe 4 NVMe SSD, uring takes ~15% less time than synchronous at full concurrency.

boxerab avatar Apr 25 '22 21:04 boxerab

That's a big gap on the laptop. Granted, buffered writes aren't as efficient as they could be yet (see the mention of that patchset from @shrfb; it will fix that gap), but twice as long seems... wrong. I wonder if 5.16 will do better on the laptop, or if there's something else going on here.

axboe avatar Apr 25 '22 21:04 axboe

@axboe whoops, I got the specs reversed: the laptop has 5.17 and btrfs; the desktop has 5.11 and ext4. But, yeah, twice as long seems too big a gap. I will do some double-checking on the code.

boxerab avatar Apr 25 '22 22:04 boxerab

@axboe I've fixed some bugs in the benchmark and updated the timings for desktop and laptop. I also managed to remove the single locked ring; now there is one ring per thread. Timings for both synch and asynch are now roughly the same. Still working on an O_DIRECT benchmark. Is the following an accurate assessment of how to use O_DIRECT?

Direct transfer between the user buffer and the disk can only happen if all of the following conditions are true:

1. The number of bytes transferred is a multiple of 512 bytes.
2. The file offset is a multiple of 512 bytes.
3. The user memory buffer address is aligned on a 512-byte boundary.

boxerab avatar May 01 '22 00:05 boxerab
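
For reference, under the 512-byte assumptions above, the O_DIRECT setup would look roughly like this sketch (helper names are illustrative; on some devices/filesystems the required alignment is the logical block size, which can be larger):

```cpp
#include <fcntl.h>   // O_DIRECT requires _GNU_SOURCE; g++ on Linux defines it by default
#include <unistd.h>
#include <cstdlib>

// Open the output file for direct IO and allocate a 512-byte-aligned buffer.
// Buffer address, transfer size and file offset all have to respect the
// (assumed) 512-byte alignment for the transfer to stay direct.
int openDirect(const char *path) {
    return open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
}

void *allocAligned(size_t len) {
    void *buf = nullptr;
    if (posix_memalign(&buf, 512, len) != 0)   // 512-byte-aligned allocation
        return nullptr;
    return buf;
}
```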

I have enabled work queue sharing across threads. ~This errors out on 5.11, but running a 5.17 VM on my desktop 48 thread machine,~ I can now almost consistently get asynch write around 10-15% faster than synchronous write for max concurrency. This looks pretty good to me, and once I get O_DIRECT enabled, it should improve the situation.

boxerab avatar May 01 '22 14:05 boxerab

I have adjusted the benchmark to be a truer apples-to-apples comparison between synchronous writes and uring writes. We now have uring significantly faster than synchronous on the 48-thread system. Most importantly, the time to schedule the writes with uring is a fraction of the time to write synchronously. Next step is to enable O_DIRECT.

boxerab avatar May 04 '22 12:05 boxerab

This benchmark is basically complete. It now supports synchronous and asynchronous writes, using both buffered and direct IO. Testing on writing a 2.8 GB file shows that

  1. synchronous and asynchronous take around the same total time
  2. direct IO is slower than buffered
  3. using a shared work queue makes no difference to performance

Conclusion 2 was surprising. I'm interested to hear of any other ways I can improve performance.

boxerab avatar May 13 '22 03:05 boxerab

A few more interesting data points for uring writes (from the Fedora 36 / kernel 5.17 system):

  1. direct IO uses ~20% more wall-clock time than buffered
  2. direct IO uses ~50% less CPU than buffered

boxerab avatar May 15 '22 16:05 boxerab

This benchmark is basically complete. It now supports synchronous and asynchronous writes, using both buffered and direct IO. Testing on writing a 2.8 GB file shows that

1. synchronous and asynchronous take around the same total time

2. direct IO is **slower** than buffered

Do you flush it to the drive? If the file (or most of it) fits into RAM, it'd compare writing to RAM vs actual disk I/O, which is more or less useless.

Also, let me ask: what was the I/O size?

isilence avatar May 15 '22 16:05 isilence

This benchmark is basically complete. It now supports synchronous and asynchronous writes, using both buffered and direct IO. Testing on writing a 2.8 GB file shows that

1. synchronous and asynchronous take around the same total time

2. direct IO is **slower** than buffered

Do you flush it to the drive? If the file (or most of it) fits into RAM, it'd compare writing to RAM vs actual disk I/O, which is more or less useless.

Also, let me ask: what was the I/O size?

Thanks @isilence. Since I close the file descriptor before measuring the timing, there is a flush to disk.

In terms of I/O size, the buffers are aligned on 512-byte boundaries and their size is 64*1024 bytes. The final 64K buffer is usually not fully written, so that one would be a buffered write even with O_DIRECT, but the rest of the buffers will use direct.

boxerab avatar May 15 '22 16:05 boxerab
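
For reference, the check for whether a given write can take the direct path reduces to something like this sketch (assumes the 512-byte alignment discussed above; qualifiesForDirect() is an illustrative name):

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/types.h>

// A write can stay O_DIRECT only if buffer address, length and file offset are
// all multiples of the (assumed) 512-byte block size. The final, partially
// filled 64K strip buffer usually fails the length test and falls back to the
// buffered path.
bool qualifiesForDirect(const void *buf, size_t len, off_t off) {
    constexpr size_t kAlign = 512;
    return reinterpret_cast<uintptr_t>(buf) % kAlign == 0 &&
           len % kAlign == 0 &&
           static_cast<uint64_t>(off) % kAlign == 0;
}
```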