`io_uring` bandwidth with cached file

Tindarid opened this issue 2 years ago • 14 comments

I have this fio job file (the file ./data1/file8 was created beforehand by fio):

[global]
filename=./data1/file8 ; 8G file
rw=read
invalidate=0
thread
offset=0
size=100%

[init_cache]
ioengine=sync

[psync]
wait_for_previous
group_reporting
ioengine=psync
numjobs=8
offset_increment=1g
io_size=1g

[uring]
wait_for_previous
group_reporting
ioengine=io_uring
numjobs=1
fixedbufs
iodepth=128
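
For context, fixedbufs makes the io_uring engine register its I/O buffers with the kernel up front. On the liburing side that corresponds to roughly the following (a sketch only; the helper name is mine):

#include <liburing.h>
#include <sys/uio.h>

/* Rough sketch of what fixedbufs amounts to: register the I/O buffer once,
 * then issue reads against it by index instead of passing a raw pointer. */
static int fixed_buf_read(struct io_uring *ring, int fd,
                          void *buf, unsigned len, __u64 off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct io_uring_sqe *sqe;
    int ret;

    ret = io_uring_register_buffers(ring, &iov, 1);
    if (ret)
        return ret;
    sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -1;
    /* buf_index 0 refers to the buffer registered above */
    io_uring_prep_read_fixed(sqe, fd, buf, len, off, 0);
    return io_uring_submit(ring);
}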

io_uring shows high latency (as expected), but the bandwidth is much lower than that of the psync method (a thread pool of workers doing reads). For example, on my machine (disk util = 0%):

Run status group 2 (all jobs): # psync
   READ: bw=11.7GiB/s (12.6GB/s), 11.7GiB/s-11.7GiB/s (12.6GB/s-12.6GB/s), io=8192MiB (8590MB), run=684-684msec

Run status group 3 (all jobs): # uring
   READ: bw=2904MiB/s (3045MB/s), 2904MiB/s-2904MiB/s (3045MB/s-3045MB/s), io=8192MiB (8590MB), run=2821-2821msec

Increasing the number of threads in the io_uring section helps it reach about 80% of psync performance.

What am I doing wrong?

Tindarid avatar Nov 05 '21 05:11 Tindarid

What kernel are you using?

axboe avatar Nov 05 '21 05:11 axboe

> What kernel are you using?

5.13.0-20 (also tested on 5.11.*, ~same result)

Tindarid avatar Nov 05 '21 05:11 Tindarid

My guess here would be that your psync case ends up parallelizing the memory copy of the fully cached file across 8 threads, which is going to be faster than using a single ring where you essentially end up doing the memory copy inline from submit. It boils down to a memory copy benchmark, and one setup has 8 threads while the other has 1... Hence I don't think you're doing anything wrong as such, the test just isn't very meaningful.
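
Roughly, the single-ring buffered read path boils down to a loop like this (untested sketch, shown at queue depth 1 with an arbitrary 64k block size just to illustrate the point):

#include <fcntl.h>
#include <stdlib.h>
#include <liburing.h>

#define BS (64 * 1024)   /* arbitrary block size for the sketch */

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    char *buf = malloc(BS);
    int fd = open("./data1/file8", O_RDONLY);   /* path from the job file */
    __u64 off = 0;

    if (!buf || fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    for (;;) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_read(sqe, fd, buf, BS, off);
        /* for a fully cached file, the page cache copy into buf happens
         * right here, on the submitting thread */
        io_uring_submit(&ring);
        if (io_uring_wait_cqe(&ring, &cqe) < 0 || cqe->res <= 0)
            break;
        off += cqe->res;
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}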

axboe avatar Nov 05 '21 05:11 axboe

Regarding parallelism, here is how both engines scale with the number of threads:

Run status group 1 (all jobs): # psync 1 thread
   READ: bw=3195MiB/s (3350MB/s), 3195MiB/s-3195MiB/s (3350MB/s-3350MB/s), io=8192MiB (8590MB), run=2564-2564msec

Run status group 2 (all jobs): # psync 2 threads
   READ: bw=6682MiB/s (7006MB/s), 6682MiB/s-6682MiB/s (7006MB/s-7006MB/s), io=8192MiB (8590MB), run=1226-1226msec

Run status group 3 (all jobs): # psync 4 threads
   READ: bw=11.5GiB/s (12.4GB/s), 11.5GiB/s-11.5GiB/s (12.4GB/s-12.4GB/s), io=8192MiB (8590MB), run=693-693msec

Run status group 4 (all jobs): # psync 8 threads
   READ: bw=12.0GiB/s (12.9GB/s), 12.0GiB/s-12.0GiB/s (12.9GB/s-12.9GB/s), io=8192MiB (8590MB), run=668-668msec

Run status group 5 (all jobs): # uring 1 thread
   READ: bw=3035MiB/s (3183MB/s), 3035MiB/s-3035MiB/s (3183MB/s-3183MB/s), io=8192MiB (8590MB), run=2699-2699msec

Run status group 6 (all jobs): # uring 2 thread
   READ: bw=5104MiB/s (5352MB/s), 5104MiB/s-5104MiB/s (5352MB/s-5352MB/s), io=8192MiB (8590MB), run=1605-1605msec

Run status group 7 (all jobs): # uring 4 thread
   READ: bw=7256MiB/s (7608MB/s), 7256MiB/s-7256MiB/s (7608MB/s-7608MB/s), io=8192MiB (8590MB), run=1129-1129msec

Run status group 8 (all jobs): # uring 8 thread
   READ: bw=6445MiB/s (6758MB/s), 6445MiB/s-6445MiB/s (6758MB/s-6758MB/s), io=8192MiB (8590MB), run=1271-1271msec

Clarifying the question: why does psync scale better than io_uring in this case?

Tindarid avatar Nov 05 '21 05:11 Tindarid

Just ran a similar test here, changing the io_uring case above to be 8 threads of 1G each, like the psync case:

Run status group 0 (all jobs):
   READ: bw=29.3GiB/s (31.5GB/s), 29.3GiB/s-29.3GiB/s (31.5GB/s-31.5GB/s), io=8192MiB (8590MB), run=273-273msec

Run status group 1 (all jobs):
   READ: bw=29.7GiB/s (31.9GB/s), 29.7GiB/s-29.7GiB/s (31.9GB/s-31.9GB/s), io=8192MiB (8590MB), run=269-269msec

which shows about the same result; the runtime is short enough that there's a bit of variance between runs (+/- 1GB/sec either side). Group 0 is psync here, group 1 is io_uring. For apples-to-apples, I'm using iodepth=1 for the io_uring case as well. It does appear to be substantially slower to use higher queue depths for this. I haven't looked into that yet; my guess would be that we're just spending extra time filling memory entries pointlessly for that.
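
For reference, the io_uring job section I ran here mirrors the psync layout, roughly:

[uring]
wait_for_previous
group_reporting
ioengine=io_uring
numjobs=8
offset_increment=1g
io_size=1g
iodepth=1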

axboe avatar Nov 05 '21 05:11 axboe

> the test just isn't very meaningful

I am trying to replace a thread pool of workers (they only do reads) with io_uring in a database application. The old solution doesn't use O_DIRECT and has double buffering. Benchmarks on real data show that the io_uring solution loses (so I am doing something wrong). That is how my investigation ended up at this test.

Another guess: does a single-core application need a thread pool of uring instances to compete with the old solution (based on, for example, POSIX AIO)?

Tindarid avatar Nov 05 '21 06:11 Tindarid

I'll check in the morning, it's late here. Fio doesn't do proper batching either, which might be a concern. In general, you should not need a thread pool; you can mark requests as going async with IOSQE_ASYNC, and there's also logic to cap the max pending async thread count.
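
Marking a request as going async is just a flag on the sqe, along these lines (sketch; the helper name is made up):

#include <liburing.h>

/* Queue one buffered read and hint the kernel to punt it straight to the
 * async worker pool instead of attempting it inline from submit. */
static int queue_async_read(struct io_uring *ring, int fd,
                            void *buf, unsigned len, __u64 off)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return -1;
    io_uring_prep_read(sqe, fd, buf, len, off);
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
    return io_uring_submit(ring);
}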

axboe avatar Nov 05 '21 06:11 axboe

One thing that is interesting here is that if I run with iodepth=1, then I get about ~7GB/sec of bandwidth from one thread, but when I run with iodepth=128, then I get only 3GB/sec of bandwidth. Looking at profiles, the fast case spends ~13% of the time doing memory copies, and the slow case uses ~55%. That doesn't make a lot of sense! The higher queue depth case should spend the same time doing copies, just reaping the benefits of the batched submits.

The theory here is that the total memory range used is one page for the qd=1 case, and it's 128 pages for the qd=128 case. That just falls out of cache. That's simply an artifact of the CPU, it's not really an io_uring thing. If I hacked fio to use the same buffer for all requests, I bet the 128 case would be faster than the qd=1 case.
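
On the liburing side, that hack would look something like this (sketch only, ignoring the fio engine plumbing; assumes the ring was set up with at least 128 entries):

#include <liburing.h>

#define QD 128

/* Experiment sketch: prep a full batch of reads that all land in the same
 * destination buffer, submit once, then reap. Not useful I/O, just a way to
 * keep the copy destination cache-hot at high queue depth. */
static int batch_read_same_buf(struct io_uring *ring, int fd,
                               void *buf, unsigned bs, __u64 off)
{
    struct io_uring_cqe *cqe;
    int i, ret;

    for (i = 0; i < QD; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            break;
        /* every request copies into the same buffer */
        io_uring_prep_read(sqe, fd, buf, bs, off + (__u64)i * bs);
    }
    ret = io_uring_submit(ring);
    for (i = 0; i < ret; i++) {
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            break;
        io_uring_cqe_seen(ring, cqe);
    }
    return ret;
}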

Anyway, that's the theory. I'll dig into this and see what I can find.

axboe avatar Nov 05 '21 16:11 axboe

Thank you.

I tried nowait, force_async, and played with iodepth, but the bandwidth only degrades (in this configuration). Maybe it really is a processor cache issue, but I haven't managed to find the best parameters for it: with iodepth=1 I get < 1GB/s, with iodepth=128 I get 3GB/s.
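
The variations were along these lines (roughly; just the engine options added to the uring section):

[uring]
wait_for_previous
group_reporting
ioengine=io_uring
numjobs=1
fixedbufs
iodepth=128
nowait=1
force_async=1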

Tindarid avatar Nov 05 '21 17:11 Tindarid

shmhuge really helps alleviate the pressure, but I think what we really need here is for the ring sqe/cqe maps to be in a huge page... That'll likely be a nice win overall too. Looking into it.

axboe avatar Nov 05 '21 17:11 axboe

Ran the "always copy to the same page" case for QD=128, and it didn't change anything. Puzzled, maybe this is tlb pressure? So I added iomem=shmhuge to use a huge page as backing for the job, and now the QD=128 job runs in ~10GB/sec and the QD=1 runs in ~7.5GB/sec. That's a lot more inline with what I'd expect. We're saving some time on being able to do a bunch of ios in the same syscall, and the that just yields more time to run the copy and hence a higher performance.

axboe avatar Nov 05 '21 17:11 axboe

I've added kernel support for using a single huge page for the rings, which should cut down on the TLB pressure that I think is what's killing us in this test. I'll re-run tests on Monday with that. liburing support also exists in the 'huge' branch. Note that both of these are pretty experimental; I literally just started on the kernel side late yesterday afternoon and did the liburing changes this morning.

axboe avatar Nov 06 '21 19:11 axboe

Can you try with iomem=shmhuge added to your fio job file? Curious what kind of difference you'd see with it.
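
i.e. your uring section with one extra line, something like:

[uring]
wait_for_previous
group_reporting
ioengine=io_uring
numjobs=1
fixedbufs
iodepth=128
iomem=shmhuge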

axboe avatar Nov 08 '21 22:11 axboe

> Can you try with iomem=shmhuge added to your fio job file? Curious what kind of difference you'd see with it.

No change at all, with both threaded and non-threaded io_uring, and with psync.

Tindarid avatar Nov 09 '21 13:11 Tindarid