
Horrible write IO performance on older kernels with the default of --poll-aio=1

Open StephanDollberg opened this issue 8 months ago • 37 comments

Hi team,

We are seeing abysmal write IO performance on older kernels. Specifically, this affects at least versions up to and including 4.18, which is what RHEL 8 uses and hence is unfortunately still quite widespread.

The issue can be shown easily using iotune. All tests below are run on an AWS i3en.xlarge instance. Note that I am using a patched iotune version that allows manually setting the io depth (PR).

Default config:

[ec2-user@rhel8 ~]$ sudo /opt/redpanda/bin/iotune-redpanda --evaluation-dir /mnt/vectorized/redpanda/ --force-io-depth 1 -c 1
INFO  2025-03-25 12:08:40,782 seastar - Reactor backend: epoll
WARN  2025-03-25 12:08:40,789 seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
INFO  2025-03-25 12:08:40,838 [shard 0:main] iotune - /mnt/vectorized/redpanda/ passed sanity checks
INFO  2025-03-25 12:08:40,839 [shard 0:main] iotune - Disk parameters: max_iodepth=1 disks_per_array=1 minimum_io_size=512
INFO  2025-03-25 12:08:40,840 [shard 0:main] iotune - Filesystem parameters: read alignment 512, write alignment 4096
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 314 MB/s (deviation 4%)
Measuring sequential read bandwidth: 710 MB/s (deviation 42%)
Measuring random write IOPS: 91 IOPS
Measuring random read IOPS: 11190 IOPS

Out of the box we are getting a measly 91 IOPS!

Comparing this to "newer" kernels such as 5.14 on RHEL9 we see no such issue:

[ec2-user@rhel9 ~]$ sudo /opt/redpanda/bin/iotune-redpanda --evaluation-dir /mnt/vectorized/redpanda/ --force-io-depth 1 -c 1
INFO  2025-03-25 12:09:32,538 seastar - Reactor backend: linux-aio
INFO  2025-03-25 12:09:32,596 [shard 0:main] iotune - /mnt/vectorized/redpanda/ passed sanity checks
INFO  2025-03-25 12:09:32,596 [shard 0:main] iotune - Disk parameters: max_iodepth=1 disks_per_array=1 minimum_io_size=512
INFO  2025-03-25 12:09:32,596 [shard 0:main] iotune - Filesystem parameters: read alignment 512, write alignment 4096
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 317 MB/s (deviation 18%)
Measuring sequential read bandwidth: 714 MB/s (deviation 41%)
Measuring random write IOPS: 36488 IOPS
Measuring random read IOPS: 6393 IOPS (deviation 4%)

We reach a decent 36k IOPS (interestingly, read performance seems worse, but that's for another time).

Playing around with some reactor options back on RHEL 8, we see that the main culprit seems to be --poll-aio=1, which is on by default. Turning it off yields a massive improvement of more than an order of magnitude:

[ec2-user@rhel8 ~]$ sudo /opt/redpanda/bin/iotune-redpanda --evaluation-dir /mnt/vectorized/redpanda/ --force-io-depth 1 -c 1 --poll-aio=0
INFO  2025-03-25 12:13:39,722 seastar - Reactor backend: epoll
WARN  2025-03-25 12:13:39,728 seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
INFO  2025-03-25 12:13:39,777 [shard 0:main] iotune - /mnt/vectorized/redpanda/ passed sanity checks
INFO  2025-03-25 12:13:39,778 [shard 0:main] iotune - Disk parameters: max_iodepth=1 disks_per_array=1 minimum_io_size=512
INFO  2025-03-25 12:13:39,779 [shard 0:main] iotune - Filesystem parameters: read alignment 512, write alignment 4096
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 314 MB/s (deviation 13%)
Measuring sequential read bandwidth: 706 MB/s (deviation 41%)
Measuring random write IOPS: 4493 IOPS
Measuring random read IOPS: 10932 IOPS

Further setting --idle-poll-time-us=0 unlocks another ~5x improvement:

[ec2-user@rhel8 ~]$ sudo /opt/redpanda/bin/iotune-redpanda --evaluation-dir /mnt/vectorized/redpanda/ --force-io-depth 1 -c 1 --poll-aio=0 --idle-poll-time-us=0
INFO  2025-03-25 12:21:41,005 seastar - Reactor backend: epoll
WARN  2025-03-25 12:21:41,012 seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
INFO  2025-03-25 12:21:41,061 [shard 0:main] iotune - /mnt/vectorized/redpanda/ passed sanity checks
INFO  2025-03-25 12:21:41,061 [shard 0:main] iotune - Disk parameters: max_iodepth=1 disks_per_array=1 minimum_io_size=512
INFO  2025-03-25 12:21:41,061 [shard 0:main] iotune - Filesystem parameters: read alignment 512, write alignment 4096
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 314 MB/s (deviation 13%)
Measuring sequential read bandwidth: 706 MB/s (deviation 40%)
Measuring random write IOPS: 27503 IOPS
Measuring random read IOPS: 9858 IOPS

This is now closer to what we see on newer kernels. Note that setting only idle-poll-time-us shows no improvement over the default, nor does disabling thread-affinity.

It's possible to hide this issue by increasing io-depth:

[ec2-user@rhel8 ~]$ sudo /opt/redpanda/bin/iotune-redpanda --evaluation-dir /mnt/vectorized/redpanda/ --force-io-depth 128 -c 1
INFO  2025-03-25 12:41:42,222 seastar - Reactor backend: epoll
WARN  2025-03-25 12:41:42,228 seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
INFO  2025-03-25 12:41:42,277 [shard 0:main] iotune - /mnt/vectorized/redpanda/ passed sanity checks
INFO  2025-03-25 12:41:42,278 [shard 0:main] iotune - Disk parameters: max_iodepth=128 disks_per_array=1 minimum_io_size=512
INFO  2025-03-25 12:41:42,279 [shard 0:main] iotune - Filesystem parameters: read alignment 512, write alignment 4096
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 314 MB/s (deviation 4%)
Measuring sequential read bandwidth: 707 MB/s (deviation 41%)
Measuring random write IOPS: 11661 IOPS
Measuring random read IOPS: 96368 IOPS (deviation 39%)

but even at an io-depth of something like 128 we still don't reach the performance of newer kernels, and at lower throughputs it's often difficult to reach such a high io depth in practice. Given that this is fixed in newer kernels I am not going to spend much time investigating this further, but I suspect the spinning is starving the dio thread or something similar (this is something we have seen in other scenarios).

Given that RHEL 8 will likely remain prevalent for quite some time, I am wondering whether we should do a kernel check, similar to the existing ones, that disables poll-aio on these older versions?

StephanDollberg avatar Mar 25 '25 14:03 StephanDollberg

There's no DIO thread. Probably you're using an unqualified filesystem (or unqualified for your workload). For example size-changing writes are challenging for aio.

Try fsqual on that kernel/filesystem combination.

avikivity avatar Mar 25 '25 15:03 avikivity

Sorry, I didn't mention: this is all on XFS, which yes falls into the whole append-challenged category.

fsqual gives the same results on rhel8 and rhel9.

StephanDollberg avatar Mar 25 '25 15:03 StephanDollberg

There's no DIO thread.

Stephan is referring to the kernel dio worker, spawned when doing aio direct IO against a filesystem (to handle completions, etc.).

travisdowns avatar Mar 25 '25 17:03 travisdowns

aio works without any thread normally.

avikivity avatar Mar 25 '25 18:03 avikivity

aio works without any thread normally.

There's a dio worker needed to handle write completions. Jens describes it (and removes it in some scenarios, but this is very recent and I don't think it affects libaio) here: https://lwn.net/Articles/937997/

Run any heavy write workload using aio and you can see the dio workers use non-trivial CPU.

travisdowns avatar Mar 26 '25 03:03 travisdowns

@avikivity - let's let the DIO worker/thread thing go for now: it was only an example of something that might be going wrong.

Can you please give the OP a fresh look? This is the most vanilla workload you can imagine: Seastar's own iotune tool on the "best FS" for libaio, XFS. So we are not using random stuff here: I have no doubt the same issue arises for anyone running ScyllaDB on RHEL8 too.

It's not just a little bit bad: it's 91 iodepth=1 IOPS on a disk that normally does > 10,000 no problem. That's like transforming your SSD into a HDD and getting nothing in return.

Stephan shows it's directly related to hot polling.

travisdowns avatar Mar 26 '25 17:03 travisdowns

I don't have a clue. 91 IOPS = 10ms latency if everything is serialized. I've never seen anything like it. Suggest using kernel-level tools to understand.

avikivity avatar Mar 26 '25 17:03 avikivity

91 IOPS = 10ms latency if everything is serialized.

Yes exactly.

Suggest using kernel-level tools to understand.

Given the limited but not zero use of RHEL8 in practice, and that this issue seems relegated to these old kernels/distros, we'd rather not go down the rabbit hole all the way: instead Stephan has proposed and tested a fix for these old kernels: eliminate hot polling by default.

travisdowns avatar Mar 26 '25 20:03 travisdowns

This seems to be specifically caused by the usage of an older version of tuned on rhel8.

On AWS at least it uses the virtual-guest profile by default. That inherits most of its values from throughput-performance which I think is a fairly default one.

The key thing it does is to set /proc/sys/kernel/sched_wakeup_granularity_ns to 15000000 which then causes the bad performance described above.

In fact the same can be reproduced on newer kernels by setting that value to the same. I haven't tried on EEVDF/6.6+ yet.

RHEL 9 is not affected as newer versions of tuned remove the explicit setting of the scheduler values: https://github.com/redhat-performance/tuned/commit/c6d6fdcc4c944df9998e0ebe75f31cc8aed452c1

So probably disabling poll-aio depending on the kernel version is not a good idea.

StephanDollberg avatar Mar 28 '25 10:03 StephanDollberg

91 IOPS = 10ms latency if everything is serialized.

Yes exactly.

Suggest using kernel-level tools to understand.

Given the limited but not zero use of RHEL8 in practice, and that this issue seems relegated to these old kernels/distros, we'd rather not go down the rabbit hole all the way: instead Stephan has proposed and tested a fix for these old kernels: eliminate hot polling by default.

I don't want to merge something without understanding it.

avikivity avatar Mar 31 '25 15:03 avikivity

This seems to be specifically caused by the usage of an older version of tuned on rhel8.

On AWS at least it uses the virtual-guest profile by default. That inherits most of its values from throughput-performance which I think is a fairly default one.

The key thing it does is to set /proc/sys/kernel/sched_wakeup_granularity_ns to 15000000 which then causes the bad performance described above.

In fact the same can be reproduced on newer kernels by setting that value to the same. I haven't tried on EEVDF/6.6+ yet.

RHEL 9 is not affected as newer versions of tuned remove the explicit setting of the scheduler values: redhat-performance/tuned@c6d6fdc

So probably disabling poll-aio depending on the kernel version is not a good idea.

Aha. Perhaps Seastar can warn on this bad configuration.

It does explain your results.

avikivity avatar Mar 31 '25 15:03 avikivity

Note ScyllaDB explicitly tunes this variable, and rejects the tuned package.

avikivity avatar Mar 31 '25 15:03 avikivity

What's an EEVDF/6.6+?

avikivity avatar Mar 31 '25 15:03 avikivity

It does explain your results.

I can now further confirm that it is indeed the starving of the dio kworker thread that is causing this. Renicing it to something higher than the seastar process does give the expected performance.

Perhaps Seastar can warn on this bad configuration.

Yes, we are potentially even considering disabling poll-aio depending on whether the value is set. It's hard to come up with a generic check, though it might be fine to just compare against the specific values that tuned sets, as newer versions don't set it at all anymore.
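
For illustration, a minimal sketch of what such a check could look like (not Seastar's actual code; the helper name, the threshold and the fallback behaviour are assumptions based on the tuned value discussed above):

```cpp
#include <fstream>
#include <iostream>
#include <optional>

// Hypothetical helper: read /proc/sys/kernel/sched_wakeup_granularity_ns if it
// exists (newer kernels no longer expose it there, so absence is fine).
static std::optional<long> read_wakeup_granularity_ns() {
    std::ifstream f("/proc/sys/kernel/sched_wakeup_granularity_ns");
    long value;
    if (f >> value) {
        return value;
    }
    return std::nullopt;
}

int main() {
    // 15'000'000 ns is what the old tuned throughput-performance profile sets;
    // anything in that range lets a spinning reactor starve the dio kworker.
    constexpr long suspicious_ns = 15'000'000;
    if (auto v = read_wakeup_granularity_ns(); v && *v >= suspicious_ns) {
        std::cerr << "sched_wakeup_granularity_ns=" << *v
                  << " is very high; consider lowering it or running with --poll-aio=0\n";
        // A reactor could also flip its poll-aio default here instead of only warning.
    }
    return 0;
}
```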

Note ScyllaDB explicitly tunes this variable, and rejects the tuned package.

Ah, I see scylla_tune_sched now and also that you straight out uninstall/obsolete tuned with the scylla package. I am wondering how you solve this for k8s where scylla will get installed inside the pods and you don't really have any influence over the host?

What's an EEVDF/6.6+?

It's the new Linux Earliest Eligible Virtual Deadline First scheduler that ships with 6.6+. I see that your tuning script already adjusts the tunables for that as well.

StephanDollberg avatar Mar 31 '25 18:03 StephanDollberg

It does explain your results.

I can now further confirm that it is indeed the starving of the dio kworker thread that is causing this. Renicing it to something higher than the seastar process does give the expected performance.

I don't know what this dio kworker thread is. The expectation is that aio does not require any thread to work.

Maybe things changed since I last looked at it deeply.

And I think they must have - IIRC we tuned the scheduler so that the networking code (that does run in a thread) doesn't stall userspace.

Perhaps Seastar can warn on this bad configuration.

Yes, we are potentially even considering disabling poll-aio depending on whether the value is set. It's hard to come up with a generic check, though it might be fine to just compare against the specific values that tuned sets, as newer versions don't set it at all anymore.

Note ScyllaDB explicitly tunes this variable, and rejects the tuned package.

Ah, I see scylla_tune_sched now and also that you straight out uninstall/obsolete tuned with the scylla package. I am wondering how you solve this for k8s where scylla will get installed inside the pods and you don't really have any influence over the host?

We basically hope that nothing bad happens. Kubernetes is really bad for something like Seastar if you don't control the host.

What's an EEVDF/6.6+?

It's the new Linux Earliest Eligible Virtual Deadline First scheduler that ships with 6.6+. I see that your tuning script already adjusts the tunables for that as well.

Wow, yet another thing that passed me by.

avikivity avatar Mar 31 '25 18:03 avikivity

What's an EEVDF/6.6+?

It's the new Linux Earliest Eligible Virtual Deadline First scheduler that ships with 6.6+. I see that your tuning script already adjusts the tunables for that as well.

Wow, yet another thing that passed me by.

I wonder if we can learn something from it for the Seastar scheduler (which was based on my incomplete understanding of CFS). Though I'm not aware of weaknesses in CPU scheduling.

avikivity avatar Mar 31 '25 18:03 avikivity

It does explain your results.

I can now further confirm that it is indeed the starving of the dio kworker thread that is causing this. Renicing it to something higher than the seastar process does give the expected performance.

I don't know what this dio kworker thread is. The expectation is that aio does not require any thread to work.

Maybe things changed since I last looked at it deeply.

To explain more: aio/dio works by looking up the file->disk mapping from memory (and stalling if it's not there, but it always is), then issuing the I/O, then responding to the disk completion interrupt and writing the aio completion entry from the interrupt handler, and waking up the reactor thread (if it was sleeping), again from the interrupt handler.

Maybe something changed in aio, or the interrupts are processed in a (non-realtime?!) thread.

avikivity avatar Mar 31 '25 19:03 avikivity

I see that EEVDF now respects the scheduler parameters in sched_setattr: https://lwn.net/ml/linux-kernel/[email protected]/

So perhaps we can call this on our own threads (but I don't know if it helps when there are kernel threads that inherit the default).
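
A hedged sketch of what that could look like, assuming (per the patch series above) that EEVDF interprets sched_attr::sched_runtime as the requested slice for SCHED_OTHER tasks; the struct is declared by hand because glibc has no wrapper for this syscall:

```cpp
#include <cstdint>
#include <cstdio>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

// Manual declaration of the kernel ABI struct (glibc does not provide one).
struct sched_attr_compat {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   // with EEVDF: requested slice, in nanoseconds
    uint64_t sched_deadline;
    uint64_t sched_period;
};

int main() {
    sched_attr_compat attr{};
    attr.size = sizeof(attr);
    attr.sched_policy = SCHED_OTHER;
    attr.sched_runtime = 100'000;  // ask for a short slice (100us); illustrative value
    // pid 0 means the calling thread; a reactor would do this once per shard thread.
    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr");  // e.g. EINVAL on kernels that don't honor this
    }
    return 0;
}
```

Whether this actually helps here is unclear, since (as noted above) the starved party may be a kernel thread that keeps the default parameters.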

avikivity avatar Mar 31 '25 20:03 avikivity

and writing the aio completion entry from the interrupt handler, and waking up the reactor thread (if it was sleeping), again from the interrupt handler.

Maybe something changed in aio, or the interrupts are processed in a (non-realtime?!) thread.

I am no expert in this area but yes it looks like the latter to me. We already ran into the same "dio thread is being starved" issue in a similar scenario in k8s which sets up the cgroups scheduling weight in a weird way that favors the reactor thread over the dio thread and hence can starve it.

From looking at perf sched traces back then, it always seemed to be "aio completions" that happened on that kernel thread.

From a quick look at the source (again no expert there), it's described here https://github.com/torvalds/linux/blob/master/fs/super.c#L2168-L2172.

Looks like when using the iomap stuff, all writes complete via that queue here: https://github.com/torvalds/linux/blob/master/fs/iomap/direct-io.c#L222-L223 (IOMAP_DIO_WRITE is not handled, so we fall through to the else), which calls back into aio_complete_rw.

StephanDollberg avatar Apr 01 '25 11:04 StephanDollberg

and writing the aio completion entry from the interrupt handler, and waking up the reactor thread (if it was sleeping), again from the interrupt handler. Maybe something changed in aio, or the interrupts are processed in a (non-realtime?!) thread.

I am no expert in this area but yes it looks like the latter to me. We already ran into the same "dio thread is being starved" issue in a similar scenario in k8s which sets up the cgroups scheduling weight in a weird way that favors the reactor thread over the dio thread and hence can starve it.

From looking at perf sched traces back then, it always seemed to be "aio completions" that happened on that kernel thread.

From a quick look at the source (again no expert there), it's described here https://github.com/torvalds/linux/blob/master/fs/super.c#L2168-L2172.

"This avoids creating workqueue for filesystems that don't need it"

XFS was supposed not to need it!

Looks like when using the iomap stuff, all writes complete via that queue here: https://github.com/torvalds/linux/blob/master/fs/iomap/direct-io.c#L222-L223 (IOMAP_DIO_WRITE is not handled, so we fall through to the else), which calls back into aio_complete_rw.

I can understand this happening once or twice: the file's map isn't there (could happen after a cold start), so we have to have a thread to handle the completion (since the kernel lacks our wonderful continuations and coroutines). But not for every write.

Alternatively, iotune-redpanda uses O_DSYNC and issues size-changing writes. These are not handled well by XFS and it would be better to get rid of them even on kernels without the bad scheduling granularity.

avikivity avatar Apr 02 '25 17:04 avikivity

This seems to be specifically caused by the usage of an older version of tuned on rhel8. On AWS at least it uses the virtual-guest profile by default. That inherits most of its values from throughput-performance which I think is a fairly default one. The key thing it does is to set /proc/sys/kernel/sched_wakeup_granularity_ns to 15000000 which then causes the bad performance described above. In fact the same can be reproduced on newer kernels by setting that value to the same. I haven't tried on EEVDF/6.6+ yet. RHEL 9 is not affected as newer versions of tuned remove the explicit setting of the scheduler values: redhat-performance/tuned@c6d6fdc So probably disabling poll-aio depending on the kernel version is not a good idea.

Aha. Perhaps Seastar can warn on this bad configuration.

Perhaps, warn and auto-disable io polling.

Note: with io_uring, we don't need (and don't have?) io polling.

avikivity avatar Apr 02 '25 17:04 avikivity

I can understand this happening once or twice: the file's map isn't there (could happen after a cold start), so we have to have a thread to handle the completion (since the kernel lacks our wonderful continuations and coroutines). But not for every write.

Alternatively, iotune-redpanda uses O_DSYNC and issues size-changing writes.

We are just using vanilla scylladb/seastar iotune.

These are not handled well by XFS and it would be better to get rid of them even on kernels without the bad scheduling granularity.

We do not do size-changing writes, we manually fallocate/ftruncate the file out in large blocks when we approach EOF.
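
For concreteness, a minimal sketch of that chunky extension pattern (the chunk size, file name, and exact fallocate/ftruncate combination are illustrative, not Redpanda's actual code):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for fallocate()
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

// When the write position approaches EOF, reserve another large chunk of space
// and move EOF forward, so the hot-path writes themselves never change the file size.
static int extend_in_chunks(int fd, off_t write_pos) {
    constexpr off_t chunk = 32 << 20;  // 32 MiB, illustrative
    struct stat st;
    if (fstat(fd, &st) != 0) {
        return -1;
    }
    if (st.st_size - write_pos >= chunk / 2) {
        return 0;  // still plenty of runway ahead of the writer
    }
    // Reserve blocks without changing the size, then move EOF explicitly.
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, st.st_size, chunk) != 0) {
        return -1;
    }
    return ftruncate(fd, st.st_size + chunk);
}

int main() {
    int fd = open("extend-demo.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || extend_in_chunks(fd, 0) != 0) {
        perror("extend_in_chunks");
        return 1;
    }
    close(fd);
    return 0;
}
```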

However, when writing a file linearly even with the above chunky fallocate/ftruncate to extend EOF, writes are still often updating XFS metadata/journal because they are "unwritten extent conversion" (UWEC) writes: i.e., the first write to a logical block. Before this first write the block just has garbage, and XFS flags it as unwritten in its metadata so that reads are short-circuited and return zero without actually reading the block. The first write then needs to update this metadata and mark the block as written. So all these "first writes" are a large source of metadata churn in XFS, and are something that is handled in the DIO completion flow. They cause other problems too: e.g., one source of EAGAIN returns from io_submit is UWEC writes, which need to update metadata, and the io_submit NOWAIT path bails out in some cases when that needs to happen.

I don't know what the write pattern is in scylla: are you usually overwriting existing data? In that case you'd never notice this issue.

travisdowns avatar Apr 02 '25 19:04 travisdowns

The only practical way I know of to avoid UWEC issues is to actually zero large blocks of the file ahead of application use in a chunky way, but this is "real IO" and would double your IO count for some workloads. In the far future maybe we will be able to use TRIM + RZAT ("Deterministic Read Zero after TRIM") to have cheap logical zeros implemented by the FTL, but I'm not holding my breath.

travisdowns avatar Apr 02 '25 19:04 travisdowns

I can understand this happening once or twice: the file's map isn't there (could happen after a cold start), so we have to have a thread to handle the completion (since the kernel lacks our wonderful continuations and coroutines). But not for every write. Alternatively, iotune-redpanda uses O_DSYNC and issues size-changing writes.

We are just using vanilla scylladb/seastar iotune.

These are not handled well by XFS and it would be better to get rid of them even on kernels without the bad scheduling granularity.

We do not do size-changing writes, we manually fallocate/ftruncate the file out in large blocks when we approach EOF.

However, when writing a file linearly even with the above chunky fallocate/ftruncate to extend EOF, writes are still often updating XFS metadata/journal because they are "unwritten extent conversion" (UWEC) writes: i.e., the first write to a logical block. Before this first write the block just has garbage, and XFS flags it as unwritten in its metadata so that reads are short-circuited and return zero without actually reading the block. The first write then needs to update this metadata and mark the block as written. So all these "first writes" are a large source of metadata churn in XFS, and are something that is handled in the DIO completion flow. They cause other problems too: e.g., one source of EAGAIN returns from io_submit is UWEC writes, which need to update metadata, and the io_submit NOWAIT path bails out in some cases when that needs to happen.

I was imprecise, but that's what I meant: avoid this unwritten extent conversion. In ScyllaDB's commitlog, which uses O_DSYNC writes, we first zero-format the segment using non-O_DSYNC writes, then fdatasync it, then overwrite. Later we recycle the segment to avoid the 2X write amplification.

I don't know what the write pattern is in scylla: are you usually overwriting existing data? In that case you'd never notice this issue.

In the commitlog, we overwrite as described above. For sstables, we append, but without O_DSYNC (and with fdatasync afterwards), and with the nice extent size hint.
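
A minimal sketch of that zero-format-then-overwrite pattern, with illustrative sizes and file name (not ScyllaDB's actual commitlog code):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>
#include <cstdio>

int main() {
    constexpr size_t block = 128 * 1024;           // illustrative write size
    constexpr size_t segment = 16 * 1024 * 1024;   // illustrative segment size
    const char* path = "commitlog-demo.seg";       // run this on the XFS mount under test

    void* buf;
    if (posix_memalign(&buf, 4096, block) != 0) return 1;

    // Phase 1: zero-format the whole segment with plain (non-O_DSYNC) writes,
    // then fdatasync once. Afterwards every block is a written extent.
    std::memset(buf, 0, block);
    int fd = open(path, O_CREAT | O_RDWR | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    for (size_t off = 0; off < segment; off += block) {
        if (pwrite(fd, buf, block, off) != (ssize_t)block) { perror("pwrite"); return 1; }
    }
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
    close(fd);

    // Phase 2: reopen with O_DSYNC and overwrite in place. These writes hit
    // already-written extents, so no unwritten extent conversion is needed.
    fd = open(path, O_RDWR | O_DIRECT | O_DSYNC);
    if (fd < 0) { perror("open"); return 1; }
    std::memset(buf, 'x', block);
    for (size_t off = 0; off < segment; off += block) {
        if (pwrite(fd, buf, block, off) != (ssize_t)block) { perror("pwrite"); return 1; }
    }
    close(fd);
    free(buf);
    return 0;
}
```

Recycling would then mean reusing such a segment for the next log instead of creating and zero-formatting a fresh one, which is what removes the 2X write amplification.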

avikivity avatar Apr 02 '25 19:04 avikivity

Later we recycle the segment to avoid the 2X write amplification.

Yes, exactly. Regardless of the O_DSYNC and other stuff (which is not obviously better, as discussed elsewhere), that's what's needed to avoid the amplification; currently we don't recycle files. If you recycle files then you are doing pure overwrites, and this can be a lot better.

That said, I should clarify that I don't think it's UWEC that causes completion workqueue use: as far as I can tell, XFS has used the workqueue for all AIO writes going back a long time. Only quite recently has Jens been adding this IOMAP_DIO_CALLER_COMP stuff, which can avoid it for XFS, but I think only because io_uring has another safe place to call the completion handler from (so using the work queue was redundant).

So even if you avoid UWEC I'm pretty sure you still suffer the roundtrip through the workqueue for every write.

travisdowns avatar Apr 02 '25 19:04 travisdowns

Later we recycle the segment to avoid the 2X write amplification.

Yes, exactly. Regardless of the O_DSYNC and other stuff (which is not obviously better, as discussed elsewhere), that's what's needed to avoid the amplification; currently we don't recycle files. If you recycle files then you are doing pure overwrites, and this can be a lot better.

That said, I should clarify that I don't think it's UWEC that causes completion workqueue use: as far as I can tell, XFS has used the workqueue for all AIO writes going back a long time. Only quite recently has Jens been adding this IOMAP_DIO_CALLER_COMP stuff, which can avoid it for XFS, but I think only because io_uring has another safe place to call the completion handler from (so using the work queue was redundant).

So even if you avoid UWEC I'm pretty sure you still suffer the roundtrip through the workqueue for every write.

As far as I know there's no workqueue involved in extending writes without O_DSYNC or non-extending writes (or unwritten extent conversion) with O_DSYNC. This is what fsqual measures: context switches involved with writes. If you received a GOOD score, there were no context switches, so no workqueues.

avikivity avatar Apr 02 '25 19:04 avikivity

Seastar applications should see a decreasing context switch rate as the workload increases (if you have dedicated networking cores).

avikivity avatar Apr 02 '25 19:04 avikivity

This is what fsqual measures - context switches involved with writes. If you received a GOOD score, no context switches, so no workqueues.

Hmm, I think fsqual led you astray: it only measures the context switches in io_submit specifically, and it only measures "voluntary" context switches, while dio interruption would be involuntary (to my understanding).

Indeed, with the existing code I get "0" for the case I'm thinking about here (which I expect to be 1), though I get 1 for higher iodepths:

context switch per write io (size-unchanging, append, blocksize 4096, iodepth 1): 0 (GOOD) 
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD) 
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD)

If I record more data I get this:

context switch per write io (size-unchanging, append, blocksize 4096, iodepth 1): 0 0 0 1 (GOOD) 
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 0 0 0.3334 (GOOD) 
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 0 0 0.1429 (GOOD)

The numbers 0 0 0 1 are in order:

- voluntary CI in io_submit (existing reported value)
- voluntary CI in io_getevents
- voluntary CI across the entire loop
- involuntary CI across the entire loop

So indeed, on XFS, this UWEC, non-dsync case is taking 1 context switch per iteration, but whether it's voluntary or not depends: if it causes a block in io_getevents it's voluntary, and then all the work happens before the process gets control again, I guess (so there is no second context switch). In the cases where the io_submit does not switch, you still get a switch later, involuntarily. This is the switch that @StephanDollberg has identified as the root of a couple of different write perf issues now.
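
To make the measurement concrete, here is a hedged sketch of the idea (not fsqual's actual code): count voluntary and involuntary context switches via getrusage() around a loop of iodepth=1 O_DIRECT libaio appends.

```cpp
// Build with: g++ -O2 cs_per_write.cc -laio   (assumes the libaio headers are installed)
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT
#endif
#include <libaio.h>
#include <sys/resource.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>
#include <cstdio>

int main() {
    constexpr size_t bs = 4096;
    constexpr int iterations = 1000;

    void* buf;
    if (posix_memalign(&buf, 4096, bs) != 0) return 1;
    std::memset(buf, 'x', bs);

    // Run this on the filesystem under test (e.g. an XFS mount).
    int fd = open("cs-demo.dat", O_CREAT | O_RDWR | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    int r = io_setup(8, &ctx);
    if (r < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-r)); return 1; }

    rusage before{}, after{};
    getrusage(RUSAGE_SELF, &before);

    for (int i = 0; i < iterations; i++) {
        iocb cb;
        iocb* cbs[1] = { &cb };
        // iodepth=1 append: each write goes to a fresh offset past the old EOF.
        io_prep_pwrite(&cb, fd, buf, bs, (long long)i * bs);
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }
        io_event ev;
        if (io_getevents(ctx, 1, 1, &ev, nullptr) != 1) { fprintf(stderr, "io_getevents failed\n"); return 1; }
    }

    getrusage(RUSAGE_SELF, &after);
    printf("voluntary cs/write:   %.4f\n",
           double(after.ru_nvcsw - before.ru_nvcsw) / iterations);
    printf("involuntary cs/write: %.4f\n",
           double(after.ru_nivcsw - before.ru_nivcsw) / iterations);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}
```

This only splits the counts into voluntary vs involuntary for the loop as a whole; attributing them to io_submit vs io_getevents, as in the extended output above, needs a getrusage() call around each syscall.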

travisdowns avatar Apr 02 '25 20:04 travisdowns

Sorry, I forgot the full results:

memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   4096
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 0 0 1 (GOOD) 
context switch per write io (size-changing, append, blocksize 4096, iodepth 2): 0.9995 0 0.9995 0.0003 (BAD) 
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8012 0 0.8012 0.0002 (BAD) 
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 1): 0 0 0 1 (GOOD) 
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 0 0 0.3334 (GOOD) 
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 0 0 0.1429 (GOOD) 
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 0 0.125 0.8751 (BAD) 
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 0 0 1 (GOOD) 
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 0 0 1.0014 (GOOD) 
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 0 0 0.9716 (GOOD) 
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 0 0 0.9863 (GOOD) 
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 0 0 1 (GOOD) 
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8017 0 0.8017 0.0001 (BAD) 
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 0 0 0.3335 (GOOD) 
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 0 0 0.143 (GOOD) 
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.1251 0 0.1251 0.8752 (BAD) 
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 0 0 1.0004 (GOOD) 
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 0 0 1.0013 (GOOD) 
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 0 0 0.9784 (GOOD) 
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 0 0 0.0002 (GOOD) 

Every iodepth=1 case takes at least 1 context switch per write, no matter how "ideal"; higher iodepths take fewer because the work in the kernel worker can be amortized over more writes.

travisdowns avatar Apr 02 '25 20:04 travisdowns

Perhaps, warn and auto-disable io polling.

I have suggested a PR here: https://github.com/scylladb/seastar/pull/2715 . We can discuss details there.

Btw, I just tested with the io_uring backend on Ubuntu 20/5.15 and it's also affected by a too-high wakeup granularity value, though to a lesser extent - random IOPS "only" drop by ~8x.

StephanDollberg avatar Apr 03 '25 11:04 StephanDollberg