
zio: add separate pipeline stages for logical IO

Open robn opened this issue 7 months ago • 5 comments

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

The "logical" IO responsible for farming work out to the vdevs goes through the VDEV_IO stages, even though it does no IO itself, does not have a vdev set, and is not a "vdev" child IO.

This means the VDEV_IO stages need special handling for this particular kind of IO, some of it totally irrelevant to real vdev IO (eg config lock, retry, etc). It also leads to some confusing asymmetries, eg the DVA throttle is held in the logical, and then released in pieces in the children.
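
For context, the current zio_vdev_io_start() opens with a special case roughly like this (simplified here for illustration; see the real source for the details):

```c
static zio_t *
zio_vdev_io_start(zio_t *zio)
{
	vdev_t *vd = zio->io_vd;

	if (vd == NULL) {
		/*
		 * "Logical" IO: no vdev. Take the config lock on its
		 * behalf and hand the block to the mirror ops to fan
		 * out one child per DVA.
		 */
		if (!(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
			spa_config_enter(zio->io_spa, SCL_ZIO, zio,
			    RW_READER);
		vdev_mirror_ops.vdev_op_io_start(zio);
		return (NULL);
	}

	/* ... real vdev IO handling continues from here ... */
}
```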

(I can elaborate on what I'm doing if more justification is needed, but I'm hopeful this stands on its own as a good cleanup).

Description

This commit adds two new stages to the pipeline, ZIO_LOGICAL_IO_START and ZIO_LOGICAL_IO_DONE, to handle this IO. This allows a clean separation between logical and vdev IO: vdev IO always has io_vd set and an io_child_type of ZIO_CHILD_VDEV, while logical IO is the inverse. Logical IO only ever goes through the LOGICAL_IO pipeline, and vdev IO through VDEV_IO.
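
To make the invariant concrete, it's the kind of rule that could now be stated as a check like this (illustrative only; this helper isn't part of the change, it just spells out the split):

```c
/* Illustrative only: the invariant the new stages make checkable. */
static inline void
zio_assert_logical_vs_vdev(zio_t *zio)
{
	if (zio->io_child_type == ZIO_CHILD_VDEV) {
		/* vdev IO: always has a vdev, only runs the VDEV_IO stages. */
		ASSERT3P(zio->io_vd, !=, NULL);
	} else {
		/* Logical IO: never has a vdev, only runs the LOGICAL_IO stages. */
		ASSERT3P(zio->io_vd, ==, NULL);
	}
}
```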

This separation presents a new problem: previously the logical IO would call into the mirror vdev ops to issue the vdev IO, which is no longer possible because non-vdev IOs can't use vdev operations. To keep the overall pipeline tidy, we press the root vdev into service. zio_logical_io_start() creates a child IO against spa_root_vdev, which then delegates to the mirror vdev ops to do its work.
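
In outline, the new start stage looks something like this (a sketch only, with flags and error handling omitted; the actual code in the branch differs in detail):

```c
static zio_t *
zio_logical_io_start(zio_t *zio)
{
	spa_t *spa = zio->io_spa;

	/*
	 * The logical IO does no IO itself: issue one child against the
	 * root vdev, whose (mirror) ops fan the block out per DVA.
	 */
	zio_nowait(zio_vdev_child_io(zio, zio->io_bp, spa->spa_root_vdev,
	    0, zio->io_abd, zio->io_size, zio->io_type, zio->io_priority,
	    0, NULL, NULL));

	return (zio);
}
```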

How Has This Been Tested?

Successful ZTS run against Linux 6.1.137 and FreeBSD 14.2-p1.

Throughput appears to be similar in light performance tests, though I have not pushed it very hard.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Performance enhancement (non-breaking change which improves efficiency)
  • [x] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Quality assurance (non-breaking change which makes the code more robust against bugs)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [x] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

robn avatar May 28 '25 07:05 robn

It also leads to some confusing asymmetries, eg the DVA throttle is held in the logical, and then released in pieces in the children.

It might not look great, but it is done that way intentionally. If we are writing several copies of a block to several different top-level vdevs, we want each of them to be throttled individually. Otherwise a single slow vdev would throttle the others, and avoiding that is the point of the allocation throttling mechanism.
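
Roughly the shape of it (helper names here are made up for illustration, not the real metaslab/zio throttle functions): the logical write reserves one allocation slot per copy, and each top-level vdev child releases only its own slot when it finishes.

```c
/* Hypothetical helpers, for illustration of the accounting only. */
extern void alloc_throttle_reserve(zio_t *lio, int slots);
extern void alloc_throttle_release(zio_t *lio, int slots);

/* Logical write: reserve one throttle slot per copy being written. */
static void
logical_write_throttle(zio_t *lio, int copies)
{
	alloc_throttle_reserve(lio, copies);
}

/* Top-level vdev child completion: give back just this vdev's slot. */
static void
vdev_child_write_done(zio_t *cio)
{
	if (cio->io_vd != NULL && cio->io_vd == cio->io_vd->vdev_top)
		alloc_throttle_release(cio->io_logical, 1);
}
```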

Previously we even tracked dirty data at individual leaf vdev granularity, but I had to remove that some time ago because the processing overhead was too high.

amotin avatar May 28 '25 13:05 amotin

It might not look great, but it is done that way intentionally. If we are writing several copies of a block to several different top-level vdevs, we want each of them to be throttled individually. Otherwise a single slow vdev would throttle the others, and avoiding that is the point of the allocation throttling mechanism.

Yeah that's fair. I knew why, -ish, but hadn't quite joined up all the dots.

I've put it back the way it was, adjusted for the new model. Tests will run overnight; if it comes up ok I'll push it. Thanks.

robn avatar May 29 '25 11:05 robn

Rebased, and updated to move the unthrottle back to when the top vdev IO finishes. resilver_restart_002 still fails, pretty sure it'll be something around about where errors are being accounted to. I'll get back to it properly next week.

robn avatar May 30 '25 09:05 robn

resilver_restart_002 still fails, pretty sure it'll be something around about where errors are being accounted to. I'll get back to it properly next week.

Sorta-kinda. It was actually around where fault injections targeting a DVA/bookmark/object type were injected. I had sorta blindly moved them to zio_logical_io_done() because those are logical concepts; however, the injections still have to be done at the vdev level, because (a) they may target a specific vdev, and (b) we want to record the vdev as the "source" of the error.
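
So the injection check stays in the vdev done path, roughly like this (a simplified sketch, not the literal diff):

```c
static zio_t *
zio_vdev_io_done(zio_t *zio)
{
	/* ... normal vdev IO completion handling ... */

	if (zio->io_error == 0 && zio->io_vd != NULL) {
		/*
		 * zinject -b targets a DVA/bookmark/object, but the hit
		 * still has to happen here, where io_vd tells us which
		 * vdev to charge as the source of the error.
		 */
		zio->io_error = zio_handle_fault_injection(zio, EIO);
	}

	return (zio);
}
```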

As it turns out, there is exactly one test that uses this (zinject -b), and that's resilver_restart_002. So that's working now.

robn avatar Jun 05 '25 08:06 robn

I did some semi-scientific test runs this morning: production (non-debug) build at the given commit, each run on a freshly-created pool of 14 disks, 100 threads, either 73 writing + 27 reading or 100 writing, with or without fsync after each file write. There's various customer tuning applied, but it's the same for both builds, so I don't claim these numbers are "good" or "bad" in general; over the years though, comparison timings like these have proven very useful, and I believe they are here too.

14x RAIDz3:

| | 3300 w73/r27 write | 3300 w73/r27 read | 3301 w73/r27 + fsync write | 3301 w73/r27 + fsync read | 3302 w100 write | 3303 w100 + fsync write |
|---|---|---|---|---|---|---|
| master @ e0ef4d2768 [PROD] | 1134 MB/s | 2246 MB/s | 387 MB/s | 10684 MB/s | 1601 MB/s | 473 MB/s |
| zio-logical-io @ f4480eec0a [PROD] | 1139 MB/s | 2284 MB/s | 386 MB/s | 10638 MB/s | 1601 MB/s | 458 MB/s |
| delta | 0.44% | 1.69% | -0.26% | -0.43% | 0.00% | -3.17% |

7x2 mirror:

| | 3300 w73/r27 write | 3300 w73/r27 read | 3301 w73/r27 + fsync write | 3301 w73/r27 + fsync read | 3302 w100 write | 3303 w100 + fsync write |
|---|---|---|---|---|---|---|
| master @ e0ef4d2768 [PROD] | 1064 MB/s | 3722 MB/s | 354 MB/s | 10755 MB/s | 1302 MB/s | 431 MB/s |
| zio-logical-io @ f4480eec0a [PROD] | 1065 MB/s | 3731 MB/s | 359 MB/s | 10688 MB/s | 1309 MB/s | 430 MB/s |
| delta | 0.09% | 0.24% | 1.41% | -0.62% | 0.54% | -0.23% |

So performance difference seems negligible, and the tests are passing. Would appreciate a proper review on this please!

robn avatar Jun 12 '25 02:06 robn

I am likely to withdraw this. It destroys read throughput on mirrors, because reads spread across multiple mirror vdevs all end up having to return through the single extra child IO in the middle (or something very close to that). I had attributed the slowdown to the significantly more complicated feature I'm building on top, found a different way to do what I needed, and only realised afterwards that (a) the feature wasn't the cause at all, and (b) the different and better way didn't end up needing this change at all.

(yes, this is vague, don't worry about it).

I still like the conceptual separation, but not if it costs that. I may send a separate PR with some of the comments from here, since a lot of this PR was just because I found parts of the pipeline hard to understand.

Anyway, that's where I'm at. I'll be totally sure tomorrow once I finish my testing, and then I'll withdraw it, or tell you why I changed my mind (oh no).

robn avatar Aug 21 '25 03:08 robn

Yep, pretty sure I don't need this now. Let's keep making zio_t smaller though, that's gonna help everything :)

robn avatar Aug 22 '25 09:08 robn