ZIO: add "vdev tracing" facility; use it for ZIL flushing
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
In #16314 I noted that the change addresses the correctness issues, but introduces a performance problem when the pool is degraded. This PR resolves that.
This PR depends on #16314, so those commits are included here. I suggest reviewing that PR first; the additional change on top of it is in the top commit.
Description
A problem with zio_flush() is that it issues a flush ZIO to a top-level vdev, which then recursively issues child flush ZIOs until the real leaf devices are flushed. As usual, an error in a child ZIO causes the parent ZIO to error too, so if a leaf device has failed, its flush ZIO will fail, and so will the entire flush operation.
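To illustrate the fan-out, here is a simplified sketch. This is not the actual OpenZFS code path (the real recursion happens inside the vdev layer, and the helper name here is invented), but it shows the shape of the problem:

```c
/*
 * Simplified illustration only: a flush against a top-level vdev
 * fans out to every leaf beneath it.
 */
static void
flush_recursive(zio_t *pio, vdev_t *vd)
{
	if (vd->vdev_children == 0) {
		/* Real leaf device: issue the actual cache flush. */
		zio_flush(pio, vd);
		return;
	}
	for (uint64_t c = 0; c < vd->vdev_children; c++)
		flush_recursive(pio, vd->vdev_child[c]);

	/*
	 * If any leaf has failed, its flush ZIO fails, the error
	 * propagates up through each parent, and the whole flush
	 * is reported as failed.
	 */
}
```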
This didn't matter when we used to ignore flush errors, but now that we propagate them, the flush error propagates into the ZIL write ZIO. This causes the ZIL to believe its write failed, and fall back to a full txg wait. This still provides correct behaviour for zil_commit() callers (eg fsync()) but it ruins performance.
We cannot simply skip flushing failed vdevs, because the associated write may have succeeded before the vdev failed, which would give the appearance that the write is fully flushed when it is not. Neither can we issue a "syncing write" to the device (eg SCSI FUA), as this also degrades performance.
The answer is that we must bind writes and flushes together in a way such that we only flush the physical devices that we wrote to.
This adds a "vdev tracing" facility to ZIOs, zio_vdev_trace. When tracing is enabled on a ZIO with ZIO_FLAG_VDEV_TRACE, then upon successful completion (in the _done handler), zio->io_vdev_trace_tree holds a set of zio_vdev_trace_t objects, each describing a vdev that was involved in the successful completion of the ZIO.
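As a rough sketch of the shape of the data (the field and helper names here are my paraphrase of the description above, not necessarily the exact ones in the patch):

```c
#include <sys/avl.h>

/* One record per leaf vdev that participated in the ZIO. */
typedef struct zio_vdev_trace {
	uint64_t	vt_guid;	/* guid of the traced leaf vdev */
	avl_node_t	vt_node;	/* link into zio->io_vdev_trace_tree */
} zio_vdev_trace_t;

/* Comparator keyed on vdev guid, so each vdev appears at most once. */
static int
zio_vdev_trace_compare(const void *va, const void *vb)
{
	const zio_vdev_trace_t *a = va;
	const zio_vdev_trace_t *b = vb;

	return (TREE_CMP(a->vt_guid, b->vt_guid));
}
```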
A companion function, zio_vdev_trace_flush(), is included; it issues flush ZIOs to the vdevs on the given trace tree. zil_lwb_write_done() is updated to use this to bind ZIL writes and flushes together.
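In outline, the binding looks something like the sketch below. The exact zio_vdev_trace_flush() signature and the surrounding lwb bookkeeping are assumptions on my part:

```c
static void
zil_lwb_write_done(zio_t *zio)
{
	lwb_t *lwb = zio->io_private;

	if (zio->io_error == 0) {
		/*
		 * The write succeeded, so zio->io_vdev_trace_tree now
		 * names exactly the physical vdevs it touched. Flush
		 * those, and only those, under the lwb's root ZIO.
		 */
		zio_vdev_trace_flush(lwb->lwb_root_zio,
		    &zio->io_vdev_trace_tree);
	}

	/* On error, the ZIL falls back to waiting for the txg to sync. */
}
```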
The tracing facility is similar in many ways to the "deferred flushing" facility inside the ZIL, to the point where it can replace it. Now, if the flush should be deferred, the trace records from the writing ZIO are captured and combined with any captured from previous writes. When it's finally time to issue the flush, we issue it to the entire accumulated set of traced vdevs, as sketched below.
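A sketch of how deferral might merge trace records, using the standard OpenZFS AVL API; the helper name and the duplicate-handling details are assumptions:

```c
/*
 * Move this write's trace records into the accumulated tree kept for
 * a deferred flush, discarding duplicates for vdevs already recorded.
 */
static void
zil_flush_defer(avl_tree_t *accum, avl_tree_t *traced)
{
	zio_vdev_trace_t *vt;
	void *cookie = NULL;

	while ((vt = avl_destroy_nodes(traced, &cookie)) != NULL) {
		avl_index_t where;

		if (avl_find(accum, vt, &where) == NULL)
			avl_insert(accum, vt, where);
		else
			kmem_free(vt, sizeof (*vt));
	}
}
```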
Further reading
I presented this work at AsiaBSDCon 2024. Paper, slides and other notes available at: https://despairlabs.com/presentations/openzfs-fsync/
How Has This Been Tested?
A test is included to provide some small proof that it works. Without the change it fails, because it sees a txg sync fallback.
Full ZTS run passes.
Also in production at a customer site, and appears to be working well.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [x] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [x] I have added tests to cover my changes.
- [x] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain Signed-off-by.
Making this a draft for now. I still think the technique is good for what it is, but I've been doing more work on flush response in spa_sync() and it needs a whole different mechanism (for reasons that I won't go into here). So I want to wait until that is fleshed out, then I can revisit this (and #16314) to see if it can be re-expressed in terms of whatever I come up with, or if we will need both mechanisms.
Obsoleted by #17065.