
ZIL: "crash" the ZIL if the pool suspends during fallback

robn opened this pull request 7 months ago • 5 comments

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

Currently, if something goes wrong inside the ZIL, zil_commit() will fall back to a full txg sync, which provides the same semantics, just slower. If the pool suspends, txg_wait_synced_flags() will block until the pool resumes. The end result is that syncing ops like fsync() block regardless of how failmode= is set.

Since #17355, txg_wait_synced_flags() can bail out if the pool suspends while waiting. We can use this feature to not block zil_commit() if the pool suspends, but instead unwind and return an error. Once wired back through, this allows fsync() and friends to return EIO rather than blocking forever.

This is an obvious good, as it allows applications to take alternate action when the pool suspends, rather than just waiting forever.
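To make that concrete, here is a minimal sketch of the application side (the file name and recovery action are placeholders, not from this PR): with this change and failmode=continue, a sync op that would previously block forever can instead come back with EIO.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
            if (argc < 2)
                    return (1);

            int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
            if (fd < 0 || write(fd, "important\n", 10) < 0) {
                    perror("open/write");
                    return (1);
            }

            /*
             * Previously this could block forever on a suspended pool;
             * with failmode=continue it can now fail with EIO instead.
             */
            if (fsync(fd) < 0 && errno == EIO) {
                    fprintf(stderr, "pool unavailable; failing over\n");
                    /* take alternate action: spool elsewhere, alert, ... */
                    return (1);
            }
            return (0);
    }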

(Further information in my BSDCan 2024 talk “Why fsync() on OpenZFS can’t fail, and what happens when it does”, if you like that sort of thing).

Description

Going commit-by-commit.

ZTS: test response of various sync methods under different failmodes

Test suite additions that confirm the behaviour of various syncing ops (fsync(), msync(), write() on a file opened with O_SYNC, and write() with sync=always), with failmode=wait and failmode=continue. The short of it: when the pool suspends, syncing ops should block with failmode=wait, and return an error with failmode=continue.

I couldn’t find a convenient program that would do a dd-like write+sync using mmap()+msync(), so I wrote one just for this test.
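For illustration, the core of such a tool might look like this (a sketch, not the actual ZTS helper):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
            size_t len = 1 << 20;   /* 1 MiB, dd-ish block */

            if (argc < 2)
                    return (1);

            int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
            if (fd < 0 || ftruncate(fd, len) < 0) {
                    perror("open/ftruncate");
                    return (1);
            }

            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return (1);
            }

            memset(p, 0xaa, len);   /* dirty the mapped pages */

            /* MS_SYNC: block until the dirty pages are written back. */
            if (msync(p, len, MS_SYNC) < 0) {
                    perror("msync");
                    return (1);
            }
            return (0);
    }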

ZIL: allow zil_commit() to fail with error

This changes zil_commit() to return int, and then updates all call sites to have appropriate error handling. Most of these are fairly mechanical, being the handling for sync=always. Any tricky ones have comments.

Of interest, I’ve made zil_commit() __attribute__((__warn_unused_result__)), causing the compiler to check that the result is used and throw a warning (promoted to an error in our builds) if not. This was initially to help smoke out all the call sites, but I decided to leave it in, on the idea that if you are calling zil_commit(), you are indicating that it is important to you that the data gets onto disk, and you need to take care of it if it’s not. I personally think it’s a slam-dunk for programmer confidence, but I’m happy to discuss it if you don’t feel great about it.
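For anyone unfamiliar with the attribute, a standalone illustration (not the actual OpenZFS declaration):

    /* Any caller that ignores the result gets a compiler warning,
     * which -Werror turns into a build failure. */
    __attribute__((__warn_unused_result__))
    static int
    must_check(void)
    {
            return (-1);
    }

    int
    main(void)
    {
            must_check();                   /* warns: result ignored */
            if (must_check() != 0)          /* fine: result handled */
                    return (1);
            return (0);
    }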

ZIL: pass commit errors back to ITX callbacks

ZIL transactions (itx_t) can optionally have a callback, which will be called when it is destroyed (committed or not). This is used in one place: on Linux, a syncing writepage (that is, msync()) will set a callback so that it can mark the page clean once the op has completed.

Since that’s a syncing op that we want to fail if the ZIL crashes, we pass any error back to the callbacks as well. For the existing callbacks (zfs_putpage_sync_commit_cb(), zfs_putpage_async_commit_cb()), we handle the error by keeping the page dirty, and also setting its error flag. This prevents the page being evicted until it is written back cleanly, and the error is propagated back up to msync() later when the kernel calls vfs_fsync_range() -> zpl_fsync() -> filemap_write_and_wait_range().

With #17445, FreeBSD gets equivalent changes. We don't currently have the sync/async tracking there, so we simply force a writeback in zfs_freebsd_fsync(), and if the callbacks receive errors, we unlock and notify, but leave the page dirty.

ZIL: "crash" the ZIL if the pool suspends during fallback

The main event. Now that we have the ability to pass errors back to callers, we need to generate some when things go wrong!

The basic idea is straightforward: instead of falling back to txg_wait_synced(), which blocks on suspend, we change all calls to txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error return back to the zil_commit() caller.
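The shape of the change, roughly (a sketch with the surrounding arguments elided, not the literal diff):

    /* Before: blocks indefinitely while the pool is suspended. */
    txg_wait_synced(dp, txg);

    /* After: can bail out on suspend; the error is threaded back up
     * so the zil_commit() caller sees it. */
    int error = txg_wait_synced_flags(dp, txg, TXG_WAIT_SUSPEND);
    if (error != 0)
            return (error);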

Handling suspension means returning an error to all commit waiters. This is relatively straightforward: zil_commit_waiter_t already has zcw_zio_error to hold the write IO error, and a set error signals a fallback to txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail on suspend, so the waiter can now return an error from zil_commit().

However, commit waiters are normally signalled when their associated write (LWB) completes. If the pool has suspended, those IOs may not return for some time, or maybe not at all. We still want to signal those waiters so they can return from zil_commit(). We have a list of the in-flight LWBs on zl_lwb_list, so we can run through it, detach their waiters and signal them. The LWB itself is still in flight, but no longer has attached waiters, so when it returns there will be nothing to do. ITXs are directly connected to LWBs, so we can destroy them there too, passing the error on to their completion callbacks.

At this point, all ZIL waiters have been ejected, so we only have to consider the internal state. We potentially still have ITXs that have not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL is in an unknown state; some writes may have been written but not returned to us. We really can't rely on any of it; the best thing to do is abandon it entirely and start over when the pool returns to service. But, since we may have IO out that won't return until the pool resumes, we need something for it to return to.

The simplest solution, implemented here, is to "crash" the ZIL: accept no new ITXs, make no further updates, and let it empty out on its normal schedule, that is, as txgs complete and zil_sync() and zil_clean() are called. We set a "restart txg" to four txgs in the future (syncing + TXG_SIZE), at which point all the internal state will have been cleared out, and the ZIL can resume operation (handled at the top of zil_clean()).

This commit adds zil_crash(), which handles all of the above:

  • sets the restart txg
  • captures and signals all waiters
  • zeroes the header

zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND) returns because the pool suspended (ESHUTDOWN).
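Put together, a pseudocode-level sketch of the sequence (field and helper names here are placeholders, not the actual patch; zil_lwb_eject_waiters() in particular is a hypothetical helper):

    static void
    zil_crash(zilog_t *zilog)
    {
            /* Accept no new ITXs; resume four txgs out, once all the
             * internal state has drained via zil_sync()/zil_clean(). */
            zilog->zl_restart_txg =
                spa_syncing_txg(zilog->zl_spa) + TXG_SIZE;

            /* Detach waiters from in-flight LWBs and signal them, so
             * blocked zil_commit() callers can unwind with an error. */
            for (lwb_t *lwb = list_head(&zilog->zl_lwb_list); lwb != NULL;
                lwb = list_next(&zilog->zl_lwb_list, lwb))
                    zil_lwb_eject_waiters(lwb, SET_ERROR(EIO));

            /* Abandon the on-disk ZIL: zero the header so nothing tries
             * to replay whatever may or may not have landed. */
            memset(&zilog->zl_header, 0, sizeof (zil_header_t));
    }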

The rest of the commit is just threading the errors through, and related housekeeping.

Discussion

The failmode= check

I’ve put the failmode= check at the bottom of zil_commit(), falling back to a “forever” txg_wait_synced() if the pool suspends and failmode=wait. There are two possible related issues, one structural, one design.

First, I think this isn’t the best place for this check. I feel like zil_commit() should always return EIO if the pool suspends, because it’s sort of an “off to the side” component. However, it’s very convenient to do the check here because of all the ZPL callers that would have to do those checks themselves if it wasn’t there. It might be better to do it in a wrapper, or with a flag.

The design question is this. I think there is a case to be made that syncing ops should always return EIO if they cannot be serviced now, regardless of the setting of failmode=. I admit it’s hard to draw a line, but I sort of see it that a basic async write() is always kind of a best effort, while something like fsync() is explicitly an application asking for something to be on disk. Any well-designed fsync() caller has to have a solid error strategy; why should it be delayed from enacting it? Almost by definition it needs an EIO handler, while ops like write() are often not checked as rigorously.

On balance, I think gating it behind failmode=continue for now is a good idea since it represents such a fundamental change to how OpenZFS works that we should tread carefully, but I’m curious to hear if people have any thoughts.

No error checks in tests

The tests only check whether the syncing op blocks, not whether it returns an error. This is deliberate. Until we resolve the problem that flush errors don’t propagate to write errors (both in the ZIL and in the main spa_sync() line), syncing ops will always be able to erroneously return success instead of failure. This PR is mostly about unblocking the syncing op, though we return errors when we can.

What I think will be the final change to handle flush errors as write errors should be in a PR in a week or two. It’s no secret that I’ve attempted this multiple times before, and I think I’ve got something workable this time. It’s extremely invasive though, so I don’t want to mix it up with this PR because it’ll just make it even harder to review.

FreeBSD page writeback errors

Resolved when #17445 lands. This PR is based on top of that, and makes the necessary changes to handle writeback errors on FreeBSD too.

Previous discussion below, for interest:

Our page writeback handler for FreeBSD doesn’t appear to have any ability to handle writeback errors in its current form, and I’m not sure what to do about it. From zfs_putpages():

		/*
		 * XXX we should be passing a callback to undirty
		 * but that would make the locking messier
		 */
		zfs_log_write(zfsvfs->z_log, tx, TX_WRITE, zp, off,
		    len, commit, B_FALSE, NULL, NULL);

		zfs_vmobject_wlock(object);
		for (i = 0; i < ncount; i++) {
			rtvals[i] = zfs_vm_pagerret_ok;
			vm_page_undirty(ma[i]);
		}
		zfs_vmobject_wunlock(object);
		VM_CNT_INC(v_vnodeout);
		VM_CNT_ADD(v_vnodepgsout, ncount);

If I’m understanding that correctly, it’s adding the ITX to the ZIL, then immediately marking the pages clean and error-free. If that’s true, then it might mean msync() will always return success and the pages be made clean and evictable, even if they’re not on disk. That almost certainly needs to be fixed, though perhaps we can get away with it a little longer, since this whole PR only changes behaviour with failmode=continue, which isn’t widely used, and the fallback case is not totally reliable on errors, as above.

In any case, I will be trying to rope in a FreeBSD person to help with this bit. Let me know if that’s you!

How Has This Been Tested?

Full ZTS run, including new tests, completed successfully against Linux 6.1.137 and FreeBSD 14.2-p1 (though I'll concede, a few recent PRs of mine have come with full test runs locally and still tripped up in CI, so I should at least work out how to rephrase this boilerplate!)

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Quality assurance (non-breaking change which makes the code more robust against bugs)
  • [x] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [x] I have added tests to cover my changes.
  • [x] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

robn avatar May 29 '25 08:05 robn

I have a subtle feeling that you are overdoing something with the zl_lwb_crash_list. The ZIL already has a mechanism to chain errors via lwb_error, and an error on one LWB automatically means the same error on all the following ones, without even issuing them, unless they are already issued. And proper ZIL operation is restored only when some transaction commit finally succeeds in restarting the ZIL chain, which means the pool is obviously back alive. I get your idea of removing ITXes and calling callbacks for potentially stuck LWB ZIOs; it is much better than attempting to terminate stuck ZIOs. But do you need some other LWB list and special cleanup for that?

I wouldn't be surprised. All the way through I wondered if I was overdoing it, so I've tried to make it as safe as possible. My concern was that with disks gone, what if some IO never returned and/or some returned out of order? Could IO not return even after the pool had resumed? I don't even know if any of this is possible but I was trying to avoid assuming anything at all about the lost IO, and just make sure there was always some stub lying around in case it ever did come back.

I'll have another read through next week; maybe with fresh eyes it'll be obvious where it's overdone (it's actually been months since I've thought very hard about this).

You definitely understand the ZIL better than I do though, so if anything jumps out at you as obviously unnecessary, please do let me know!

robn avatar May 30 '25 12:05 robn

I've spent most of the day studying how page writeback errors wire up to this (wrt https://github.com/openzfs/zfs/pull/17398#discussion_r2119848333 above) and it seems it all needs work.

As best I can tell, we're handling async/sync page writeback correctly (ie blocking when we need to block), but I'm pretty sure we're not really holding the kernel tools correctly to ensure errors are all propagated properly. Not really surprising; we never had to do it before!

It doesn't affect the bulk of this work (in zil.c); it's all in the dance between zfs_putpage(), zpl_sync(), and the itx callbacks. I think I know what to do with it all; just need to work through it, and write a test for it. That'll be tomorrow I think.

robn avatar Jun 02 '25 06:06 robn

Alright, last push has a new commit, sorting out the error reporting and page dirtying dance between zpl_fsync(), zfs_putpage() and the itx callbacks. I believe it's (close to) right, but this is deep and subtle.

Before the end of the week I will try to write or extend a test to exercise combinations of mapped and not-mapped writes and syncs when pool suspension is involved.

robn avatar Jun 04 '25 10:06 robn

I've got some more changes coming on this too, though I am fast running out of week for it:

  • incorporate (rebase atop) #17420, and make sure the zil_commit() call in zfs_sync() gets back to syncfs() correctly
  • some kind of "nowait" or "callback" zil_commit(). There are now two places where it's only called to get things writing, and it would be very useful for the "nowait" mode of zfs_sync() as well. At least, I want a mode which doesn't consider failmode= and just returns, but I think in both styles the error is important, so I need to think a little about that.
  • write the page writeback callbacks for FreeBSD. I suspect I know enough now to identify the shapes I need in the FreeBSD kernel and hook it up

robn avatar Jun 04 '25 10:06 robn

Last push rebased on #17420 and #17445, and includes matching changes to both, most importantly that the FreeBSD putpage callbacks now receive and handle errors too.

I'm still chewing on variations of zil_commit(). I do think I want to be able to separate out the failmode= check, and will look at it soon.

I feel like being able to start writing ahead of time is more a reflection of some existing structural problem.

A true async version could be achieved by returning a commit waiter, having a callback on the commit itx, or stashing the commit waiter for the next call with some kind of cookie to reference it. But the actual question is "why?": a call to zil_commit() is saying you want everything it's got on disk now, and it's only the async/sync counts and the writeback error states complicating matters. Maybe all of that should be moved inside the ZIL, with other ways to signal what it is you want (using the ZIL to help with things that aren't strictly sync, I guess?).

I think that needs a bit more thought, but outside of this PR. It's not necessary for this, it'll just be nicer code at the end, and I'd rather not blow out this diff any further.

robn avatar Jun 10 '25 03:06 robn

Ok, last commit is the final form I believe.

Rebased to master, now that #17420 and #17445 are merged.

Here I add zil_commit_flags(), following the trend set by dmu_tx_assign_flags() and txg_wait_synced_flags(). Just one flag for now, ZIL_COMMIT_FAILMODE, to indicate whether failmode= should be considered if the pool suspends. If set, and failmode=wait, it will just drop into txg_wait_synced(), then return 0.

zil_commit() becomes a wrapper that always sets ZIL_COMMIT_FAILMODE, making it the easy choice for most calls (ie VFS ops).
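So the wrapper is about as simple as it sounds; roughly (prototypes may differ slightly in the actual patch):

    int
    zil_commit(zilog_t *zilog, uint64_t oid)
    {
            /* The common case: honour failmode= if the pool suspends. */
            return (zil_commit_flags(zilog, oid, ZIL_COMMIT_FAILMODE));
    }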

robn avatar Jun 17 '25 02:06 robn

Last push fixes the two comments mentioned, and a handful of other crimes against English.

@behlendorf thank you for your time!

robn avatar Jul 31 '25 06:07 robn

"its complicated"

😰

robn avatar Aug 04 '25 01:08 robn

@amotin asked me what would happen if in zil_lwb_write_done(), we call list_next(&zilog->zl_lwb_list, lwb) while lwb is not on zl_lwb_list but rather, on zl_lwb_crash_list. I said "nothing good, I bet" and went back to digging my hole.

Last push adds an early commit to convert the separate lwb_slim and lwb_slog fields into flags on a new lwb_flags field. Then, in the proper "crash the ZIL" commit, I add a new flag LWB_FLAG_CRASHED, which gets set on lwbs as they are moved onto zl_lwb_crash_list. Later, in zil_lwb_write_done(), I don't even bother looking at lists or doing any postprocessing on lwbs with that flag, as their callers have already been dealt with, and they'll never need flushing. They just get hurried on to the next state.
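Roughly, the consolidated flags look like this (LWB_FLAG_CRASHED is from the comment above; the other names and values are assumed from the old lwb_slim/lwb_slog fields):

    typedef enum {
            LWB_FLAG_SLIM    = (1 << 0),    /* was lwb_slim */
            LWB_FLAG_SLOG    = (1 << 1),    /* was lwb_slog */
            LWB_FLAG_CRASHED = (1 << 2),    /* on zl_lwb_crash_list */
    } lwb_flag_t;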

robn avatar Aug 04 '25 03:08 robn

@robn after a rebase I believe this should be good to go.

behlendorf avatar Aug 06 '25 17:08 behlendorf

Last push rebases and resolves the significant conflicts.

I'm not entirely sure if the async/sync counters were actually handling it before, or if they were hiding some details from view, but after reviewing it again I realised I'd made a mistake that might not have been a problem in practice, but could be in theory.

In zpl_fsync()/zpl_freebsd_fsync(), we'd flush out any dirty pages async, ignoring errors, because the assumption was that if the writeback failed, the pages would be left dirty, and the followup zfs_fsync()->zil_commit() would get the error.

The first part is certainly true, but I think the second is wrong. The main reason that async writeback could fail is if dmu_tx_assign(DMU_TX_WAIT) failed. In that case, the page is not written to the DMU, and we've returned from putpage early, so it won't be logged either. But, because it's not on the ZIL, there's nothing for the ZIL to actually write out.

In practice, I don't think it could have hurt, because dmu_tx_assign(DMU_TX_WAIT) should only fail if the pool suspends and failmode=continue, and in that case zil_commit() will immediately fallback to txg_wait_synced() which will also fail, and so fsync() gets EIO. And for failmode=wait, dmu_tx_assign() would just block. So the end result is the same, but it requires the suspend response to be the same right through, which it might not be in the future if any of that changed (simple example: what if zil_commit() returned success if there was nothing logged for the given object?).

So anyway, I've changed that. Now, if we get an async writeback error for fsync(), we just eject right there, because we can't possibly know what state the pages are in right now. And it ends up being kinda simpler anyway, because we don't have to explain why we're ignoring errors!

robn avatar Aug 07 '25 03:08 robn

Rebased.

robn avatar Aug 07 '25 22:08 robn

Thank you both for your thoughtful review. I know the time and effort it takes, and I really appreciate it.

robn avatar Aug 09 '25 08:08 robn