Support forced export of ZFS pools
Motivation and Context
This change enables users to forcibly export a pool in a manner which safely discards all pending dirty data, transaction groups, associated zios, metaslab changes, etc. It allows a regular zpool export to fail out so that it releases the namespace lock, enabling a forced export. It is able to do this regardless of whether the current pool activity involves send, receive, or POSIX I/O.
This allows users to continue using their system without rebooting, in the event the disk cannot be recovered in an online manner, or the user prefers to resume use of their pool at a later time. Since ZFS can simply resume from the most recent consistent transaction group, the latter is easily achieved.
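For example, once a pool's device has gone away and the pool is suspended, it can be hard-force exported with something like the following (tank is a placeholder pool name; -F is the hard-force option this change adds to zpool export):

```sh
# Hard-force export of a suspended pool ("tank" is a placeholder name).
# Pending dirty data and uncommitted txgs are discarded; the pool can be
# re-imported later from its most recent consistent txg.
zpool export -F tank
```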
Closes #3461
Description
This is primarily of use when a pool has lost its disk and the user does not care about any pending (or other) transactions.
Implement various control methods to make this feasible:
- txg_wait can now take a NOSUSPEND flag, in which case the caller will be alerted if their txg can't be committed. This is primarily of interest for callers that would normally pass TXG_WAIT, but don't want to wait if the pool becomes suspended, which allows unwinding in some cases, specifically when one is attempting a non-forced export. Without this, the non-forced export would preclude a forced export by virtue of holding the namespace lock indefinitely.
- txg_wait also returns failure for TXG_WAIT users if a pool is actually being force exported. Adjust most callers to tolerate this.
- spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
- DMU objset "killer" flag which may be set on an objset being forcibly exported / unmounted.
- SPA "killer" flag which may be set on a pool being forcibly exported.
- DMU send/recv now use an interruption mechanism which relies on the SPA killer being able to enumerate datasets and close any send/recv streams, causing their EINTR paths to be invoked.
- ZIO now has a cancel entry point, which tells all suspended zios to fail, and which suppresses the failures for non-CANFAIL users.
- metaslab, etc. cleanup, which consists of simply throwing away any changes that could not be synced out.
- Linux specific: introduce a new tunable, zfs_forced_export_unmount_enabled, which allows the filesystem to remain in a modified 'unmounted' state upon exiting zpl_umount_begin, to achieve parity with FreeBSD and illumos, which have VFS-level support for yanking filesystems out from under users (a sketch of enabling this tunable follows the list). However, this only helps when the user is actively performing I/O, not when a process is merely sitting on the filesystem. In particular, this allows test #3 below to pass on Linux.
- Add basic logic to zpool to indicate a force-exporting pool, instead of crashing due to lack of config, etc.
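A minimal sketch of enabling the Linux-only tunable mentioned above, assuming it is exposed through the usual ZFS module parameter path like other tunables:

```sh
# Assumed location: ZFS module tunables normally appear under
# /sys/module/zfs/parameters/ on Linux.
echo 1 > /sys/module/zfs/parameters/zfs_forced_export_unmount_enabled
```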
Add tests which cover the basic use cases:
- Force export while a send is in progress
- Force export while a recv is in progress
- Force export while POSIX I/O is in progress (roughly sketched below)
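As a rough illustration of the POSIX I/O case (this is not the actual test code; pool, dataset, and file names are placeholders):

```sh
# Generate sustained POSIX writes on a dataset, then hard-force export the
# pool out from under them. Names below are placeholders.
dd if=/dev/urandom of=/testpool/fs/bigfile bs=1M count=8192 &
sleep 2
zpool export -F testpool   # should succeed despite the in-flight writes
wait                       # the writer is expected to be interrupted with an error
```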
How Has This Been Tested?
- New ZFS Test Suite tests covering the three main pool activity scenarios.
- Testing in a production environment centered on the use of send/recv to a network-based disk, which can fail and not return for a while. The user doesn't mind losing the last few transaction groups and is able to pick up later.
- Existing ZFS Test Suite tests, to check for any unexpected breakage of non-forced-export scenarios.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [x] I have added tests to cover my changes.
- [x] I have run the ZFS Test Suite with this change applied.
- [ ] All commit messages are properly formatted and contain Signed-off-by.
The remaining unchecked items are pending review. I am seeking feedback on whether to break out any of the commits, rather than squashing all of them into the primary one.
Does this address the various zpool/zfs commands blocking when the pool is in distress? I.e., will these commands succeed under such conditions or get blocked?
Codecov Report
Patch coverage: 62.00% and project coverage change: +4.49% :tada:
Comparison is base (161ed82) 75.17% compared to head (f168cb6) 79.66%.
:exclamation: Current head f168cb6 differs from pull request most recent head 0c5b2fa. Consider uploading reports for the commit 0c5b2fa to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## master #11082 +/- ##
==========================================
+ Coverage 75.17% 79.66% +4.49%
==========================================
Files 402 398 -4
Lines 128071 126235 -1836
==========================================
+ Hits 96283 100571 +4288
+ Misses 31788 25664 -6124
Flag | Coverage Δ |
---|---|
kernel | 80.24% <61.60%> (+1.48%) :arrow_up: |
user | 65.35% <40.72%> (+17.93%) :arrow_up: |
Impacted Files | Coverage Δ |
---|---|
include/sys/dmu.h | 100.00% <ø> (ø) |
include/sys/dsl_dataset.h | 100.00% <ø> (ø) |
module/zfs/vdev_initialize.c | 98.10% <ø> (+4.58%) :arrow_up: |
module/zfs/vdev_label.c | 92.22% <0.00%> (+1.35%) :arrow_up: |
module/zfs/metaslab.c | 94.52% <12.50%> (+1.91%) :arrow_up: |
module/zfs/dsl_dataset.c | 90.82% <18.75%> (-0.44%) :arrow_down: |
module/zfs/vdev_rebuild.c | 90.75% <31.57%> (+13.09%) :arrow_up: |
module/zfs/dmu_send.c | 84.42% <33.33%> (+5.04%) :arrow_up: |
module/zfs/vdev_removal.c | 96.50% <33.33%> (+3.79%) :arrow_up: |
module/zfs/vdev_trim.c | 93.50% <38.46%> (+17.50%) :arrow_up: |
... and 33 more |
Ran into #9067 on the last full ZTS run I did locally, but it looks better now.
Does this address the various zpool/zfs commands blocking when the pool is in distress? I.e., will these commands succeed under such conditions or get blocked?
It depends on the specific issue. Most of the time, if a pool is suspended, zpool/zfs commands simply fail and get kicked out. zpool export in particular will now either exit early if non-hardforced (-F), or will be able to forcibly drop all context and export the pool. Whereas before, a non -F export would hang forever while holding the namespace lock, blocking everything that depends on it.
There may be other cases that hold the namespace lock and get stuck on a suspension, but they can be fixed in a follow-up.
@wci @allanjude sorry about the delay in getting back to this. Thanks for addressing my initial feedback. I should be able to further dig in to the PR this week and get you some additional comments and testing.
As part of that work I took the liberty of getting this PR rebased on the latest master, resolving the relevant conflicts, and squashing a few commits. I appreciate how you originally kept the commits separate for the benefit of the reviewers, but I'm not sure keeping the individual review fixes separate is really helpful at this point. I didn't make any changes to the real functionality. You can find the changes in my forced-export-squashed branch. The squashed branch looks like this:
48f1e2429 zfs: support force exporting pools
a107a70c1 zpool_export: make test suite more usable
d07a31049 check_prop_source: improve output clarity
07f9ca481 zfs_receive: remove duplicate check_prop_source
9d64ea495 logapi: cat output file instead of printing
6e659d803 spa_export_common: refactor common exit points
For next steps what I'd suggest is:
- Open a new PR for each bug fix which isn't really core to this feature. There's no reason we can't get those changes integrated right away as long as they pass testing and are entirely independent. This would be a107a70c1, d07a31049, 07f9ca481, 9d64ea495, and 6e659d803. If there are other unrelated hunks we should pull out into their own PRs, that'd be nice too.
- Please review what I've done and go ahead and force update this PR with 48f1e2429 (or a modified version of it you're happy with). This way we can get a fresh run from the CI and see where we stand. It looks like the current version did encounter at least one panic with a full ZTS run. That new commit passes the new test cases locally for me on Linux and sanity.run, but so far that's all I've tried.
I just pushed an update that includes @behlendorf's suggested break-out, and responds to most of the comments. The earlier commits are now in these PRs: #11514 #11515 #11516 #11517 #11518.
OK, #11514 #11515 #11516 #11517 #11518 have all been merged and can be dropped from this PR with a rebase.
~@wci~ @wca when you get a chance to rebase this that would be great.
Just took an initial look at this. How does it interact with force unmounting of filesystems? Does it mean that Linux can now support zfs unmount -f, forcing the unmount even if there are fds open in the filesystem?
On Linux, there are limits to the extent to which filesystems can be forcibly unmounted, which is why I tried to explain the limitation regarding file descriptors held open by processes. On FreeBSD, however, the VFS will disassociate file descriptors from the filesystem, so it can work in that scenario. I believe this also applies to illumos, but I haven't tested there.
@wca since this change modifies the libzfs library ABI, you're going to need to generate a new lib/libzfs/libzfs.abi file and include it in this PR. We added this infrastructure to make sure we never accidentally change the ABI, and it's why the style CI bots failed. The good news is it's straightforward to do. On Linux (Debian and Ubuntu) you just need to install the abigail-tools package, and then invoke the storeabi Makefile target in the lib/libzfs/ directory. That will generate a new libzfs.abi file you can add to the commit. For the benefit of any reviewers, can you please also mention in the commit message or PR description which interfaces have changed.
cd lib/libzfs/
make storeabi
Thanks for the info, I never handled this previously. The libzfs.abi delta seems pretty big, is that normal?
Looking over the test results, I see that everything pretty much passed with the exception of these two persistent ARC tests. According to the logs it seems that after an export/import cycle there weren't any cache hits when normally there should have been. That would make sense to me if the pool has been force exported, but neither of these pools use the force flag. It seems like we should be waiting for some inflight l2arc writes in the normal export case and we're not.
FAIL l2arc/persist_l2arc_004_pos (expected PASS)
FAIL l2arc/persist_l2arc_005_pos (expected PASS)
I've looked into these failures for a while, and I'm still not sure why they occur. I agree that the results appear to show no cache hits after an export/import cycle, when there should be some.
The only thing directly related that's modified in this PR is the addition of l2arc_spa_rebuild_stop, which I think is needed anyway (i.e., we could probably factor it out of this PR). But these tests still fail even if I change the three 'exiting' functions to return B_FALSE and remove that call in spa_export_common. All of the behavior modifications can only occur when hardforce=B_TRUE is passed to spa_export_common, or when a filesystem is force unmounted, neither of which applies to these tests...
Thanks for the info, I never handled this previously. The libzfs.abi delta seems pretty big, is that normal?
I'm still pretty new to it myself, but from what I've seen so far the answer is "yes". That's not really a problem, though, since it's a generated file and we only update it when we're knowingly making ABI changes.
I've looked into these failures for a while, and I'm still not sure why they occur. I agree that the results appear to show no cache hits after an export/import cycle, when there should be some.
Oddly I wasn't able to reproduce these locally in a VM, so it may be some kind of race. I'll give it another try later. But if you're able to easily reproduce it, I'd suggest taking a look at the l2_rebuild_* arcstats to see how much data was rebuilt. Maybe the new l2arc_spa_rebuild_stop call in spa_export_common is responsible and we're not writing the log blocks. It looks to me like it is called regardless of whether the hardforce option is passed or not.
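If it helps, on Linux the rebuild counters can be dumped with something like the following (assuming the persistent-L2ARC l2_rebuild_* entries in arcstats):

```sh
# Print the L2ARC rebuild statistics after an export/import cycle.
# kstat columns are: name type data, so $1 and $3 are the name and value.
awk '/^l2_rebuild/ { print $1, $3 }' /proc/spl/kstat/zfs/arcstats
```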
@wca would you mind rebasing this again on the latest master. There are minor conflicts to resolve and you'll need to generate a new libzfs.abi file; the recent "compatibility" feature PR (#11468) also changed the ABI a little. I'll see if I can reproduce the test failures; if we can get them resolved we can get this PR wrapped up and merged.
@wca it looks like there's still at least one path in the force export code which needs to be updated. This was caught by the CI
http://build.zfsonlinux.org/builders/CentOS%208%20x86_64%20%28TEST%29/builds/2902/steps/shell_4/logs/console
@behlendorf I'm not sure what to make of the results from the last run. I'm unable to reproduce the 1-2 ZTS test failures that only happen on the Ubuntu 18.04 test or occasionally on Ubuntu 20.04. Please advise.
@wca it looks to me like we're now just seeing a failure in the checkpoint_lun_expsz test case. I'll see if I can reproduce this to debug it; it looks like most of the Linux builders hit it, so I'm optimistic it'll be reproducible.
Would this fix the following case?
- pool 1 working
- pool 2 on external USB HDD
- power fails on USB HDD
- pool 2 enters suspended state
- processes enter and remain in D state (mlocate, smbd)
- all zpool commands hang in D state, including a simple zpool status
Would this fix the following case?
- pool 1 working
- pool 2 on external USB HDD
- power fails on USB HDD
- pool 2 enters suspended state
- processes enter and remain in D state (mlocate, smbd)
- all zpool commands hang in D state, including a simple zpool status

Yes, that is the type of case this is meant to resolve.
Anyone have any insight into these CentOS 8 test failures?
Thanks for rebasing this. The failures look unrelated to me; I've gone ahead and resubmitted the build on those bots to see if it's reproducible. If you get a chance, it looks like there are a few man doc warnings to fix that we're now catching with the stricter CI checks.
Is this a known test failure?
Test: /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_create/zpool_create_features_009_pos (run as root) [00:23] [FAIL]
23:57:21.47 ASSERTION: verify '-o compatibility' in MOS object and cache file
23:57:21.69 SUCCESS: zpool create -f -o cachefile=/mnt/cachefile.816959 -o compatibility=legacy testpool loop0 loop1 loop2
23:57:21.71 compatibility: 'legacy'
23:57:23.31 compatibility: 'legacy'
23:57:23.31 SUCCESS: check_config legacy
23:57:43.93 SUCCESS: zpool export -F testpool
23:57:44.09 SUCCESS: zpool import -c /mnt/cachefile.816959 testpool
23:57:44.12 compatibility: 'legacy'
23:57:44.72 compatibility: 'legacy'
23:57:44.72 SUCCESS: check_config legacy
23:57:44.79 SUCCESS: zpool destroy -f testpool
23:57:44.99 SUCCESS: zpool create -f -o cachefile=/mnt/cachefile.816959 testpool loop0 loop1 loop2
23:57:45.01 SUCCESS: zpool set compatibility=legacy testpool
23:57:45.05 compatibility property missing in cache file
@allanjude it's a recently added test case, so we have less mileage on it, but I don't recall seeing it fail in the CI before. See 88a4833039b4a3f08139c5b69a2300424fddfd0f.
@allanjude it's a recently added test case, so we have less mileage on it, but I don't recall seeing it fail in the CI before. See 88a4833.
Does it make sense for it to use spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE); there? A zpool set is supposed to be synchronous. Otherwise, I think the test will need to add a zpool sync $poolname to ensure that the cachefile has actually been written before testing for it.
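A minimal sketch of that suggested test change, assuming the pool name is held in $TESTPOOL as in other ZTS scripts:

```sh
# Force a txg sync after 'zpool set' so the cachefile update is on disk
# before the test inspects it. $TESTPOOL is an assumed variable name.
log_must zpool set compatibility=legacy $TESTPOOL
log_must zpool sync $TESTPOOL
```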
Please say this PR isn't abandoned. ZFS locking down all zpools on an entire system due to a single pool being suspended should be the most critical problem to fix before anything else is even considered, and yet this showstopper of a design flaw has been largely ignored for many years now. This PR is badly needed. I apologize if I sound spoiled and whiny, but ZFS is the best fs ever, it's just ... well, unusable without this PR. I can't reboot my system every time a drive drops out of service. Not that I don't want to, I can't. Again, sorry for the tears and bitching, please don't let this PR die!
Please say this PR isn't abandoned. ZFS locking down all zpools on an entire system due to a single pool being suspended should be the most critical problem to fix before anything else is even considered, and yet this showstopper of a design flaw has been largely ignored for many years now. This PR is badly needed. I apologize if I sound spoiled and whiny, but ZFS is the best fs ever, it's just ... well, unusable without this PR. I can't reboot my system every time a drive drops out of service. Not that I don't want to, I can't. Again, sorry for the tears and bitching, please don't let this PR die!
We are still working on it. Hopefully we'll get the feedback addressed soon.
@allanjude Great to hear, thank you so much ❤️
+1, this feature is really needed for my use case.
@behlendorf Please, everyone badly need this! ;)
@allanjude @behlendorf Indeed, please don't let this PR die. It's really needed a lot.
FYI, I figured out an awfully hacky workaround for my backup setup, where a ZFS pool sits on top of LUKS encryption on top of a USB SSD, which avoids needing to reboot the machine whenever a ZFS pool fails to the SUSPENDED state because USB occasionally temporarily disconnects or I hotplug the USB cable. The workaround is to run cryptsetup luksClose, followed by zpool clear, followed by cryptsetup luksOpen, followed by zpool scrub, which will return the SUSPENDED pool back to a "healthy" state. I run this script every 60 seconds in a cron job to "auto repair" the pool if necessary, like so:
zhotplug-repair.txt
I wish there'd be a better way!
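A minimal sketch of that repair sequence (the device path, LUKS mapping, and pool names below are made-up placeholders, not the contents of the attached script):

```sh
#!/bin/sh
# Hacky auto-repair for a SUSPENDED pool on LUKS-on-USB, run from cron.
# "backuppool", "backupcrypt", and the device path are placeholders; the
# SUSPENDED check is an assumption about how the cron job decides to act.
if zpool status backuppool 2>/dev/null | grep -q SUSPENDED; then
    cryptsetup luksClose backupcrypt
    zpool clear backuppool
    cryptsetup luksOpen /dev/disk/by-id/usb-My-SSD-part1 backupcrypt
    zpool scrub backuppool
fi
```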
@behlendorf is there anything @wca needs to do to complete this? where is it stuck?
I'd love to see this work completed. The next steps needed to make that happen would be to address the outstanding review feedback, rebase the PR on the latest master source, and resolve any new failures reported by the test suite.