
Support forced export of ZFS pools

Open · wca opened this pull request 4 years ago • 59 comments

Motivation and Context

This change enables users to forcibly export a pool in a manner which safely discards all pending dirty data, transaction groups, associated zios, metaslab changes, etc. It allows a regular zpool export to fail out so that it releases the namespace lock to enable a forced export. It is able to do this regardless of whether the current pool activity involves send, receive, or POSIX I/O.

This allows users to continue using their system without rebooting, in the event the disk cannot be recovered in an online manner, or the user prefers to resume use of their pool at a later time. Since ZFS can simply resume from the most recent consistent transaction group, the latter is easily achieved.

Closes #3461

Description

This is primarily of use when a pool has lost its disk, while the user doesn't care about any pending (or otherwise) transactions.
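A rough sketch of the intended workflow, with a placeholder pool name (the hard-force flag is the -F option discussed later in this thread):

# A plain export of a suspended pool can now fail out instead of hanging
# on the namespace lock indefinitely.
zpool export tank || echo "export failed; pool may be suspended"

# A forced export discards pending dirty data and releases the pool.
zpool export -F tank

# A later import simply resumes from the most recent consistent txg.
zpool import tank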

Implement various control methods to make this feasible:

  • txg_wait can now take a NOSUSPEND flag, in which case the caller will be alerted if their txg can't be committed. This is primarily of interest for callers that would normally pass TXG_WAIT, but don't want to wait if the pool becomes suspended, which allows unwinding in some cases, specifically when one is attempting a non-forced export. Without this, the non-forced export would preclude a forced export by virtue of holding the namespace lock indefinitely.
  • txg_wait also returns failure for TXG_WAIT users if a pool is actually being force exported. Adjust most callers to tolerate this.
  • spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
  • DMU objset "killer" flag which may be set on an objset being forcibly exported / unmounted.
  • SPA "killer" flag which may be set on a pool being forcibly exported.
  • DMU send/recv now use an interruption mechanism which relies on the SPA killer being able to enumerate datasets and closing any send/recv streams, causing their EINTR paths to be invoked.
  • ZIO now has a cancel entry point, which tells all suspended zios to fail, and which suppresses the failures for non-CANFAIL users.
  • metaslab, etc. cleanup, which consists of simply throwing away any changes that were not able to be synced out.
  • Linux specific: introduce a new tunable, zfs_forced_export_unmount_enabled, which allows the filesystem to remain in a modified 'unmounted' state upon exiting zpl_umount_begin, to achieve parity with FreeBSD and illumos, which have VFS-level support for yanking filesystems out from under users. However, this only helps when the user is actively performing I/O; it does not help with processes merely sitting on the filesystem (e.g. holding open file descriptors in it). In particular, this allows test #3 below to pass on Linux. A sketch of enabling the tunable follows this list.
  • Add basic logic to zpool to indicate a force-exporting pool, instead of crashing due to lack of config, etc.
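The tunable can be flipped like any other OpenZFS module parameter; the sysfs path below follows the usual convention and is shown purely as an illustration:

# Enable the modified 'unmounted' state for forced unmounts (Linux only).
echo 1 > /sys/module/zfs/parameters/zfs_forced_export_unmount_enabled

# Or persistently via modprobe configuration (file name is a placeholder).
echo "options zfs zfs_forced_export_unmount_enabled=1" > /etc/modprobe.d/zfs-forced-export.conf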

Add tests which cover the basic use cases:

  • Force export while a send is in progress
  • Force export while a recv is in progress
  • Force export while POSIX I/O is in progress

How Has This Been Tested?

  • New ZFS Test Suite tests covering the three main pool activity scenarios.
  • Testing in a production environment that focuses on the use of send/recv to a network-based disk, which can fail and not return "for a while". The user doesn't mind losing the last few transaction groups and is able to pick up later.
  • Existing ZFS Test Suite tests, to check for any unexpected breakage of non-forced-export scenarios.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [x] I have added tests to cover my changes.
  • [x] I have run the ZFS Test Suite with this change applied.
  • [ ] All commit messages are properly formatted and contain Signed-off-by.

The unchecked items on this checklist are pending review. I am seeking feedback on whether to break out any of the commits, rather than squashing all of them into the primary one.

wca avatar Oct 17 '20 16:10 wca

Does this address the various zpool/zfs commands blocking when the pool is in distress? i.e., will this command succeed under such conditions, or will it get blocked?

gdevenyi avatar Oct 17 '20 16:10 gdevenyi

Codecov Report

Patch coverage: 62.00% and project coverage change: +4.49% :tada:

Comparison is base (161ed82) 75.17% compared to head (f168cb6) 79.66%.

:exclamation: Current head f168cb6 differs from pull request most recent head 0c5b2fa. Consider uploading reports for the commit 0c5b2fa to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11082      +/-   ##
==========================================
+ Coverage   75.17%   79.66%   +4.49%     
==========================================
  Files         402      398       -4     
  Lines      128071   126235    -1836     
==========================================
+ Hits        96283   100571    +4288     
+ Misses      31788    25664    -6124     
Flag Coverage Δ
kernel 80.24% <61.60%> (+1.48%) :arrow_up:
user 65.35% <40.72%> (+17.93%) :arrow_up:


Impacted Files Coverage Δ
include/sys/dmu.h 100.00% <ø> (ø)
include/sys/dsl_dataset.h 100.00% <ø> (ø)
module/zfs/vdev_initialize.c 98.10% <ø> (+4.58%) :arrow_up:
module/zfs/vdev_label.c 92.22% <0.00%> (+1.35%) :arrow_up:
module/zfs/metaslab.c 94.52% <12.50%> (+1.91%) :arrow_up:
module/zfs/dsl_dataset.c 90.82% <18.75%> (-0.44%) :arrow_down:
module/zfs/vdev_rebuild.c 90.75% <31.57%> (+13.09%) :arrow_up:
module/zfs/dmu_send.c 84.42% <33.33%> (+5.04%) :arrow_up:
module/zfs/vdev_removal.c 96.50% <33.33%> (+3.79%) :arrow_up:
module/zfs/vdev_trim.c 93.50% <38.46%> (+17.50%) :arrow_up:
... and 33 more

... and 194 files with indirect coverage changes


codecov[bot] avatar Oct 17 '20 22:10 codecov[bot]

Ran into #9067 on the last full ZTS run I did locally, but it looks better now.

wca avatar Oct 26 '20 03:10 wca

Does this address the various zpool/zfs commands blocking when the pool is in distress? i.e., will this command succeed under such conditions, or will it get blocked?

It depends on the specific issue. Most of the time, if a pool is suspended, zpool/zfs commands simply fail and get kicked out. zpool export in particular will now either exit early if not hard-forced (-F), or will forcibly drop all context and export the pool; before this change, a non-forced export would hang forever while holding the namespace lock, blocking everything that depends on it.

There may be other cases that hold the namespace lock and get stuck on a suspension, but they can be fixed in a follow-up.

wca avatar Oct 26 '20 03:10 wca

@wca @allanjude sorry about the delay in getting back to this. Thanks for addressing my initial feedback. I should be able to dig further into the PR this week and get you some additional comments and testing.

As part of that work I took the liberty of getting this PR rebased on the latest master, resolving the relevant conflicts, and squashing a few commits. I appreciate how you originally kept the commits separate for the benefit of the reviewers, but I'm not sure keeping the individual review fixes separate is really helpful at this point. I didn't make any changes to the real functionality. You can find the changes in my forced-export-squashed branch. The squashed branch looks like this:

48f1e2429 zfs: support force exporting pools
a107a70c1 zpool_export: make test suite more usable
d07a31049 check_prop_source: improve output clarity
07f9ca481 zfs_receive: remove duplicate check_prop_source
9d64ea495 logapi: cat output file instead of printing
6e659d803 spa_export_common: refactor common exit points

For next steps what I'd suggest is:

  1. Open a new PR for each bug fix which isn't really core to this feature. There's no reason we can't get those changes integrated right away as long as they pass testing and are entirely independent. This would be a107a70c1, d07a31049, 07f9ca481, 9d64ea495, and 6e659d803. If there are other unrelated hunks we should pull out into their own PRs, that'd be nice too.

  2. Please review what I've done and go ahead and force update this PR with 48f1e2429 (or a modified version of it you're happy with). This way we can get a fresh run from the CI and see where we stand. It looks like the current version did encounter at least one panic with a full ZTS run. That new commit passes the new test cases locally for me on Linux and sanity.run, but so far that's all I've tried.

behlendorf avatar Jan 12 '21 01:01 behlendorf

I just pushed an update that includes @behlendorf suggested break-out, and responds to most of the comments. The earlier commits are now in these PRs: #11514 #11515 #11516 #11517 #11518.

wca avatar Jan 25 '21 00:01 wca

OK, #11514 #11515 #11516 #11517 #11518 have all been merged and can be dropped from this PR with a rebase.

behlendorf avatar Jan 26 '21 21:01 behlendorf

~@wci~ @wca when you get a chance to rebase this that would be great.

behlendorf avatar Feb 02 '21 17:02 behlendorf

Just took an initial look at this. How does it interact with force unmounting of filesystems? Does it mean that Linux can now support zfs unmount -f, forcing the unmount even if there are fd's open in the filesystem?

On Linux, there are limits to the extent to which filesystems can be forcibly unmounted, which is why I tried to explain the limitation regarding file descriptors held open by processes. On FreeBSD, however, the VFS will disassociate file descriptors from the filesystem, so it can work in that scenario. I believe this also applies to illumos, but I haven't tested there.

wca avatar Feb 06 '21 17:02 wca

@wca since this change modifies the libzfs library ABI you're going to need to generate a new lib/libzfs/libzfs.abi file and include it in this PR. We added this infrastructure to make sure we never accidentally change the ABI and it's why the style CI bots failed.

The good news is it's straightforward to do. On Linux (Debian and Ubuntu) you just need to install the abigail-tools package, and then invoke the `storeabi` Makefile target in the `lib/libzfs/` directory. That will generate a new `libzfs.abi` file you can add to the commit. For the benefit of any reviewers can you please also mention in the commit message or PR description which interfaces have changed.

# Requires the abigail-tools package (on Debian/Ubuntu: apt install abigail-tools).
cd lib/libzfs/
make storeabi

Thanks for the info, I never handled this previously. The libzfs.abi delta seems pretty big, is that normal?

Looking over the test results, I see that everything pretty much passed with the exception of these two persistent ARC tests. According to the logs it seems that after an export/import cycle there weren't any cache hits when normally there should have been. That would make sense to me if the pool has been force exported, but neither of these pools use the force flag. It seems like we should be waiting for some inflight l2arc writes in the normal export case and we're not.

    FAIL l2arc/persist_l2arc_004_pos (expected PASS)
    FAIL l2arc/persist_l2arc_005_pos (expected PASS)

I've looked into these failures for a while, and I'm still not sure why they occur. I agree that the results appear to show no cache hits after an export/import cycle, when there should be some.

The only thing directly related that's modified in this PR is the addition of l2arc_spa_rebuild_stop, which I think is needed anyway (i.e., we could probably factor it out of this PR).

These tests still fail even if I change the three 'exiting' functions to return B_FALSE and remove that call in spa_export_common. Yet all of the behavior modifications can only occur when hardforce=B_TRUE is passed to spa_export_common, or when a filesystem is force unmounted, neither of which applies to these tests...

wca avatar Feb 15 '21 04:02 wca

Thanks for the info, I never handled this previously. The libzfs.abi delta seems pretty big, is that normal?

I'm still pretty new to it myself. But from what I've seen so far the answer is "yes". But that's not really a problem since it's a generated file and we only update it when we're knowingly making ABI changes.

I've looked into these failures for a while, and I'm still not sure why they occur. I agree that the results appear to show no cache hits after an export/import cycle, when there should be some.

Oddly I wasn't able to reproduce these locally in a VM, so it may be some kind of race. I'll give it another try later. But if you're able to easily reproduce it I'd suggest taking a look at the l2_rebuild_* arcstats to see how much data was rebuilt. Maybe the new l2arc_spa_rebuild_stop call in spa_export_common is responsible and we're not writing the log blocks. It looks to me like it is called regardless of whether the hardforce option is passed.
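On Linux the persistent L2ARC rebuild counters are exposed through the arcstats kstat, so something along these lines (just a sketch) would show how much was actually rebuilt after the import:

grep '^l2_rebuild' /proc/spl/kstat/zfs/arcstats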

behlendorf avatar Feb 16 '21 00:02 behlendorf

@wca would you mind rebasing this again on the latest master. There are a few minor conflicts to resolve, and you'll need to generate a new libzfs.abi file; the recent "compatibility" feature PR (#11468) also changed the ABI a little. I'll see if I can reproduce the test failures; if we can get them resolved we can get this PR wrapped up and merged.

behlendorf avatar Feb 19 '21 00:02 behlendorf

@wca it looks like there's still at least one path in the force-export code which needs to be updated. This was caught by the CI:

http://build.zfsonlinux.org/builders/CentOS%208%20x86_64%20%28TEST%29/builds/2902/steps/shell_4/logs/console

behlendorf avatar Feb 21 '21 22:02 behlendorf

@behlendorf I'm not sure what to make of the results from the last run. I'm unable to reproduce the 1-2 ZTS test failures that only happen on the Ubuntu 18.04 test or occasionally on Ubuntu 20.04. Please advise.

wca avatar Mar 09 '21 02:03 wca

@wca it looks to me like we're now just seeing a failure in the checkpoint_lun_expsz test case. I'll see if I can reproduce this to debug it, it looks like most of the Linux builders hit this so I'm optimistic it'll be reproducible.

behlendorf avatar Mar 09 '21 05:03 behlendorf

would this fix the following case?

pool 1: working
pool 2: on external usb hdd

power fails on usb hdd

pool 2 enters suspended state
processes enter and remain in D state (mlocate, smbd)
all zpool commands hang in D state, including a simple zpool status

mailinglists35 avatar May 10 '21 09:05 mailinglists35

would this fix the following case?

pool 1: working
pool 2: on external usb hdd

power fails on usb hdd

pool 2 enters suspended state
processes enter and remain in D state (mlocate, smbd)
all zpool commands hang in D state, including a simple zpool status

Yes, that is the type of case this is meant to resolve.

allanjude avatar May 20 '21 15:05 allanjude

Anyone have any insight into these CentOS 8 test failures?

allanjude avatar Jun 21 '21 18:06 allanjude

Thanks for rebasing this. The failures look unrelated to me; I've gone ahead and resubmitted the build on those bots to see if they're reproducible. If you get a chance, it looks like there are a few man page warnings to fix that we're now catching with the stricter CI checks.

behlendorf avatar Jun 22 '21 19:06 behlendorf

Is this a known test failure?

Test: /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_create/zpool_create_features_009_pos (run as root) [00:23] [FAIL]
23:57:21.47 ASSERTION: verify '-o compatibility' in MOS object and cache file
23:57:21.69 SUCCESS: zpool create -f -o cachefile=/mnt/cachefile.816959 -o compatibility=legacy testpool loop0 loop1 loop2
23:57:21.71     compatibility: 'legacy'
23:57:23.31         compatibility: 'legacy'
23:57:23.31 SUCCESS: check_config legacy
23:57:43.93 SUCCESS: zpool export -F testpool
23:57:44.09 SUCCESS: zpool import -c /mnt/cachefile.816959 testpool
23:57:44.12     compatibility: 'legacy'
23:57:44.72         compatibility: 'legacy'
23:57:44.72 SUCCESS: check_config legacy
23:57:44.79 SUCCESS: zpool destroy -f testpool
23:57:44.99 SUCCESS: zpool create -f -o cachefile=/mnt/cachefile.816959 testpool loop0 loop1 loop2
23:57:45.01 SUCCESS: zpool set compatibility=legacy testpool
23:57:45.05 compatibility property missing in cache file

allanjude avatar Jul 06 '21 21:07 allanjude

@allanjude it's a recently added test case, so we have less mileage on it, but I don't recall seeing it fail in the CI before. See 88a4833039b4a3f08139c5b69a2300424fddfd0f.

behlendorf avatar Jul 06 '21 21:07 behlendorf

@allanjude it's a recently added test case, so we have less mileage on it, but I don't recall seeing it fail in the CI before. See 88a4833.

Does it make sense for it to use spa_async_request(spa, SPA_ASYNC_CONFIG_UPDATE); there? A zpool set is supposed to be synchronous. Otherwise, I think the test will need to add a zpool sync $poolname to ensure that the cachefile has actually been written before testing for it.
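For example, the latter would amount to something like this in the test (pool name as in the log above):

zpool set compatibility=legacy testpool
zpool sync testpool    # make sure the updated cachefile has actually been written
# ...then check the cache file for the compatibility property as before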

allanjude avatar Jul 07 '21 14:07 allanjude

Please say this PR isn't abandoned. ZFS locking down all zpools on an entire system due to a single pool being suspended should be the most critical problem to fix before anything else is even considered, and yet this showstopper of a design flaw has been largely ignored for many years now. This PR is badly needed. I apologize if I sound spoiled and whiny, but ZFS is the best fs ever, it's just ... well, unusable without this PR. I can't reboot my system every time a drive drops out of service. Not that I don't want to, I can't. Again, sorry for the tears and bitching, please don't let this PR die!

DanielSmedegaardBuus avatar Dec 19 '21 15:12 DanielSmedegaardBuus

Please say this PR isn't abandoned. ZFS locking down all zpools on an entire system due to a single pool being suspended should be the most critical problem to fix before anything else is even considered, and yet this showstopper of a design flaw has been largely ignored for many years now. This PR is badly needed. I apologize if I sound spoiled and whiny, but ZFS is the best fs ever, it's just ... well, unusable without this PR. I can't reboot my system every time a drive drops out of service. Not that I don't want to, I can't. Again, sorry for the tears and bitching, please don't let this PR die!

We are still working on it. Hopefully we'll get the feedback addressed soon.

allanjude avatar Dec 19 '21 16:12 allanjude

@allanjude Great to hear, thank you so much ❤️

DanielSmedegaardBuus avatar Dec 19 '21 16:12 DanielSmedegaardBuus

@allanjude Great to hear, thank you so much :heart:

+1, this feature is really needed for my use case.

DurvalMenezes avatar Dec 19 '21 17:12 DurvalMenezes

@behlendorf Please, everyone badly need this! ;)

liyimeng avatar Feb 04 '22 06:02 liyimeng

@allanjude @behlendorf Indeed, please don't let this PR die. It's really needed a lot.

FYI, I figured out an awfully hacky workaround for my backup setup, where a ZFS pool sits on top of LUKS encryption on top of a USB SSD. It avoids needing to reboot the machine whenever the pool falls into the SUSPENDED state because USB occasionally disconnects temporarily or I hotplug the USB cable. The workaround is to run cryptsetup luksClose, followed by zpool clear, followed by cryptsetup luksOpen, followed by zpool scrub, which returns the SUSPENDED pool to a "healthy" state. I run this script every 60 seconds in a cron job to "auto repair" the pool if necessary, like so: zhotplug-repair.txt
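In case it helps anyone, the sequence boils down to something like the sketch below; device, mapper, and pool names are placeholders, and the real logic lives in the attached script:

#!/bin/sh
# Auto-repair a SUSPENDED pool sitting on LUKS-on-USB, per the sequence above.
POOL=backup
MAPPER=luks-backup
DEV=/dev/disk/by-id/usb-MyExternalSSD-part1

cryptsetup luksClose "$MAPPER"          # drop the stale LUKS mapping
zpool clear "$POOL"                     # clear the pool's suspended/error state
cryptsetup luksOpen "$DEV" "$MAPPER"    # re-open the encrypted device
zpool scrub "$POOL"                     # verify data and return the pool to a healthy state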

I wish there'd be a better way!

whoschek avatar Apr 14 '22 21:04 whoschek

@behlendorf is there anything @wca needs to do to complete this? where is it stuck?

mailinglists35 avatar Aug 10 '22 13:08 mailinglists35

I'd love to see this work completed. The next steps needed to make that happen would be to address the outstanding review feedback, rebase the PR on the latest master source, and resolve any new failures reported by the test suite.

behlendorf avatar Aug 10 '22 16:08 behlendorf