
vdev_open: clear async fault flag after reopen

Open · robn opened this pull request 1 year ago · 1 comment

Motivation and Context

After #15839, vdev_fault_wanted is set on a vdev after a probe fails. An end-of-txg async task is charged with actually faulting the vdev.

In a single-disk pool, the probe failure will degrade the last disk, and then suspend the pool. However, vdev_fault_wanted is not cleared. After the pool returns, the transaction finishes and the async task runs and faults the vdev, which suspends the pool again.
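
For readers without the code in front of them, the sketch below is a minimal toy model of that interaction. It is an illustration only, not the OpenZFS code: the struct is cut down to the two flags involved, and the real logic is spread across the probe completion path, the SPA async task, and the txg machinery.

```c
/*
 * Toy model of the pre-fix behaviour (illustration only, not the actual
 * OpenZFS code).  A failed probe records "fault wanted"; a later async
 * task acts on it.  Nothing ever clears the flag, so the fault is
 * re-applied after the pool has been cleared and reopened.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct vdev {
	const char *vdev_path;
	bool vdev_fault_wanted;	/* set by a failed probe */
	bool vdev_faulted;	/* applied later by the async task */
} vdev_t;

static void
probe_failed(vdev_t *vd)
{
	/* Record intent only; the actual fault happens at end of txg. */
	vd->vdev_fault_wanted = true;
}

static void
async_fault_task(vdev_t *vd)
{
	if (vd->vdev_fault_wanted)
		vd->vdev_faulted = true;
}

static void
reopen(vdev_t *vd)
{
	/* Pre-fix: the vdev comes back healthy, but the flag survives. */
	vd->vdev_faulted = false;
}

int
main(void)
{
	vdev_t vd = { .vdev_path = "disk0" };

	probe_failed(&vd);	/* only disk fails: pool degrades, suspends */
	reopen(&vd);		/* operator clears/reopens the pool */
	async_fault_task(&vd);	/* stale flag faults the vdev again */

	printf("%s faulted again after reopen: %s\n", vd.vdev_path,
	    vd.vdev_faulted ? "yes" : "no");
	return (0);
}
```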

Description

The fix is simple: when reopening a vdev, clear the async fault flag. If the vdev is still failed, the startup probe will quickly notice and degrade/suspend it again. If not, all is well!
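
Continuing the toy model above (the real change is presumably a small clear in the reopen path of vdev_open() in module/zfs/vdev.c, per the title; this is a sketch, not the actual patch), the fixed reopen step simply resets the pending-fault flag:

```c
/*
 * Sketch of the fix in terms of the toy model above.
 */
static void
reopen_fixed(vdev_t *vd)
{
	vd->vdev_faulted = false;

	/*
	 * Drop any async fault requested before the reopen.  If the
	 * device is genuinely still bad, the next probe will fail and
	 * set the flag again, so nothing is lost by clearing it here.
	 */
	vd->vdev_fault_wanted = false;
}
```

With the flag cleared, the end-of-txg async task finds nothing to fault, and the pool stays imported after it resumes.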

How Has This Been Tested?

A test case is included. It fails before this change and passes with it applied.

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [x] I have added tests to cover my changes.
  • [ ] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

robn · Jun 11 '24 11:06

If this can handle the transient USB faults on my USB 3.1 Gen 2 drive cages causing pools to go offline until reboot...

satmandu · Jun 13 '24 12:06

Further testing shows the bug's impact is a little wider: if multiple disks are lost in the same txg, causing the pool to suspend, then after the pool returns they will all re-fault at the end of the txg and the pool will fail again. This happens when a disk array or backplane fails, taking out multiple disks at the same moment. Not a huge deal, and the fix here takes care of it in the same way.

robn · Jul 17 '24 00:07

Merged as 393b7ad6952217a7c0823f705f5b4a41d6b4f3f5 and 5de3ac223623d5348e491cc89c70a803ddcd7184

tonyhutter · Jul 17 '24 17:07