
vdev_open: clear async fault flag after reopen

Open · robn opened this pull request 1 year ago · 1 comment

Motivation and Context

After #15839, vdev_fault_wanted is set on a vdev after a probe fails. An end-of-txg async task is charged with actually faulting the vdev.

In a single-disk pool, the probe failure will degrade the last disk, and then suspend the pool. However, vdev_fault_wanted is not cleared. After the pool returns, the transaction finishes and the async task runs and faults the vdev, which suspends the pool again.
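
For readers without the code in front of them, the sketch below is a minimal toy model of that interaction. It is an illustration only, not the OpenZFS code: the struct is cut down to the two flags involved, and the real logic is spread across the probe completion path, the SPA async task, and the txg machinery.

```c
/*
 * Toy model of the pre-fix behaviour (illustration only, not the actual
 * OpenZFS code).  A failed probe records "fault wanted"; a later async
 * task acts on it.  Nothing ever clears the flag, so the fault is
 * re-applied after the pool has been cleared and reopened.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct vdev {
	const char *vdev_path;
	bool vdev_fault_wanted;	/* set by a failed probe */
	bool vdev_faulted;	/* applied later by the async task */
} vdev_t;

static void
probe_failed(vdev_t *vd)
{
	/* Record intent only; the actual fault happens at end of txg. */
	vd->vdev_fault_wanted = true;
}

static void
async_fault_task(vdev_t *vd)
{
	if (vd->vdev_fault_wanted)
		vd->vdev_faulted = true;
}

static void
reopen(vdev_t *vd)
{
	/* Pre-fix: the vdev comes back healthy, but the flag survives. */
	vd->vdev_faulted = false;
}

int
main(void)
{
	vdev_t vd = { .vdev_path = "disk0" };

	probe_failed(&vd);	/* only disk fails: pool degrades, suspends */
	reopen(&vd);		/* operator clears/reopens the pool */
	async_fault_task(&vd);	/* stale flag faults the vdev again */

	printf("%s faulted again after reopen: %s\n", vd.vdev_path,
	    vd.vdev_faulted ? "yes" : "no");
	return (0);
}
```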

Description

The fix is simple: when reopening a vdev, clear the async fault flag. If the vdev is still failed, the startup probe will quickly notice and degrade/suspend it again. If not, all is well!
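
Continuing the toy model above (the real change is presumably a small clear in the reopen path of vdev_open() in module/zfs/vdev.c, per the title; this is a sketch, not the actual patch), the fixed reopen step simply resets the pending-fault flag:

```c
/*
 * Sketch of the fix in terms of the toy model above.
 */
static void
reopen_fixed(vdev_t *vd)
{
	vd->vdev_faulted = false;

	/*
	 * Drop any async fault requested before the reopen.  If the
	 * device is genuinely still bad, the next probe will fail and
	 * set the flag again, so nothing is lost by clearing it here.
	 */
	vd->vdev_fault_wanted = false;
}
```

With the flag cleared, the end-of-txg async task finds nothing to fault, and the pool stays imported after it resumes.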

How Has This Been Tested?

A test case is included. It fails before this change and passes with it applied.

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [x] I have added tests to cover my changes.
  • [ ] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

robn · Jun 11 '24 11:06

If this can handle the transient USB faults on my USB 3.1 Gen 2 drive cages causing pools to go offline until reboot...

satmandu · Jun 13 '24 12:06

Further testing shows the bug's impact is a little wider: if multiple disks are lost in the same txg, causing the pool to suspend, then after the pool returns they will all re-fault at the end of the txg and the pool will fail again. This happens when a disk array or backplane fails, taking out multiple disks at the same moment. Not a huge deal, and the fix here takes care of it in the same way.

robn · Jul 17 '24 00:07

Merged as 393b7ad6952217a7c0823f705f5b4a41d6b4f3f5 and 5de3ac223623d5348e491cc89c70a803ddcd7184

tonyhutter · Jul 17 '24 17:07