vdev probe to slow disk can stall mmp write checker
Motivation and Context
MMP Updates
When MMP is enabled on a pool, the `mmp_thread()` will update uberblocks for the leaf vdevs at a regular cadence. It also checks that these updates completed within the appropriate time and, if they did not, it suspends the pool. The uberblock updates occur at a rate that ensures that even if one disk is slow to respond, a subsequent update to a different disk will complete in time. In my initial investigation, one drive had response times greater than the write-check threshold, but that slow update was effectively ignored and did not trigger a pool suspension.
Before choosing a vdev to update, the `mmp_write_uberblock()` code acquires a reader lock on the spa config. Here the mmp thread assumes that acquiring this lock never takes so long that it cannot complete the uberblock writes at a regular cadence (on my config this threshold was < 10 seconds).
VDEV Probes
A vdev probe issues some read and write requests to the pad area of the ZFS labels, and in the associated zio done callback it will post a zevent for `FM_EREPORT_ZFS_PROBE_FAILURE` if it fails to successfully complete at least one read and one write. The caller of `vdev_probe()` can take action; in the case of `vdev_open()`, it will change the vdev state to faulted if the probe returns an error (`ENXIO`).
When an unexpected error is encountered during `zio_vdev_io_done()`, it will initiate a `vdev_probe()` operation. The `vdev_probe()` function will probe the disk and additionally request a `spa_async_request(spa, SPA_ASYNC_PROBE)`, which will later perform a `vdev_reopen()`, which calls `vdev_probe()` again, this time taking action if the probe fails by faulting the disk.
This `spa_async_request` acquires the spa config lock as a writer for the entirety of the `vdev_reopen()`, which includes a second set of probe I/O (3 reads, 3 writes). When a slow disk is involved, this means the spa config lock can be held as writer for a very long duration.
In summary, a single disk that is throwing errors and also has long response times (multiple seconds) will cause an MMP-induced pool suspension due to a long vdev probe operation.
Description
The crux of the problem is that a `vdev_probe()` triggered by unexpected errors holds the spa config lock across multiple I/Os issued to a suspect disk.
One solution would be to downgrade the lock while waiting for the issued I/O to complete. Given the complexity of the spa config lock, this seems like it would present some challenges.
Another solution is to avoid calling `vdev_probe()` twice. Instead, if the initial probe I/O did not successfully read or write, take the same action seen in `vdev_open()` and change the vdev state to faulted. We can use a `spa_async_request` to make the change in syncing context (as was done with the `vdev_reopen()`).
The net change here is that we no longer reopen the device for the probe but strictly perform the same I/O tests, and then, if we observe a failure, use a `spa_async_request` to change the vdev state.
Question for reviewers: Does a reopen provide some other benefit aside from the probe I/O? For example, is the vdev label validation or some other aspect of an open required or beneficial somehow?
How Has This Been Tested?
Added a new ZTS test, `functional/mmp/mmp_write_slow_disk`, that induces disk errors with slow response times and triggers an MMP pool suspension on an unpatched zfs. With the fix in place, the suspension no longer occurs.
Also ran the existing `functional/mmp` tests.
Manually confirmed that a `vdev_probe()` from `zio_vdev_io_done()` will continue to fault the vdev if the probe I/Os fail.
Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [x] I have added tests to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain `Signed-off-by`.
Rebased to address merge conflicts
Allow `zpool clear` for an MMP-suspended pool if there is no evidence of remote host activity
Hi @don, I'm confused by the GitHub UI. Do you need my review now, or are you reworking some part of the PR? Thanks
@ofaaland I was looking for a review of the latest changes. In particular the zpool clear changes. Thanks
@ofaaland When do you think you'll have time to review this?
Hi @allanjude I'll review it tomorrow. Sorry for the long delay.
I think I just ran into this in #16078 on a large HDD zpool, so it would be fantastic if this can make it into a 2.2.4 release.
Note, this occurred shortly after upgrading from 2.1.15 to 2.2.3: is this more likely to happen with 2.2.x than 2.1.x, or was I just unlucky?
@behlendorf I'll take another look at re-enabling after suspend path
I was able to clear (un-suspend) a pool (raidz2, 8 disks) that was suspended by MMP write checks. There was no other host involved. I would like to understand Brian's failure case.