resilver_defer should have a threshold for triggering
Describe the feature you would like to see added to OpenZFS
Sometimes you want to replace multiple disks in a pool at once, so you run zpool replace foo disk1 disk2; zpool replace foo disk3 disk4; but the second replacement is queued to run after the first completes, even though restarting the resilver at that point would be almost free.
It would be nice if ZFS knew it should just restart the resilver when it is not very far along.
How will this feature improve OpenZFS?
It would avoid surprising people who issue multiple replacements in cases where deferring the second resilver wasn't actually saving them much work.
Additional context
N/A
Would a flag to zpool replace that aborts a currently running resilver, applies all queued replacements to the pool structure, and starts a new resilver be the way to go?
That would be less magic, since the administrator would be in control of the decision, as they IMHO should be. It would also spare us from spending time on what metric could be reasonable for this kind of decision.
I mean, just zpool resilver accomplishes that today. The point here is that the resilver_defer feature was intended to ensure you don't accidentally wind up spending significantly longer with reduced redundancy just because a second disk needed replacement. Its implementation happens to cause surprising behavior if you kick off multiple replacements at once and haven't already encountered this, and so don't know to restart by hand.
I also don't think people are going to get stuck picking the precisely optimal setting over starting from some conservative default and iterating.
If that is the case (which I likely missed, as the splitting up of the man pages did not really help me read up on what is new):
What is the point of this feature request, if there is already a way to trigger the desired behavior?
The desired behavior change is that users should not need to know this is necessary to do manually when they hit the common case of triggering multiple replaces in short succession, i.e. when very little would be lost by restarting immediately, and the resilvers would often finish markedly faster unless you have uncommon bottlenecks.
Maybe adding a status output to zpool replace that states something like
Replace operation is postponed until currently running resilver completes.
Should you want this replace to happen immediately, issue a `zpool resilver`.
when a defer happens?
The goal is to avoid people having to manually run something in a common scenario that used to work and was broken by a "feature". Just telling people the workaround every time is not a solution to that problem, and it could also break scripts that assume any output from zpool replace means an error occurred.
A version of this was also suggested as a possible improvement in the deferred resilver PR. As you might expect, there's no great answer for what the optimal thing to do is in all cases; it's complicated, as laid out in https://github.com/openzfs/zfs/pull/7732#issuecomment-422113740. That said, I don't see why we couldn't better handle the common case mentioned here where you're replacing drives in fairly rapid succession. Something as simple as allowing the resilver to restart if it has been running for less than 10 minutes and is, say, <50% complete would probably go a long way.
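For illustration, the "10 minutes and <50% complete" idea could be sketched roughly like this in plain shell. The names RESTART_WINDOW_SECS and RESTART_MAX_PCT are hypothetical placeholders, not existing ZFS tunables:

```shell
# Hypothetical heuristic: restart the running resilver to fold in a new
# replacement only when little progress would be thrown away.
should_restart_resilver() {
    # $1 = seconds the current resilver has been running
    # $2 = integer percent complete
    elapsed=$1
    pct=$2
    RESTART_WINDOW_SECS=600   # "less than 10 minutes"
    RESTART_MAX_PCT=50        # "say <50% complete"
    if [ "$elapsed" -lt "$RESTART_WINDOW_SECS" ] && [ "$pct" -lt "$RESTART_MAX_PCT" ]; then
        echo restart          # cheap to redo: restart now and include the new disk
    else
        echo defer            # too much progress: keep the current defer behavior
    fi
}

should_restart_resilver 120 5     # prints "restart"
should_restart_resilver 7200 79   # prints "defer"
```

Whatever form a real implementation took, the two cutoffs would presumably be exposed as module tunables so the conservative defaults could be adjusted.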
related discussion: https://github.com/openzfs/zfs/discussions/14731
It would be nice to have some kind of indication of which resilvers are actually running and which are deferred. zpool status gives no indication of the actual status, and the current documentation is misleading: https://openzfs.github.io/openzfs-docs/man/8/zpool-resilver.8.html#DESCRIPTION says "Starts a resilver of the specified pools. If an existing resilver is already running it will be restarted from the beginning. Any drives that were scheduled for a deferred resilver will be added to the new one. This requires the resilver_defer pool feature."
scripts# zpool replace Pool_A /dev/gptid/4c7b7166-6c57-11e9-86e0-001517321a31 ada0;
scripts# date;
Fri Apr  7 23:32:15 CEST 2023
scripts# zpool replace Pool_A /dev/gptid/527876b7-6c57-11e9-86e0-001517321a31 ada2;
scripts# date;
Fri Apr  7 23:32:37 CEST 2023
...
scripts# date ; zpool status Pool_A ;
Sun Apr  9 09:25:45 CEST 2023
  pool: Pool_A
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr  7 23:32:06 2023
        16.8T scanned at 144M/s, 16.6T issued at 143M/s, 21.0T total
        5.54T resilvered, 79.12% done, 08:56:34 to go
config:
        NAME                                            STATE     READ WRITE CKSUM
        Pool_A                                          DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/3cc42dae-6c57-11e9-86e0-001517321a31  ONLINE       0     0     0
            gptid/450fc446-6c57-11e9-86e0-001517321a31  ONLINE       0     0     0
            replacing-2                                 DEGRADED     0     0     0
              8552346261489016874                       UNAVAIL      0     0     0  was /dev/gptid/4c7b7166-6c57-11e9-86e0-001517321a31
              ada0                                      ONLINE       0     0     0  (resilvering)
            replacing-3                                 DEGRADED     0     0     0
              15661576951159810779                      UNAVAIL      0     0     0  was /dev/gptid/527876b7-6c57-11e9-86e0-001517321a31
              ada2                                      ONLINE       0     0     0  (resilvering)
            ada8                                        ONLINE       0     0     0
            ada10                                       ONLINE       0     0     0
errors: No known data errors
...
scripts# date ; zpool status Pool_A ;
Mon Apr 10 19:46:25 CEST 2023
  pool: Pool_A
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Apr  9 18:21:58 2023
        10.9T scanned at 125M/s, 10.5T issued at 120M/s, 21.0T total
        1.75T resilvered, 50.02% done, 1 days 01:23:08 to go
config:
        NAME                                            STATE     READ WRITE CKSUM
        Pool_A                                          DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/3cc42dae-6c57-11e9-86e0-001517321a31  ONLINE       0     0     0
            gptid/450fc446-6c57-11e9-86e0-001517321a31  ONLINE       0     0     0
            ada0                                        ONLINE       0     0     0
            replacing-3                                 DEGRADED     0     0     0
              15661576951159810779                      UNAVAIL      0     0     0  was /dev/gptid/527876b7-6c57-11e9-86e0-001517321a31
              ada2                                      ONLINE       0     0     0  (resilvering)
            ada8                                        ONLINE       0     0     0
            ada10                                       ONLINE       0     0     0
errors: No known data errors
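To make the complaint above concrete: the only machine-readable marker in that output is the "(resilvering)" tag, which is attached to both replacements even though only one resilver is active at a time. A small sketch (plain shell over captured zpool status text, not a live pool) that pulls out the tagged vdevs:

```shell
# List vdev names flagged "(resilvering)" in saved `zpool status` output.
# This shows which replacements exist, but cannot distinguish an active
# resilver from a deferred one - the information simply isn't printed.
resilvering_vdevs() {
    awk '/\(resilvering\)/ { print $1 }'
}

# Fragment of the status output captured above:
status='            replacing-3                                 DEGRADED     0     0     0
              15661576951159810779                      UNAVAIL      0     0     0
              ada2                                      ONLINE       0     0     0  (resilvering)'
printf '%s\n' "$status" | resilvering_vdevs   # prints "ada2"
```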