
scrub resets after reboot

Open artee666 opened this issue 4 years ago • 40 comments

Distribution Name: archlinux
Distribution Version: rolling
Linux Kernel: 5.4.0-rc1-mainline
Architecture: x86_64
ZFS Version: 0.8.2
SPL Version: 0.8.2

Describe the problem you're observing

Scrub resets its progress after reboot

Describe how to reproduce the problem

Start scrub, check the progress, reboot, check the progress again
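(For reference, roughly the commands involved; "tank" is a placeholder pool name:)

zpool scrub tank
zpool status tank      # note "X scanned, Y issued, Z total ... N% done"
# reboot, wait for the pool to be imported, then:
zpool status tank      # progress is back at 0% instead of resuming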

artee666 avatar Oct 02 '19 05:10 artee666

Did you wait 2 hours after starting?

scineram avatar Oct 02 '19 17:10 scineram

Did you try to pause scrubbing before the reboot? In theory that should make it continue from the point where it paused.

I've never tried that myself; please report if it works.

-p

Pause scrubbing. Scrub pause state and progress are periodically synced to disk. If the system is restarted or pool is exported during a paused scrub, even after import, scrub will remain paused until it is resumed. Once resumed the scrub will pick up from the place where it was last checkpointed to disk. To resume a paused scrub issue zpool scrub again.
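(For reference, a minimal sketch of that pause/resume workflow; "tank" is a placeholder pool name:)

zpool scrub -p tank    # pause the scrub; pause state and progress are periodically synced to disk
zpool scrub tank       # later: resume, picking up from the last on-disk checkpoint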

devZer0 avatar Oct 02 '19 19:10 devZer0

I've tried pausing the scrub (at approx. 2.54 %), rebooting, and resuming the scrub, and it started from scratch :(

I know that on Debian with ZFS 0.7.9 the scrub continued after a reboot without any need for manually pausing and resuming the scrub (which is not working for me anyway).

artee666 avatar Oct 02 '19 20:10 artee666

I can confirm it resumes correctly with 0.7.13.

devZer0 avatar Oct 03 '19 21:10 devZer0

@devZer0 I confirm that even when pausing a scrub, a reboot causes the scrub to restart. This applies to ZFS 0.8.x when using the new sequential scrub only (legacy scrub works as expected).

shodanshok avatar Nov 28 '19 21:11 shodanshok

@artee666 Did you try to wait zfs_scan_checkpoint_intval seconds (7200 by default) before rebooting?

shodanshok avatar Nov 29 '19 07:11 shodanshok

@shodanshok I think I have not... Will try to set this parameter to 10 minutes and see. Will report tomorrow...

artee666 avatar Nov 30 '19 22:11 artee666

@shodanshok So I've set zfs_scan_checkpoint_intval to 600 seconds, waited 11 minutes, and rebooted, and the scrub started all over again.

It would also be nice to create such a checkpoint when properly rebooting or shutting down the computer (on ZFS module unload?), because in the worst case scenario 7199 seconds of scrub progress could be lost.
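(For reference, a sketch of how this tunable can be changed, in case anyone wants to try the same; the 600-second value is just an example:)

# runtime (as root), value in seconds
echo 600 > /sys/module/zfs/parameters/zfs_scan_checkpoint_intval

# persistent across reboots, via a modprobe option, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_scan_checkpoint_intval=600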

artee666 avatar Dec 01 '19 08:12 artee666

Can you test with a 300-second interval and wait 20 minutes?

scineram avatar Dec 01 '19 21:12 scineram

@scineram I've tried it, and the scrub started from 0 after the reboot.

I've set the param zfs_scan_legacy to 1 and this kinda solves this issue for me.
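(For reference, a sketch of that workaround; switching should only affect scans started after the change:)

# runtime (as root): fall back to the legacy, non-sequential scan code
echo 1 > /sys/module/zfs/parameters/zfs_scan_legacy

# persistent variant, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_scan_legacy=1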

artee666 avatar Dec 02 '19 07:12 artee666

Just to make sure everyone understands the gravity of this: it affects not only scrubs but also resilvers, see #9646

DurvalMenezes avatar Dec 02 '19 11:12 DurvalMenezes

ZFS 0.8.2, kernel 5.2.7-arch1-1-ARCH: same problem with a sequential resilver.

Boris-Barboris avatar Jan 07 '20 18:01 Boris-Barboris

I can absolutely understand why the new sequential scrub and resilver behavior may be confusing. The code is working as intended. However, unlike the legacy scrub the sequential scrub design necessitates a tradeoff between maximizing performance and the frequency of on-disk checkpoints.

The default settings lean towards the performance end of the spectrum, which means checkpoints are relatively infrequent (about every 2 hours). This behavior is desirable for large HDD based pools which are rarely exported. It's less ideal for pools which are frequently imported/exported since a new checkpoint is not written when a pool is exported (or a scrub is paused). At a minimum we should update the scrub section of the zpool man page to better explain this.

As mentioned above setting zfs_scan_checkpoint_intval to write checkpoints more frequently may help. Though be aware this isn't a hard limit and depending on your exact pool layout and hardware it may still take significantly longer than this between checkpoints.

The heart of the issue is that for the sequential scrub / resilver to write a checkpoint it must first drain the in-memory scan queues it has built up. To do this, I/O needs to be issued for everything in the queue; depending on the size of the queue and the speed of the scrub, this can take a considerable amount of time (many minutes). This time is in addition to the requested zfs_scan_checkpoint_intval, which is why it works out to about 2 hours (or more) between checkpoints.

It's for this reason that the scan queues are discarded when running zpool export instead of drained. The last on-disk checkpoint is then used for import which is why you can see the overall progress regress or reset. Pausing a scrub will also not result in the queues being drained. Though that functionality could be added with a little development work.

Revisiting the default settings may also be worthwhile. It wouldn't be unreasonable for the scrub to broadly take into consideration your pool geometry and hardware when sizing the memory queues and checkpoint frequency. For example, for an all-SSD pool, where maximizing sequential access isn't as important, a smaller memory footprint for the scan queues and more frequent checkpoints would make sense.
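(For reference, the knobs that currently bias this tradeoff are the checkpoint interval and the scan-queue memory limit, per the module-parameters documentation; a rough sketch run as root, with illustrative values only:)

cat /sys/module/zfs/parameters/zfs_scan_mem_lim_fact              # hard limit on scan-queue memory, as a divisor of RAM (default 20, i.e. 5%)
echo 100 > /sys/module/zfs/parameters/zfs_scan_mem_lim_fact       # e.g. cap the queues at ~1% of RAM so there is less to drain
echo 600 > /sys/module/zfs/parameters/zfs_scan_checkpoint_intval  # and ask for checkpoints every 10 minutes
# Smaller queues mean less sequential scrub I/O, but the drain needed before a checkpoint finishes sooner.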

behlendorf avatar Jan 10 '20 20:01 behlendorf

@behlendorf thanks for sharing this information. However, here the user set zfs_scan_checkpoint_intval to 5 minutes and waited for 20 minutes, and still the scrub restarted after a reboot.

Is it the expected behavior?

Thanks.

shodanshok avatar Jan 11 '20 11:01 shodanshok

I can see an issue with the current design: I have a pool of about 1 PB, where one of the datasets holds 377 T. If a scan is in progress for a scrub or resilver, it stops other write operations to the dataset with the big data. For example: one node was offline for a day and was put back, a scan was started for the resilver, and writes to the dataset were blocked for about 4 hours.

ikozhukhov avatar Jan 11 '20 11:01 ikozhukhov

The confusion is that pause is about stopping I/O, not about draining the queues and committing a checkpoint. The problem with the latter is that it can take a long time, basically depending on the average block size within the given memory limits. For example, a few days ago I scrubbed my FreeNAS box with 794 GiB allocated (they ported the same code with the same default 2h zfs_scan_checkpoint_intval). A simple mirrored vdev, mostly multimedia, so 1 MiB blocks. It took 96 minutes; however, the scanned amount reached the allocated size at just over 30 minutes, when about 300 GiB had been issued. So basically the last hour was spent draining the queue, since the pool was otherwise idle. Even if zfs_scan_checkpoint_intval had been set to anything over 35 minutes, there would have been no scrub checkpoint.
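(The scanned vs. issued split is visible while a scrub runs, so you can watch when scanning has finished and only queue draining remains; e.g., with "tank" as a placeholder pool name:)

watch -n 60 zpool status tank
#   scan: scrub in progress since ...
#         794G scanned at ..., 300G issued at ..., 794G total
# Once "scanned" reaches the total, the remaining time is the in-memory queue being issued (drained).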

scineram avatar Jan 12 '20 10:01 scineram

That's exactly right. I believe part of the confusion here is that we don't currently provide any administrative interface to report on when the last scrub checkpoint was taken. Nor is there a nice way to bias the scrub towards more frequent checkpoints and away from maximizing performance. These are both areas where the UI could be improved.

behlendorf avatar Jan 13 '20 20:01 behlendorf

See also https://github.com/openzfs/zfs/issues/9646#issuecomment-652854973

This bites hard on USB drives which fail to resilver every time, and so makes USB mirrors unusable in this scenario:

ZFS on USB (external) drives user here.

Laptop's outta space, so I'm making use of two external USB drives: the first was created for backups and offloading the internal drive, and the second I attempted to add later as a mirror of the first, but so far I am unable to do so!

The first USB adapter is USB3 and seems to work OK. The second is really cheap, also USB3, but fails consistently after ~580 MB (it failed at about the same point, tested twice now) as follows (syslog):

Jul 02 17:10:35 eye kernel: usb 1-1.2: reset high-speed USB device number 69 using ehci-pci    
Jul 02 17:10:45 eye kernel: usb 1-1.2: device not accepting address 69, error -110
Jul 02 17:10:45 eye kernel: usb 1-1.2: reset high-speed USB device number 69 using ehci-pci
Jul 02 17:10:56 eye kernel: usb 1-1.2: device not accepting address 69, error -110
Jul 02 17:10:56 eye kernel: usb 1-1.2: reset high-speed USB device number 69 using ehci-pci
Jul 02 17:10:58 eye kernel: sd 6:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jul 02 17:10:58 eye kernel: sd 6:0:0:0: [sdb] tag#0 CDB: Write(10) 2a 00 ce a3 2c 88 00 00 80 00                        
Jul 02 17:10:58 eye kernel: blk_update_request: I/O error, dev sdb, sector 3466800264 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
Jul 02 17:10:58 eye kernel: zio pool=zb2t01 vdev=/dev/disk/by-id/usb-WDC_WD20_SPZX-22UA7T0_0000000000016742-0:0-part1 error=5 type=2 offset=1774999703552 size=1048576 flags=40080caa

So at least the first drive still works, but every reconnect of the second mirror drive restarts a full resilver, which always fails - after nearly 7 hours of resilver!

Also wrote my steps as part of ZFS tutorials here: https://github.com/zenaan/quick-fixes-ftfw/blob/master/zfs/zfs.md#user-content-step-5b---clear-resilvering-errors

This "restart resilver on every interruption" ZFS bug, may be considered a sort of ultra conservative, ultra paranoid thing.

But the fact is that the ultra paranoid can run a scrub after an interrupted resilver later finishes, and ZFS could anyway (possibly) auto schedule a scrub if the resilver has been interrupted at all...

The irony of this "ultra paranoid" ZFS behaviour is that it is not a usable filesystem on USB drives which might be connected through cheap USB adaptors.

And the sweet irony of reverting ZFS to its previous (slightly less paranoid) behaviour, is that it will instantly become the ONLY safe filesystem to use in such situations.

So +1 for reverting this behaviour and allowing interrupted resilvers to continue when the drive is reconnected.

zenaan avatar Jul 04 '20 03:07 zenaan

@zenaan Slightly off-topic, but you could try to use the usb_storage driver instead of the UAS driver for the failing disk. I had a USB3 disk failing in a similar fashion and blacklisting the UAS driver fixed it for me. You can do this by adding usb_storage.quirks=xxxx:yyyy:u to your kernel options or a corresponding /etc/modprobe.d entry. xxxx:yyyy is the vendor and product ID of the disk. You can get them with lsusb.
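(A sketch of the modprobe.d variant with placeholder IDs; replace 1234:5678 with the vendor:product ID that lsusb reports for the affected adapter:)

# /etc/modprobe.d/usb-storage-quirks.conf
options usb-storage quirks=1234:5678:u    # "u" = ignore UAS for this device, fall back to usb-storage
# If usb-storage is loaded from the initramfs, you may need to regenerate it, then reboot or replug the device.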

AttilaFueloep avatar Jul 05 '20 12:07 AttilaFueloep

I think there is something wrong with the sequential resilver, even if @behlendorf says it is expected. zfs_scan_checkpoint_intval should indicate the time between checkpoints, with the scan queue fully drained at each one.

I set zfs_scan_checkpoint_intval to 30. I then added a disk to a single-disk vdev to make a mirror. What I noticed was that the 900 GB was scanned almost immediately, within a few seconds. So from what I understood, ZFS was trying to write these 900 GB before making a checkpoint, even though it would have needed far more than 30 seconds to do so.

Correct me if I'm wrong.

From the wiki I see that there is a parameter called zfs_scan_mem_lim_fact that should limit how much metadata ZFS grabs into memory, forcing ZFS to flush these blocks to disk before scanning new ones.

Correct me if I'm wrong.

At this point I ask:

  1. Why does ZFS scan 900 GB immediately at the start of resilvering?
  2. To honor zfs_scan_checkpoint_intval, shouldn't we allow a checkpoint to be written even if the queue is not empty?

From my point of view, not knowing the actual implementation, making sequential scrub/resilver the default seems a worse solution than the previous algorithm.

Using zfs_scan_legacy solved the problem for now.

Any updates on this matter?

xgiovio avatar Feb 09 '21 23:02 xgiovio

Hello, I have the same problem on my test server with ZFS 2.0.3-1. Scrub resume doesn't work properly. It starts from 0% after a zpool export/import, even if I pause the scrub and wait over 16 hours before exporting the zpool.

Scrub status before zpool export:

[email protected]:~$ zpool status
  pool: Pool-0
 state: ONLINE
  scan: scrub paused since Mon Mar 8 15:16:39 2021
        scrub started on Mon Mar 8 15:16:16 2021
        6.76G scanned, 4.03G issued, 6.76G total
        0B repaired, 59.61% done

Scrub status after zpool import:

  pool: Pool-0
 state: ONLINE
  scan: scrub paused since Mon Mar 8 15:16:39 2021
        scrub started on Mon Mar 8 15:16:16 2021
        0B scanned, 0B issued, 6.76G total
        0B repaired, 0.00% done

adikxpl avatar Mar 09 '21 08:03 adikxpl

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 10 '22 21:03 stale[bot]

@behlendorf wrote:

It's less ideal for pools which are frequently imported/exported since a new checkpoint is not written when a pool is exported (or a scrub is paused). .. Revisiting the default settings may also be worthwhile. It wouldn't be unreasonable for the scrub to broadly take in to consideration your pool geometry and hardware when sizing the memory queues and checkpoint frequency.

Any update on this in general? I ask sort of from necessity due to the github "inactivity" setting/bot.

zenaan avatar Mar 10 '22 22:03 zenaan

No update, but I've gone ahead and marked this as "Not Stale" to make sure it stays open.

behlendorf avatar Mar 11 '22 20:03 behlendorf

The heart of the issue is that for sequential scrub / resilver to write a checkpoint it must first drain the in memory scan queues it has built up.

Thanks for the explanation! Though I wonder if you could give a very basic overview (or link to one) of how the sequential scan is actually implemented? This might help us understand the problem better and come up with ideas for "fixes" or improvements?

If I've understood the basics correctly, legacy scrub just scrubs records in the order it finds them, meaning the scrub can be quite randomised in its order on disk (especially on a long running, fragmented pool) which is why it can hurt performance (lots of random I/O). So where sequential differs is that it scans all the record metadata first and somehow builds up a list (or chunks of a list) of records in the order that they appear on disk, so it can scrub in fewer, linear passes (assuming no other activity)? Does that sound about right?

With that in mind I'm assuming the first pass doesn't actually build a list of literally every record in the pool, as that would require more memory (especially for very large pools), so I would guess it identifies the sector(s) of the disk it wants to scan, looks for all records located within those, sorts them into order then scrubs those, while it grabs records for the next sector(s) and repeats the process (presumably scanning and sorting in advance of the actual scrubbing)?

What confuses me is why checkpointing the sequential scrub should be complicated, as surely all that needs to be saved is a note of which sector(s) were most recently scrubbed for each disk, so the scrubbing can pick up from there when resumed? I guess I'm just unclear on what the scan queues would actually contain or why flushing them should be necessary before storing a basic note of where it left off?

I'm guessing it has something to do with atomicity, but since checksumming and scrubbing is really a statistics game anyway I wonder how careful we really need to be? The chances of a record being corrupted, and then being missed repeatedly in periodic scrubs that are frequently being paused and resumed, seems extremely low? Obviously that's not acceptable for a resilver, but for a scrub it should matter less if there's a tiny chance that something might be missed on a single pass?

Sorry for the text wall, but it would be useful to know what these queues actually store and why they need to be flushed, as there might be an alternative we can use?

Haravikk avatar Aug 06 '22 16:08 Haravikk

@Haravikk The scrub checkpoint does not include "which sector(s) were most recently scrubbed for each disk". It includes a pool-wide bookmark (objset, object, level and offset) that it has scrubbed up to. This came from the original unsorted scrub. For the sorted scrub this obviously represents only the metadata scan stage, which is still done in that order, but it says nothing about the actual data scrub stage, which is executed fairly chaotically, trying to scan the most sequential data regions found by the metadata scan. For a system with relatively little RAM a checkpoint should be created every couple of hours by flushing the block queue. That is not a problem, because the metadata scan will be done in many small chunks anyway. But if the system has enough RAM to keep all the block pointers, I think it may never create a bookmark at all until it completes the whole scrub. Maybe I am stretching it a bit, but it is not impossible. By collecting more pointers we make the scrub more sequential (ideally we collect all the pointers), but at the same time we increase the time between possible checkpoints.

amotin avatar Aug 10 '22 02:08 amotin

Just ran into this issue again: I had to restart with a scrub in progress that had been running for 16 hours, and when I re-imported, it resumed the scrub from 0.00%, even though I had my checkpoint interval set to 30 minutes?

If the new scrub ignores the interval, then at the very least we need a way to force a new checkpoint manually, e.g. when we run zpool scrub -p to pause, or zpool export?

Haravikk avatar Sep 14 '22 08:09 Haravikk

+1 to @Haravikk: the new scan code has been totally useless in my use case where I frequently have to export and later reimport the pool before the scrub/resilver is over.

I've set the checkpoint interval to as low as 1 minute, have tried scrub -p before exporting, and followed every other piece of advice provided here, and nothing helps: even when the scrub is over 90% done, when the pool is reimported it restarts from zero (and therefore, in my use case, it never finishes).

The only solution is to turn on legacy scan -- then the scrub resumes from where it was and everything is golden.

What I've done is to set legacy scan on in all my machines and forget about the new scan code, advantageous as it may be for cases where the scrub never needs to be interrupted.

DurvalMenezes avatar Sep 14 '22 12:09 DurvalMenezes

Scrub restart behavior depends on the amount of memory. The more memory the system has, the more block pointers it can accumulate for the scan, and the more time it will take to checkpoint the process. In the extreme case the whole pool may get scanned at once, and a checkpoint won't happen until the scrub completes. Restricting the amount scanned could make checkpointing easier, but then the scrub would be less sequential.

amotin avatar Sep 14 '22 18:09 amotin

@amotin thanks for the additional data. But then, what the heck does zfs_scan_checkpoint_intval do? Is it just a placebo?

Also, why doesn't the effing scan checkpoint get written when the pool is exported? It stands to reason that it should get written then and there at the very least, or else all the work so far just gets lost... just like it's indeed getting lost for me, for @Haravikk and quite a few others just in this thread. And this should be independent of the amount of memory, the relative speed of the CPU, or whatever else, no? Or am I missing something?

Sorry for the rant, but this does get irritating, seeing an obvious issue like that just being dismissed time and again for almost 3 years straight with (AFAICS) no proper reason.

DurvalMenezes avatar Sep 14 '22 20:09 DurvalMenezes