zfs Feature: Ability to repair defective on-disk data

Originally I came up with this in Automated orphan file recovery:

A zfs repair dataset filename < file function to fix on-disk data corruption would be a good thing.

ZFS knows the checksums of the on-disk data, so it should be able to locate bad blocks in a file (even in a snapshot) and rewrite them in place with data that has been verified as good against the known checksums. No block pointer rewrite or anything, just restoring damaged sectors on disk to the contents they should have (so the physical drives will remap them in case the sectors are pending reallocation).

I would guess this would be welcomed by quite a few people, as it could be used to recover from data errors (using a backup of the affected files from that point in time) without having to destroy all snapshots referencing the bad blocks.

Sep 15 '18 19:09 GregorKopka

Good idea.

The file from the backup should also be checked against the other, known-good parts of the file to be fixed. That way we would have some confidence that it's the right version of the backup file. This does add complexity, since the destination file may be compressed, deduplicated and encrypted.

I think we should consider putting serious recovery functions like this one in a separate command, perhaps zmaint repair.

Sep 16 '18 18:09 Lady-Galadriel

The check of the supplied data would be implicit, as a successful repair could only happen if the still existing on-disk checksum matches the supplied file contents.

I honestly don't care how the command would be called in the end, the syntax was mainly for illustration.

Sep 16 '18 19:09 GregorKopka

I don't understand why you have to destroy snapshots. Can you explain?

Sep 16 '18 20:09 richardelling

@GregorKopka One big problem with functionality like this would be that when a checksum fails on a data block, you know that either the checksum or the data block is wrong, but not necessarily which one.

Similarly, since the checksums are on the blocks as-written, and it's completely possible for you to compress a block with two different implementations of the same algorithm and get different results while still decompressing to the same block, you're not really going to be able to sanity-check this in any useful fashion, just run along and clobber all the blocks with the copies you're feeding in. (Not to mention you'd probably be clobbering the blocks in-place rather than CoW like everything else, and that's such a horrific can of worms.) You'd also really want to not pass the replacement data in via pipe, because then you can't have a sanity check that you're passing in data with the same expected length as the thing you're trying to repair.

@richardelling If you have a file in 30 snapshots that has a data block mangled, you get to nuke the snapshots if you want the error to go away. I believe the proposal is for the ability to hand something a copy that you promise is the intact version of the exact file and in-place overwrite it.

Sep 16 '18 21:09 rincebrain

It is absurd to delete data just to make an error message go away. For an operations team, just annotate as an exception (we don't care about this particular error message ever again). This works fine for the use case because the original data still exists therefore this is not a data loss event.

Sep 16 '18 21:09 richardelling

@rincebrain Metadata is always redundant (at least one more copy than the data has), so a defect there is less likely; it would also lead to a defect that can only be solved by destroying the dataset/pool. This feature is purely aimed at repairing errors in on-disk data (in the sense of !=metadata).

As zfs tracks how blocks were written (which compression was used), it can use the same mode to process the replacement data and will end up with the correct checksum when the replacement data equals the original. The feature would only rewrite blocks (in-place) when 1. they can't be read from the pool and 2. the replacement checksum matches. Hence passing shorter/wrong replacement data in (with a pipe or by whatever means) wouldn't be a problem: should you feed the wrong data, nothing would happen (checksum mismatch); should the replacement data be short, it would only repair on-disk defects while it lasts and leave the rest unprocessed.
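The two conditions above can be sketched roughly as follows. This is only an illustration, not ZFS code: the function names are made up, SHA-256 stands in for whatever checksum the block pointer records, and compression and encryption are ignored (a real implementation would first have to re-encode the replacement data exactly the way the block was originally written).

```python
import hashlib


def checksum(block: bytes) -> bytes:
    # Assumption: SHA-256 as a stand-in for the checksum kept in the block pointer.
    return hashlib.sha256(block).digest()


def repair_blocks(on_disk, expected, replacement):
    """Overwrite damaged blocks in place, but only with verified data.

    on_disk:     list of blocks as currently read from the pool (may be damaged)
    expected:    list of checksums recorded for those blocks
    replacement: list of candidate blocks from the backup (may be shorter)
    Returns the indices of the blocks that were repaired.
    """
    repaired = []
    for i, want in enumerate(expected):
        if checksum(on_disk[i]) == want:
            continue  # condition 1 not met: block is healthy, never touch it
        if i >= len(replacement):
            break  # replacement data ran out, leave the rest unprocessed
        if checksum(replacement[i]) == want:
            on_disk[i] = replacement[i]  # condition 2 met: verified, rewrite
            repaired.append(i)
    return repaired
```

Feeding wrong data to this sketch changes nothing, since no candidate block passes the checksum comparison.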

@richardelling while you might have a point, it doesn't account for users able to access the snapshot scheme - these can react quite differently than professionals when confronted with being unable to read a file. Surely the admin could restore the dataset from a zfs based backup (that kept the snapshot chain), but it would be way easier when such a defect could simply be repaired in-place.

I brought this up as I expect the problem to come up more often in the future as small (non-redundant) pools (in clients, small system backups on USB drives, ...) are likely to get more common. Having 'destroy and restore from backup' as the only remedy isn't a very good selling point.

Sep 17 '18 06:09 GregorKopka

This is essentially how the self-healing functionality works today, with the exception that the replacement data is being provided by the user. There are some complications with encryption, compression, and deduplication as mentioned above, but none of them should be deal breakers. In the worst case, when the checksums differ, the attempted repair would fail. Limiting the repair functionality to level 0 data blocks would also be a good idea for safety.

Sep 17 '18 22:09 behlendorf

Limiting the repair functionality to level 0 data blocks would also be a good idea for safety.

Sorry? Do you have a link toward some information about data block levels, please?

Sep 17 '18 23:09 GregorKopka

It is true that if you have the original file, then you'll know the correct data and with the block pointer we know the compression and DVAs. So it is clearly possible to do this.

But... some devices fail such that the LBA is not correctable. An obvious case is a disk with a full defect list and the data cannot be remapped. This can and does happen more often than one would think. So it is not clear that this would work, and you'll be back to my point: ignoring the error message is a better strategy than destroying data.

Sep 18 '18 00:09 richardelling

@richardelling I think everyone agrees that ignoring the error is better than destroying data. Do you agree that fixing the data (when possible) is better than ignoring the error? If so, there isn't really any disagreement here.

Sep 18 '18 00:09 rlaager

bugs notwithstanding.

The CLI design will be challenging, as will making it completely idiot-proof.

Sep 18 '18 00:09 richardelling

a link toward some information about data block levels, please?

It's a little old but still accurate http://www.giis.co.in/Zfs_ondiskformat.pdf, you want to look at pages 24 and 25. The specific concern is that because ZFS will trust the block contents as long as the checksum matches, we shouldn't allow it to overwrite any internal metadata. User file data will always be stored in the level 0 blocks.

Sep 18 '18 00:09 behlendorf

The CLI design will be challenging, as will making it completely idiot-proof.

There are a few suggested features for repair operations which would be reasonable to add to a new user space tool. Such a tool wouldn't need to be completely idiot-proof, but it would be nice to have when you are otherwise out of options.

#3111 - Offline spacemap rebuild.
#6209 - Offline pool scrubbing
#7912 - Manual data repair
#2510, #4187 - Manual label modification

Sep 18 '18 00:09 behlendorf

The specific concern is that because ZFS will trust the block contents as long as the checksum matches, we shouldn't allow it to overwrite any internal metadata.

Yes. I intended it to rewrite only the checksum failed (level 0, as I now know they are called) data blocks of the defective file with the supplied replacement data, but only if this leads to a correct checksum for the rewritten block.

Feeding non-matching data thus can't lead to an on-disk change, should be idiot-proof.

Metadata will not be written/modified at all.

Sep 18 '18 01:09 GregorKopka

In case of a disk read error with no self-healing or backup data to restore from, I would like to have a command that forces ZFS to zero out the bad block (causing the disk to remap the sector) and to recalculate and update the checksum. Or ZFS could do the remapping at a higher level, which may be better if there is free disk space, as on-disk remapping is limited to the number of built-in reserved sectors.

Of course doing this would corrupt the file (but it is already corrupted at that point anyway). However, many file formats can handle/recover from file corruption at the application level, like video files where playback would just blip and move on. Yet with the current behavior, where ZFS throws a disk read error, some applications just stop.

Currently, even if I accept partial corruption of the file, there is no way that I know of to clear a bad sector from a file / ZFS (without mirror, raidz) except deleting the file and all related snapshots.

Even if I zero out the bad sector manually using dd/hdparm, forcing the disk to remap, the ZFS checksum is still wrong and ZFS still errors out.

Oct 11 '18 19:10 aletus

@aletus it is not required to change ZFS to implement this functionality. It can all be done in userland.

Oct 11 '18 21:10 richardelling

Hi @richardelling would you give some direction on how to accomplish this in userland?

As mentioned, a direct dd/hdparm write to the disk does not update the checksum, which causes scrub errors. And even if I am able to identify the byte offset into the file where the bad sector is and force a write into that section of the file, ZFS COW would actually redirect that write somewhere else, and all the snapshots would still be corrupted and show errors on scrubbing.

Am I misunderstanding how this works?

Oct 12 '18 03:10 aletus

@aletus You cannot update a checksum (or by extension anything else) in an existing ZFS block. That is fundamentally part of how ZFS works. The change being requested here is to allow the user to provide the correct data, which matches the checksum. That is possible. What you are suggesting is (within reason) not.

Oct 12 '18 03:10 rlaager

Hi @rlaager. Now I am even more confused.

@richardelling mentioned it is possible to do this in userland without any change in ZFS, although he didn't mention how. And you are saying it is not possible even within ZFS.

Oct 12 '18 23:10 aletus

@aletus simply find the block in the file that is corrupted (dd can do this) and write a new block (dd can also do this). Step and repeat until dd passes.
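For the "find the block" half, a rough Python equivalent of stepping through the file with dd might look like the sketch below. The function name is made up, the 128KB default is only an assumption about the dataset's recordsize, and it assumes the failed read surfaces as EIO (os.pread is only available on Unix-like systems).

```python
import errno
import os


def find_unreadable_blocks(path, block_size=128 * 1024):
    """Return the file offsets of blocks whose read fails with EIO."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        for offset in range(0, size, block_size):
            try:
                os.pread(fd, block_size, offset)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise  # some other problem, do not hide it
                bad.append(offset)  # remember the block and keep scanning
    finally:
        os.close(fd)
    return bad
```

This only locates the damaged ranges; it does not write anything.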

Oct 13 '18 01:10 richardelling

@richardelling that won't overwrite the old mangled block in historical copies, though, it'll just allocate a new one. Or are you suggesting reading until you get EIEIO and then digging through dmesg to see where it's complaining about, then carefully having dd overwrite on the raw device?

Oct 13 '18 01:10 rincebrain

@rincebrain I did what you mentioned: I got the bad sector number from dmesg and wrote the block on the raw device using either dd or hdparm. However, that does not update the ZFS checksum, and I still end up with a read error due to the checksum when I read the file back at the application layer. I don't know any way to avoid that read error.

I really think we have to do this at the ZFS layer and have ZFS update the checksum for the zeroed out block at the same time.

Oct 13 '18 02:10 aletus

@aletus Yes, because if the block was compressed or the location of the read errors was some metadata, or many other things, that wouldn't fly.

I don't think you understand what we're telling you. ZFS really, really does not have a mechanism for mutating extant data in-place, or repointing old things to new modified locations retroactively. So you could either go compute a checksum collision (lol) to overwrite the block with, or provide a valid copy of the data for it to compress and appropriately store.

This thread is about the desire to hand such a valid data source to ZFS from userland. It's not likely to happen that someone will implement a whole indirection layer just so you can get incorrect data out of a file without zpool status reporting issues.

Oct 13 '18 04:10 rincebrain

@rincebrain Understood, what I am asking for is not possible within the ZFS architecture and implementation. The comment from @richardelling had my hopes up :)

Just a summary of my understanding for those who later Google this: if you have disk errors / pending sectors waiting for remap on ZFS and you have no redundancy (mirrors, raidz)... and you also have snapshots, there is currently no way to force a remap of those sectors or to clear the errors from the zpool scrub status, even if you accept the loss of those files. So your scrubs will always show errors and your SMART status will always show pending sectors until you delete all the associated snapshots and the original file.

In my case I have automatic snapshots that get cleaned up after a year, so my theory is to use ddrescue to copy the file to a current "live" copy with the bad blocks zeroed out. This is a new duplicate copy of the file, but it is readable without disk errors, unlike the old one. Then wait a year for the snapshots that contain the reference to the bad blocks and the old file to age out and get deleted.

At that point the zpool status errors should clear up if I understand things correctly.

By the way @rincebrain my understanding is ZFS stores two copies of metadata spread apart so if a bad sector happens to be in the metadata section it should be able to correct itself using the other copy right?

Oct 13 '18 05:10 aletus

@aletus You understand correctly: as long as there is redundancy (either from the vdev being a mirror/raidz or from ZFS maintaining one more copy of metadata than it does for the data it references, see man zfs for the copies dataset property - especially the warning in the last paragraph), on-disk errors can (and will) be repaired the moment they are detected.

That's the reason why metadata errors are less likely (than data block errors) - and more likely to stem not from corruption of a block already sitting on-disk (unless the drive(s) experience massive failures) but from garbage having been written in the first place (corrupted before it hits the storage medium by a defect in code, CPU, RAM, controller, cabling, ...).

Oct 13 '18 06:10 GregorKopka

See here for a rudimentary tool for ideas:

https://github.com/zfsonlinux/zfs/issues/9313

https://www.joyent.com/blog/zfs-forensics-recovering-files-from-a-destroyed-zpool

Sep 11 '19 10:09 zenaan

Hi, I know this issue is old, but I think it's still a thing. I recently fixed errors on my pool this way, so I thought I should share my prototype https://github.com/t-oster/zfs-repair-dataset. If anyone is interested, there is plenty of room for improvements (confirm checksum, skip healthy blocks, handle compression), but on uncompressed datasets on single-vdev pools it seems to work.

Jan 25 '21 05:01 t-oster

I've implemented a corruption healing zfs receive, see https://github.com/openzfs/zfs/pull/9372

Dec 02 '21 21:12 alek-p

I've implemented a corruption healing zfs receive, see #9372

Thank you for your work.

Aug 09 '22 08:08 GregorKopka

I had a corrupted block in a mirror that I managed to manually repair after a couple weeks of tinkering. The block was both compressed and encrypted, which made the process more challenging.

A takeaway from that experience is that for a repair feature like this to work, we might also need tools to inspect the corrupted data on disk. For encrypted datasets, we may need a tool that can decrypt a block (for inspection only) even if the MAC is invalid. To show why these tools would be helpful, let me explain my specific corruption scenario.

Scrubbing the pool revealed a single checksum failure (on both drives) in a file in an old snapshot. The file was a game asset, so it was non-essential data. However, it was in a filesystem with more important data, and there were many snapshots both before and after the corrupted snapshots, so deleting snapshots was not an acceptable solution. Leaving the corruption was also not acceptable (even though that file was deleted in later snapshots) because it caused zfs send to fail, preventing me from backing up that filesystem.

First, I located the corrupted part of the file (using dd to read) and found it was a 128KB block. I had a working snapshot with an older version of this file, and I also downloaded a fresh copy of the game from Steam to get a newer version. I looked at the block before the corrupted block and searched for a matching sequence of bytes in the other files. For both of the other files, I got a match, but at a different offset. I confirmed that the block before and after the corrupted block were present in the known good files, and with a 128KB gap in between, so I surmised that I could use the data in that gap to recover my corrupted block. Note that the recovered data was at an arbitrary, non-aligned offset in the known good file.
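The search I did by hand can be sketched like this. The function name is made up for illustration, and the sketch works on whole files in memory, which is fine for a one-off but not something a real tool would do with large files.

```python
def find_gap_candidate(good: bytes, before: bytes, after: bytes, gap: int):
    """Return the bytes sitting between `before` and `after` in `good`.

    `before` and `after` are the intact neighbours of the corrupted block,
    `gap` is the size of the corrupted block. Returns None when the two
    neighbours are not found exactly `gap` bytes apart.
    """
    start = good.find(before)
    while start != -1:
        gap_start = start + len(before)
        gap_end = gap_start + gap
        if good[gap_end:gap_end + len(after)] == after:
            return good[gap_start:gap_end]  # candidate replacement data
        start = good.find(before, start + 1)  # try the next occurrence
    return None
```

A match here is only a candidate: nothing guarantees the bytes in the gap are identical between file versions until the block's checksum confirms it.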

At this point, I needed to compress the recovered data and encrypt it using the block's parameters. I needed to compress the block exactly as it was compressed before. If the compressed block had even one bit flipped, the checksum and MAC would mismatch, and all I would know is that I got something wrong, but not what. In order to see whether I was getting the compression right, I wanted to decrypt the corrupt block on disk.

I couldn't use zdb to learn more about the block because the filesystem is encrypted, so I patched ZFS to print the block pointer to the debug log on checksum failure. That got me the DVA (so I could locate the encrypted block on disk) and the encryption parameters (IV and salt). I tried to use openssl to decrypt the block, but couldn't get it to work. Ultimately, what worked was patching ZFS again to encrypt a block of zeroes using the corrupted block's parameters. Then, XOR'ing the encrypted zeroes with the corrupted block on disk revealed the corrupted, compressed block.
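The XOR trick works because the AES-CCM and AES-GCM modes encrypt with a counter-mode keystream: the ciphertext is the plaintext XOR'd with that keystream, so encrypting zeroes with the same key and IV yields the keystream itself. A toy demonstration follows, with a SHA-256 based keystream as a stand-in (an assumption for illustration only, this is not AES and not what ZFS does internally):

```python
import hashlib


def toy_keystream(key: bytes, iv: bytes, length: int) -> bytes:
    # Stand-in for the AES counter-mode keystream (illustration only).
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    # Counter-mode style encryption: plaintext XOR keystream.
    return xor_bytes(plaintext, toy_keystream(key, iv, len(plaintext)))


key, iv = b"k" * 32, b"i" * 12
plain = b"compressed block payload" + b"\0" * 8
on_disk = toy_encrypt(key, iv, plain)
encrypted_zeroes = toy_encrypt(key, iv, bytes(len(plain)))

# XOR'ing the encrypted zeroes with the on-disk block reveals the plaintext,
# without ever checking a MAC.
recovered = xor_bytes(on_disk, encrypted_zeroes)
```

Because the operation is byte-wise, a damaged ciphertext byte only damages the same byte of the recovered data, which is why the intact parts of the block were still readable.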

It was obvious that the decryption worked because the block was padded with zeroes at the end. (Plus, it turns out LZ4 looks different than random data, and you can develop an intuition for it by staring at LZ4 long enough!) Using the lz4 utility, I tried compressing the recovered block and comparing the result with the corrupted block. That got me close, but I couldn't get an exact match. It was enough to confirm that the first 4KB of the 12KB compressed block was corrupted. I used the ZFS function for LZ4 to compress the recovered block, and that got me an exact match with the latter 8KB. Finally, I XOR'd that with the encrypted zeroes, wrote that to the block on one disk in a mirror, and watched as a scrub repaired the block on the other disk.

Takeaways and implications for a repair tool:

Simply zfs repair dataset filename < file would not have worked here. I was able to find the contents of the corrupted block, but I had to manually locate it at a different, unaligned offset in another file. To do that, I needed some information from the corrupted file in order to locate the corrupted block in the good file. Alternately, zfs repair could try scanning a block-sized window over the input byte-by-byte.
The file was large, so I was able to locate a matching block of data by looking at blocks before and after the corrupted block. If the file only had a single block (with some bits or sectors corrupted and others intact), it might be necessary to decrypt the block in order to have the context to locate another copy of the corrupted sector.
I was not able to use command-line tools like openssl and lz4 to work with raw ZFS data. A repair tool should support encryption and compression natively.
Recovering a partially corrupted block that is also compressed is not possible (because decompression would fail). Users (or the repair command itself) may still benefit from partially recovering the compressed data.
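The window-scanning idea from the first point could look roughly like the sketch below. It is hypothetical: the function name is made up, SHA-256 stands in for the block's real checksum, and it compares against uncompressed, unencrypted data, so a real tool would have to compress (and encrypt) each candidate with the block's parameters before comparing.

```python
import hashlib


def find_block_by_checksum(data: bytes, block_size: int, want: bytes):
    """Slide a block-sized window over `data` one byte at a time and return
    the first offset whose contents match the wanted checksum, else None."""
    for offset in range(len(data) - block_size + 1):
        window = data[offset:offset + block_size]
        if hashlib.sha256(window).digest() == want:
            return offset
    return None
```

This checksums every window from scratch, so the cost grows with input size times block size. That is probably acceptable for a last-resort repair, but a serious tool would likely narrow the search first, for example with the neighbouring intact blocks.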

Feb 19 '23 17:02 Majiir
