bcachefs detects but does not repair file corruption.

Open b-r-o-w-n opened this issue 4 years ago • 8 comments

kernel: 5.7.0-g86fa1258a

tools: v0.1-227-g21ade39

I'm experimenting with a "metadata_replicas=2,data_replicas=2,metadata_replicas_required=2,data_replicas_required=2" setup.

I'm manually corrupting one of the copies of a file (/dev/sda1). The copy on /dev/sdb1 is correct.

The filesystem mounts ok and the file can be dumped and its contents are correct.

The fs detects the corrupted file on /dev/sda1 (it always seems to read /dev/sda1 first) and then uses the good copy on /dev/sdb1.

It generates this dmesg:

[ 621.287148] bcachefs (df0f537e-dcca-443a-82ea-e46bc4ca6f02): IO error on sda1 for data checksum error, inode 4098 offset 0: expected 0:6dfad270 got 0:5227d903 (type 5)

But it never fixes the corruption. I also cannot seem to force it to fix the corruption manually (or I do not know how to). If I unmount the filesystem and remount it, I get the same error on file read (i.e., the corruption was not fixed).

One nice thing would be for some error counters in /sys/fs/bcachefs/...../ to be incremented. If there are error counters, please let me know their names.
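As a stopgap for the missing counters, the checksum errors can be tallied from the kernel log. The following is only a sketch: the regular expression is modelled on the single dmesg line quoted above, so other kernel versions may format the message differently, and `count_checksum_errors` is a hypothetical helper name, not part of bcachefs or its tools.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this issue (assumption:
# other kernel versions may word or order the message differently).
CSUM_ERR = re.compile(
    r"bcachefs \((?P<uuid>[0-9a-f-]+)\): IO error on (?P<dev>\S+) for "
    r"data checksum error, inode (?P<inode>\d+) offset (?P<offset>\d+)"
)


def count_checksum_errors(lines):
    """Count matching checksum-error lines per (filesystem UUID, device)."""
    counts = Counter()
    for line in lines:
        m = CSUM_ERR.search(line)
        if m:
            counts[(m["uuid"], m["dev"])] += 1
    return counts


# The log line from this report, used as sample input.
sample = (
    "[ 621.287148] bcachefs (df0f537e-dcca-443a-82ea-e46bc4ca6f02): "
    "IO error on sda1 for data checksum error, inode 4098 offset 0: "
    "expected 0:6dfad270 got 0:5227d903 (type 5)"
)
for (uuid, dev), n in count_checksum_errors([sample]).items():
    print(f"{uuid} {dev}: {n}")
```

To run it against a live system, one could pass the lines of `dmesg` output to `count_checksum_errors` instead of the sample list.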

thanks

b-r-o-w-n avatar Aug 20 '20 20:08 b-r-o-w-n

Hey, just wanted to let you know that this is in the TODO:

TODO:

  • scrub, repair of replicated data when one of the replicas fails the checksum check - erasure coding needs repair code (it'll do reconstruct reads, but we don't have code to rewrite bad blocks in a stripe yet; this is going to be a hassle until we get backpointers)
  • fsck isn't checking refcounts of reflinked extents yet
  • bcachefs tests in ktest need to be moved to xfstests
  • user docs are still very minimal

Lyamc avatar Nov 02 '20 02:11 Lyamc

@Lyamc This has nothing to do with Erasure codes

I had similar issues.

$ bcachefs show-super /dev/disk/by-id/<blah> 

This showed me that the state of one of my devices was "read only".

I changed it to readwrite with:

$ echo "readwrite" > /sys/fs/bcachefs/<UUID>/dev-<num>/state

Then run:

$ bcachefs fsck -p <disk-0> <disk-1> ...

This should fix the "silent" failures when trying to repair the drive.

The bug report should be called "fsck should work on readonly drives".
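To check the state of every member device in one go before changing anything, something like the following could be used. It is a minimal sketch that assumes the sysfs layout given in the path above (`/sys/fs/bcachefs/<UUID>/dev-<num>/state`), which may differ between kernel versions. `device_states` is a hypothetical helper, and it returns the raw contents of each state file without interpreting them.

```python
from pathlib import Path


def device_states(fs_uuid, sysfs_root="/sys/fs/bcachefs"):
    """Map each dev-<num> directory name to the raw contents of its state file.

    The directory layout is an assumption taken from the path quoted in
    this thread, not from bcachefs documentation.
    """
    states = {}
    for state_file in sorted(Path(sysfs_root, fs_uuid).glob("dev-*/state")):
        states[state_file.parent.name] = state_file.read_text().strip()
    return states
```

The `sysfs_root` parameter exists only so the function can be tried against a scratch directory; on a real system the default would be used together with the filesystem's UUID.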

YellowOnion avatar Apr 13 '21 12:04 YellowOnion

Has any progress been made on this issue?

I'm not sure I agree with "YellowOnion" above, but I'm not qualified to comment as I do not know how it's supposed to work.

In the above test I have a 2-copy fs. When I corrupt the first copy, the second copy is used. No error counters are incremented. It just moves to the next available copy or, if the error is not recoverable, generates an I/O error.

So, the question is: is the design going to fix it on the fly when it's detected, or wait until some kind of "scrub" operation/command is triggered sometime later?

When you hit the checksum error in the code and find the good data you know the good data location and the bad data location.
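To illustrate the idea, here is a toy model of repair-on-read. It is not bcachefs code: the replicas are in-memory byte arrays, CRC32 from Python's `zlib` stands in for whatever checksum the filesystem uses, and `read_with_repair` is a hypothetical name.

```python
import zlib


def read_with_repair(replicas, expected_crc):
    """Return (good_data, repaired_devices) for one extent.

    `replicas` maps a device name to a bytearray standing in for that
    device's copy. Copies are tried in order; every copy that fails the
    checksum before a good one is found gets overwritten with the good data.
    """
    good = None
    bad = []
    for dev, data in replicas.items():
        if zlib.crc32(data) == expected_crc:
            good = bytes(data)
            break
        bad.append(dev)  # checksum mismatch: remember the bad location
    if good is None:
        raise IOError("no replica passed the checksum check")
    for dev in bad:
        replicas[dev][:] = good  # rewrite the bad copy from the good one
    return good, bad


payload = b"hello bcachefs"
crc = zlib.crc32(payload)
replicas = {
    "sda1": bytearray(b"hellX bcachefs"),  # manually corrupted copy
    "sdb1": bytearray(payload),            # intact copy
}
data, repaired = read_with_repair(replicas, crc)
print(repaired, bytes(replicas["sda1"]) == payload)
```

A real implementation would also have to deal with concurrent writers, write failures on the bad device, and copies that were never read; the sketch only shows that both locations are known at the point where the mismatch is detected.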

So, does anyone have a ballpark estimate of when an update might arrive? (No is a valid answer.)

b-r-o-w-n avatar Jul 07 '21 00:07 b-r-o-w-n

No ETA yet - focusing on getting snapshots done and merged. Once that's done I'll be able to look at other things, and this is something that really shouldn't take much work.

koverstreet avatar Jul 07 '21 00:07 koverstreet

Why did it get closed?

b-r-o-w-n avatar Mar 24 '22 02:03 b-r-o-w-n

It was filed against a very outdated version. I'm closing anything that looks like a bug and is over a year old, since the code is in such flux. I'll reopen it, since it looks like we (still) need to implement scrub.

YellowOnion avatar Mar 24 '22 02:03 YellowOnion

Hi,

Is any progress planned here? I'm not sure that this is an enhancement. It's like btrfs without the scrub command: you detect bit rot or corruption and skip to the second or third copy, but offer no tools to fix the problem.

b-r-o-w-n avatar Jan 11 '23 18:01 b-r-o-w-n

Planned, but I've got higher-priority items on my plate right now.

koverstreet avatar Jan 12 '23 04:01 koverstreet