btrfs-progs
`btrfs scrub` doesn't offer an obvious way to find the affected file
When it finds any uncorrectable errors, it seems like btrfs scrub doesn't offer an obvious way to find the affected file. Specifically, this is the output I'm getting:
# btrfs scrub status /
UUID: c7272b0e-d685-41ab-89d4-486ccb593444
Scrub started: Fri Jul 5 07:32:52 2024
Status: finished
Duration: 2:36:56
Total to scrub: 180.68GiB
Rate: 19.78MiB/s
Error summary: csum=1
Corrected: 0
Uncorrectable: 1
Unverified: 0
#
There is no file path here, and also no info on how to get one. Then I checked in dmesg where I found this:
[10094.516628] BTRFS error (device dm-0): unable to fixup (regular) error at logical 192613056512 on dev /dev/mapper/root physical 193796243456
[10546.842111] BTRFS info (device dm-0): scrub: finished on devid 1 with status: 0
As far as I can tell, there is no file path here either and I couldn't find any other output. My apologies if I missed something obvious.
While I get that the idea is that at this point a change of storage device might be in order for the medium term, this seems to leave the user without any short term actionable solution. I think the obvious thing to do in most cases is to identify the damaged file and to delete and/or re-retrieve it from a backup, but this doesn't seem feasible without knowing which file it is. It would be great if any of the generated output gave more obvious pointers for that purpose.
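As a short-term workaround, the logical address in the kernel message can usually be mapped back to file paths with `btrfs inspect-internal logical-resolve`. The sketch below is only an illustration: the `extract_logical` helper is a made-up name, the pattern is taken from the single log line quoted above (other kernel versions may word the message differently), and the loop prints the command instead of running it so it is safe to try anywhere.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print the logical
# address of every scrub "unable to fixup" message, one per line.
# The pattern is an assumption based on the message format quoted above.
extract_logical() {
    sed -n 's/.*unable to fixup ([a-z]*) error at logical \([0-9][0-9]*\) on dev .*/\1/p'
}

# Sample line copied from the report; on a live system you would pipe
# `dmesg` into extract_logical instead.
sample='[10094.516628] BTRFS error (device dm-0): unable to fixup (regular) error at logical 192613056512 on dev /dev/mapper/root physical 193796243456'

printf '%s\n' "$sample" | extract_logical | sort -u | while read -r logical; do
    # Print the command rather than executing it. To resolve for real, run the
    # printed line as root; "/" is the mount point of the affected filesystem.
    echo "btrfs inspect-internal logical-resolve $logical /"
done
```

If the extent is no longer referenced by any file, the resolve step may report nothing useful, which is the case the patch series mentioned below is meant to handle in the kernel message itself.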
Would you mind giving this series a try?
https://lore.kernel.org/linux-btrfs/[email protected]/
It should handle the error message more correctly (even if the corrupted part is no longer accessible from any file, it would show that fact).
Grepping error messages out of dmesg is fun, but ideally scrub would provide a list of affected files directly. dmesg is rate-limited, so if you have a scratch on your disk, you only get a few percent of the filenames in each scrub run, and any kind of automation has to worry about issues like filename quoting, renamed files, multiple paths to the same extent, etc. Scrub really needs a proper error reporting mechanism.
One way to do it: the scrub ioctl could be provided a buffer, and scrub would fill that buffer with the physical address (uint64) or the subvol/inode/offset tuple (3x uint64) of any bad blocks found, similar to the way the LOGICAL_INO ioctl works. Userspace can then look up all the physical addresses to map them to filenames. If the buffer fills up, then scrub pauses, and userspace can resume the scrub after reporting the filenames (possibly in another thread so scrub can keep running with a fresh buffer).
On the other hand...why tie that feature to scrub? Create a sysfs file which gives you error physical addresses in real time, so userspace can do something like:
# error_log is the sysfs file proposed here; it is hypothetical
while read -r paddr type repair; do
    btrfs inspect-internal logical-resolve -o "$paddr" /fs
done < "/sys/fs/btrfs/$UUID/error_log" &
btrfs scrub start -Bd /fs
kill $!
(and then move that shell loop into btrfs scrub start -B as a thread, so it reports bad filenames while it runs).
Doing it via the sysfs file means we don't have to create a new scrub ioctl, and we can handle errors that occur outside of scrub context as well. Each line read from the sysfs file could have fields like:
- the physical address of the error
- the error type (read, write, flush, corruption, generation) as in btrfs dev stats
- the repair result (corrected, uncorrected, readonly) describing whether btrfs tried to repair, successfully repaired, or didn't try to repair.
- the block type (data, metadata, system, or superblock)
It might be more extensible to give each item a tag (paddr=1048576000 type=read repair=corrected). For example, in some cases the subvol/inode/offset fields are more readily available depending on the kernel context, or some new tag could be added later to report issues like whole-stripe raid56 errors or device disconnects.
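To illustrate why the tagged format is easy to consume, here is a minimal sketch of a userspace parser. Everything in it is an assumption: the log file does not exist, `get_tag` is a made-up helper name, and the sample line is the one from the paragraph above. It also assumes values contain no whitespace.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one tag from a
# "key=value key=value ..." line, or return non-zero if the tag is absent.
# The body runs in a subshell so that `set -f` (no globbing) stays local.
get_tag() (
    # $1 = tag name, $2 = one line from the proposed error log
    set -f
    for field in $2; do
        case $field in
            "$1"=*) printf '%s\n' "${field#*=}"; return 0 ;;
        esac
    done
    return 1
)

# Sample line in the format suggested above.
line='paddr=1048576000 type=read repair=corrected'

echo "paddr:  $(get_tag paddr "$line")"
echo "type:   $(get_tag type "$line")"
echo "repair: $(get_tag repair "$line")"
```

Because each field is looked up by name, a reader written this way keeps working if the kernel later adds tags or omits ones that are unavailable in a given context.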
If the file supports blocking reads, then a userspace daemon can passively listen for error events all the time.
I had just run a scrub halfway through, but then had to reboot on a whim. Afterward I ran btrfs scrub status /, which showed 2 errors. However, the kernel log has been cleared, and I have no idea how to see any related info. Is it still possible at all? In summary, I feel like this is another reason why scrub should ideally list the affected paths itself, regardless of whether the kernel log also lists them.
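One thing that may help here, assuming a systemd machine with persistent journaling enabled: the kernel messages of the previous boot can still be read with `journalctl -k -b -1`. The sketch below only shows the filtering step on the two log lines quoted earlier in this thread; `btrfs_errors` is a made-up helper name, and the severity words it matches are an assumption about how the kernel prefixes its btrfs messages.

```shell
#!/bin/sh
# Hypothetical helper: keep only btrfs error-level lines from a kernel log
# read on stdin.
btrfs_errors() {
    grep -E 'BTRFS (error|warning|critical)'
}

# With a persistent journal, the previous boot's kernel log would be read as:
#   journalctl -k -b -1 | btrfs_errors
# Demo on the two lines quoted earlier in this thread:
sample_log='[10094.516628] BTRFS error (device dm-0): unable to fixup (regular) error at logical 192613056512 on dev /dev/mapper/root physical 193796243456
[10546.842111] BTRFS info (device dm-0): scrub: finished on devid 1 with status: 0'

printf '%s\n' "$sample_log" | btrfs_errors
```

If the journal is not persistent, the messages are gone after the reboot, which supports the point that scrub should report the affected paths itself.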