btrfs-progs
`btrfs scrub start -r` tries to write data unless mounted read-only
Happened to me while readonly-checking a recovered md raid.
System information:
# btrfs --version
btrfs-progs v6.12
-EXPERIMENTAL -INJECT -STATIC +LZO +ZSTD +UDEV +FSVERITY +ZONED CRYPTO=builtin
# uname -a
Linux <redacted> 6.12.5-gentoo-dist #1 SMP PREEMPT_DYNAMIC Sun Dec 15 03:17:02 -00 2024 x86_64 Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz GenuineIntel GNU/Linux
This lsblk snip visualizes the block device layers:
NAME                          MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
loop0                           7:0    0  4,5T  0 loop
└─md127                         9:127  0 13,6T  1 raid5
  ├─vg--archive-data--crypt   253:0    0    4T  0 lvm
  │ └─data                    253:3    0    4T  0 crypt /run/media/system/dm-3
Note that md127 was started in read-only mode.
When running `btrfs scrub start -r` on the filesystem on data (mounted rw), the kernel reports attempted writes to the read-only device md127 after about 10G of scrubbed data:
[174366.203678] BTRFS info (device dm-3): first mount of filesystem e18f0c40-88de-413f-9d7e-dcc8136ad6dd
[174366.203691] BTRFS info (device dm-3): using crc32c (crc32c-intel) checksum algorithm
[174366.203696] BTRFS info (device dm-3): using free-space-tree
[174441.289198] BTRFS info (device dm-3): scrub: started on devid 1
[174475.439500] Trying to write to read-only block-device md127
[174475.439546] btrfs_dev_stat_inc_and_print: 362 callbacks suppressed
[174475.439554] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[174475.439588] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[174475.439610] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[174475.439657] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[174475.439693] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[174475.439722] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[174475.439758] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[174475.439787] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
[174475.439815] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
[174475.439852] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
[174475.445886] BTRFS: error (device dm-3) in btrfs_commit_transaction:2523: errno=-5 IO failure (Error while writing out transaction)
[174475.445915] BTRFS info (device dm-3 state E): forced readonly
[174475.445927] BTRFS warning (device dm-3 state E): Skipping commit of aborted transaction.
[174475.445938] BTRFS error (device dm-3 state EA): Transaction aborted (error -5)
[174475.445948] BTRFS: error (device dm-3 state EA) in cleanup_transaction:2017: errno=-5 IO failure
[174475.446157] BTRFS warning (device dm-3 state EA): failed setting block group ro: -5
[174475.446192] BTRFS info (device dm-3 state EA): scrub: not finished on devid 1 with status: -5
Everything's fine when mounted ro.
It is expected that Btrfs tries to write to the block devices, even when mounting ro (log replay, etc). I do not think btrfs can run on a ro block device.
> It is expected that Btrfs tries to write to the block devices, even when mounting ro (log replay, etc). I do not think btrfs can run on a ro block device.
The man-page - btrfs-scrub(8) - about the -r flag:
run in read-only mode, do not attempt to correct anything, can be run on a read-only filesystem
As I wrote, everything's fine when mounted ro. No complaints about writes to a read-only device.
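Since the outcome here depends on whether the filesystem is mounted ro or rw, it can help to check the mount options before starting a scrub. The following is a minimal sketch, not part of btrfs-progs: the function name `mount_is_read_only` is hypothetical, and it assumes the text passed in follows the `/proc/mounts` layout (device, mountpoint, fstype, comma-separated options, ...).

```python
def mount_is_read_only(mounts_text: str, mountpoint: str) -> bool:
    """Return True if `mountpoint` is listed with the `ro` option.

    `mounts_text` is assumed to be in /proc/mounts format. If the
    mountpoint appears more than once, the last entry wins, because
    later mounts shadow earlier ones.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        if fields[1] == mountpoint:
            result = "ro" in fields[3].split(",")
    if result is None:
        raise LookupError(f"{mountpoint} not found in mount table")
    return result
```

On a live system the input would be the contents of `/proc/mounts`, and the mountpoint from the report would be `/run/media/system/dm-3`. Note that `/proc/mounts` escapes spaces in paths as `\040`, which this sketch does not decode.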
There are multiple agents here. The documentation could be clearer.
The scrub is read-only, i.e. errors found in blocks that are read and verified by the scrub ioctl are not corrected.
The filesystem is read-write. Errors have been found while running the scrub, so the device stats are incremented. These updates to the device stats items will be committed in the next transaction, which is what failed in the logs above.
Also, scrub reads the filesystem metadata trees in order to get device maps, extent maps, and data csums for verification. If any of these reads fail, the filesystem will attempt to correct these pages on disk by writing the correct data over the incorrect data.
If any other process reads the filesystem while the scrub is running, the other process is not affected by the -r flag on scrub. If those reads encounter correctable errors, the filesystem will attempt to correct the data and overwrite bad blocks.
Try it with the preferred metadata patches and set up data-only and metadata-only drives. You should see that scrub -r will never write to a data-only drive.
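The device stats counters discussed above show up in the kernel log as `errs: wr N, rd N, flush N, corrupt N, gen N`. As a rough sketch for checking which counter actually moved, the helper below parses one such line; the name `parse_dev_stats` is made up for illustration, and the pattern is derived only from the log lines quoted in this report.

```python
import re

# Pattern matching the "bdev ... errs: ..." lines as they appear in the
# log quoted in this report.
_DEV_STATS = re.compile(
    r"bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_dev_stats(line: str):
    """Return (device path, counters dict), or None if the line does not match."""
    match = _DEV_STATS.search(line)
    if match is None:
        return None
    counters = {
        name: int(value)
        for name, value in match.groupdict().items()
        if name != "bdev"
    }
    return match.group("bdev"), counters
```

Applied to the log in this report, only the `wr` counter increases (1 through 10) while `rd`, `flush`, `corrupt` and `gen` stay at 0.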
That's what I guessed too after finding out I forgot to mount ro the first time. A process running with an ro option causing writes was still scary enough for me to report it.
> The documentation could be clearer.
I agree. While this might be a corner case, I still think it should be noted that the fs could still try to fix things on its own.
First, if scrub finds no errors, it should not trigger any write to the fs. So even if the target block device is RO, as long as no data/metadata/superblock errors are found, scrub itself will not trigger a write.
According to your output, scrub has found no errors so far, so the write was not triggered by scrub itself.
The direct cause is that a transaction needed to be committed, and the commit failed.
The root cause is that scrub works on commit roots, so to avoid writing to and scrubbing the same block group at the same time, we mark the current scrub target block group read-only.
But marking a block group read-only needs to start a transaction and may even force a chunk allocation, which joins or starts a new transaction and causes new metadata to be created and written back. That writeback triggered the error.
That's why scrub provides a read-only mode, which will not try to allocate a chunk (i.e. update the metadata) during scrub.
As for why even an RW scrub is fine when the fs is mounted RO:
That's because btrfs_inc_block_group_ro(), which scrub uses, automatically avoids chunk allocation if the fs is already mounted RO. So even for an RW scrub, as long as no errors are found, everything is fine.
So there is nothing special here and nothing related to any patchset; these are just corner cases of the scrub implementation. The overall rules are:
- RW scrub on RW fs: High chance to write to the fs, no matter if errors are found.
- RW scrub on RO fs: If no errors found, it's the same as RO scrub.
- RO scrub on RO fs: Purely RO.
- RO scrub on RW fs: Scrub itself will not cause any write by itself.
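The four rules above can be written down as a small lookup table. This is only a restatement of the list as given at this point in the thread, not a description of verified kernel behaviour; the names `ScrubCase`, `RULES` and `expected_behaviour` are invented for the sketch, and later comments in the thread qualify the "RO scrub on RW fs" entry.

```python
from typing import NamedTuple


class ScrubCase(NamedTuple):
    scrub_read_only: bool  # scrub started with -r
    fs_read_only: bool     # filesystem mounted ro


# Rules as stated in the comment above, keyed by (scrub mode, mount mode).
RULES = {
    ScrubCase(False, False): "high chance to write to the fs, whether or not errors are found",
    ScrubCase(False, True): "if no errors are found, same as a read-only scrub",
    ScrubCase(True, True): "purely read-only",
    ScrubCase(True, False): "scrub itself will not cause any write",
}


def expected_behaviour(scrub_read_only: bool, fs_read_only: bool) -> str:
    """Look up the stated rule for a scrub mode / mount mode combination."""
    return RULES[ScrubCase(scrub_read_only, fs_read_only)]
```

The setup in this report (`-r` on an rw mount) corresponds to `expected_behaviour(True, False)`.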
And your report matches the first RW scrub on RW fs case, thus write is expected.
> And your report matches the first RW scrub on RW fs case, thus write is expected.
That statement is not true. I clearly stated that I started an RO scrub on an RW fs which resides on an RO device.
Worth mentioning:
I successfully copied all of the FS contents in that setup without triggering the error. Only the scrub (or any intentional write operation) would trigger it.
Since you already closed this issue, I guess you do not deem "RO scrub may cause writes to the underlying device unless mounted RO" worth noting?
OK, the problem is in btrfs_inc_block_group_ro(), which doesn't really honor the scrub RO, but only the fs RO flag.
Thus an RO scrub will trigger a transaction on an RW-mounted fs.
I can add an extra check to avoid this. Although on such RW mounted fs, you may hit -ENOSPC if there is not much space left.
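The decision described in this comment can be illustrated with a toy model. This is not kernel code and does not reproduce btrfs_inc_block_group_ro(); the function `needs_transaction` and its `honor_scrub_ro` parameter are hypothetical and only encode what the comment says: today only the fs RO flag is checked, and the proposed extra check would also consider the scrub RO flag.

```python
def needs_transaction(
    fs_read_only: bool,
    scrub_read_only: bool,
    honor_scrub_ro: bool = False,
) -> bool:
    """Toy model: does marking the block group RO start a transaction?

    honor_scrub_ro=False models the behaviour described in the comment
    (only the fs RO flag is honored); honor_scrub_ro=True models the
    proposed extra check.
    """
    if fs_read_only:
        return False  # RO mount: no transaction, no chunk allocation
    if honor_scrub_ro and scrub_read_only:
        return False  # proposed: also skip for a read-only scrub
    return True
```

With the defaults, `needs_transaction(False, True)` is true, which matches the report: a `-r` scrub on an rw mount still ends up in a transaction.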
> which doesn't really honor the scrub RO, but only the fs RO flag
This sounds unintentional and IMHO deserves to be fixed. Thank you very much!
> Although on such RW mounted fs, you may hit -ENOSPC if there is not much space left.
This seems like a very minor inconvenience.
Unfortunately, handling an RO scrub on an RW mount is not that easy in the code:
- We have to start a transaction. This ensures there are no conflicts between marking the block group RO and writing back the target block group. Thus we hold a transaction handle to prevent the current transaction from being committed until we lock the ro_block_group_mutex.
- We will still update the super blocks even if the current transaction is empty.
So this means that even if we skip the chunk allocation part, we will have an empty transaction to commit and have to update the super block.
But if we skip holding a transaction and continue, we risk a conflict that could corrupt the target block group. The best solution is to make btrfs detect an empty transaction and skip it entirely (i.e. no writes at all), but that will require quite a few changes.
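To make the empty-transaction point concrete, here is a toy model of a commit that always writes the super block, with an optional switch standing in for the "detect empty transaction and skip it" idea. It is a sketch under heavy simplification: `ToyTransaction`, `commit` and `skip_empty` are invented names, and nothing here mirrors the real btrfs_commit_transaction() code path.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyTransaction:
    # Names of modified metadata items; empty means nothing changed.
    dirty_items: List[str] = field(default_factory=list)


def commit(trans: ToyTransaction, skip_empty: bool = False) -> List[str]:
    """Return the list of writes a commit would issue in this toy model."""
    if skip_empty and not trans.dirty_items:
        return []  # proposed behaviour: an empty transaction writes nothing
    writes = [f"metadata:{item}" for item in trans.dirty_items]
    writes.append("superblock")  # written even when nothing else changed
    return writes
```

In this model `commit(ToyTransaction())` still yields one super block write, which is the kind of write a read-only block device rejects, while `commit(ToyTransaction(), skip_empty=True)` yields none.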
I'd go with a doc update for now, to warn about the modification to the fs.