btrfs-progs
feature: include block number and drive number in metadata blocks
Problem Description
Reading in issue #319 that the 2 blocks created by dup for a single block of metadata are identical implies that the location of the block is not included in the block; otherwise, 2 blocks from 2 different locations could never be identical, as they would at least differ in the block number. This means that btrfs cannot detect that a wrong block has been read when the block address is corrupted and the wrong block happens to be a valid metadata block with properties that are consistent with the immediate context. Furthermore, if the OS somehow errs on which drive should return the block, a block from a different drive may be read, and again, if the contents of this block happen to have the right format, btrfs may fail to detect that something is wrong. In both cases, the block read could come not only from a wrong location within the same filesystem but also from a different filesystem.
Suggestion
Within the data that is being checksummed, include the location, component drive ID and filesystem UUID. When a block is read, verify the checksum, location, drive ID and filesystem UUID.
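As a rough illustration only (a hypothetical layout, not the actual btrfs on-disk format), the proposal amounts to checksumming a header like the following together with the block payload, and comparing the stored fields against the location the block was actually read from:

```c
#include <stdint.h>

/* Hypothetical, illustrative layout for the proposal; not the btrfs format. */
struct proposed_block_header {
    uint8_t  csum[32];    /* checksum over all fields below plus the block payload */
    uint8_t  fs_uuid[16]; /* filesystem UUID */
    uint64_t block_nr;    /* on-disk block number the block was written to */
    uint64_t drive_id;    /* component drive the block was written to */
};
/* On read: verify csum, then check fs_uuid, block_nr and drive_id against
 * the filesystem, location and drive the read was actually issued to. */
```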
Workaround
An encryption layer that ties the ciphertext to the location, i.e. uses a location-dependent IV, such as LUKS, helps to detect location errors if they happen below the encryption layer, e.g. on the drive cable or in the drive. However, this does not protect against errors in or above the encryption layer.
Severity
I've never heard of corruption of the location, e.g. with LBA, or of a drive returning a block from a wrong location. Presumably, the location information is heavily protected with a combination of techniques during transmission, e.g. ECC, though how well it is protected will vary with the link type. Also, it seems highly unlikely that the OS would read a block from a wrong location or a wrong drive.
However, the inclusion of cryptographic-strength hash functions as checksum algorithms suggests that (some) users want maximum protection of filesystem metadata. These users would probably welcome it if btrfs could detect that a layer below btrfs has returned a different block than requested.
TODO
This feature request probably needs to go into the kernel bugzilla and/or the btrfs project ideas page.
2 blocks created by dup for a single block of metadata are identical implies that the location of the block is not included in the block
It implies no such thing.
btrfs cannot detect that a wrong block has been read when the block address is corrupted and the wrong block happens to be a valid metadata block with properties that are consistent with the immediate context.
Completely incorrect. btrfs detects this event easily--it happens routinely in the field when underlying storage starts to go bad.
The location of the block is included in the btrfs metadata block header; however, the block location is the block's virtual address, so it's the same address for all copies of the block (or the logical location of data reconstructed from parity in raid56, which has no physical location on the storage because it exists only in memory).
Virtual to physical translation provides the device IDs. Metadata block headers in btrfs already include the filesystem UUID and checksum.
Your proposal is missing a critical component already present in the btrfs metadata header: the transid (a Lamport timestamp which is used to confirm that both ends of a block reference were updated in the same transaction). This prevents the storage device from presenting an old version of metadata pages without being detected (unless it presents a complete old filesystem tree).
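For reference, here is an abridged sketch of the header that starts every btrfs metadata block (the real definition is struct btrfs_header in the kernel and btrfs-progs sources; field order follows the on-disk format, sizes simplified):

```c
#include <stdint.h>

/* Abridged sketch of the btrfs metadata block header (on-disk, little-endian). */
struct btrfs_header_sketch {
    uint8_t  csum[32];            /* checksum of everything after this field */
    uint8_t  fsid[16];            /* filesystem UUID */
    uint64_t bytenr;              /* logical (virtual) address of this block */
    uint64_t flags;
    uint8_t  chunk_tree_uuid[16];
    uint64_t generation;          /* transid: the Lamport timestamp mentioned above */
    uint64_t owner;               /* objectid of the tree this block belongs to */
    uint32_t nritems;
    uint8_t  level;
} __attribute__((packed));
```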
The discussion on #319 is about intentionally changing bits in one of the dup copies to try to force SSD firmware to write the two copies to separate locations in the underlying storage media. That approach is defeated if the SSD firmware implements data compression (as high-endurance SSDs do to reduce write load): a few different bits won't help, because compression will reduce the second copy to a small delta relative to the first copy in the storage media, eliminating redundancy and reducing resilience.
Full encryption of the block is a workaround in #319 because there is no way the SSD can compress the distinct copies without breaking the encryption. This allows metadata to be self-healed from a mirror copy on the SSD if the SSD stores both copies in different physical media locations, and is otherwise healthy enough to read them.
I've never heard of corruption of the location, e.g. with LBA, or of a drive returning a block from a wrong location.
You haven't been around long enough. In the Before Times (2016), spinning disks would occasionally read or write sectors from the wrong track every few billion sectors or so. SSDs have assorted failure modes in their LBA translation tables, especially at the low end of the market. Buses without stable device identifiers (e.g. USB) can switch devices behind a logical device ID during a bus reset or reenumeration event. Device-mapper code has had subtle bugs in past kernel releases. The list goes on...
btrfs can't survive all of these problems, but it can detect them easily.
Thanks. Sounds like this issue can be closed.
Just to check and to help other readers, here is my attempt at understanding what you mean by "the block's virtual address": In the on-disk format description, there is a distinction between
- physical address of this block (different for mirrors)
- Logical address of this node
and https://btrfs.wiki.kernel.org/index.php/Data_Structures shows structures managing "extents" in a blue box. Does this extent structure define the mapping from virtual to physical locations of the filesystem?
Let's consider raid1c3 with 4 drives, 2 with 12 TiB and 2 with 6 TiB. From my limited understanding, I guess the first 4 stripes A to D should look like this:
Disk 1: A B C D
Disk 2: A B C D
Disk 3: A C
Disk 4: B D
Does "virtual address" mean that one part of the address identifies the stripe, e.g. bits 20-64 for 1 MiB extent blocks, which is translated to potentially different physical addresses on each participating drive by the extent data structure, and the remaining bits of the virtual address give the address within the block ("extent item")? The logical address of the first node in stripe B would then be 0x100000 in all 3 copies but the physical address on disks 1 and 2 could be 0x220000 and on disk 4 0x120000, assuming the first 0x20000 are used for superblock and (some of) the extent metadata.
Do I understand correctly, in this picture, that a single disk dup layout looks like this:
Disk 1: A A B B C C D D
or maybe like this (to account for the fact that drive errors tend to cluster):
Disk 1: A B C D [unused space up to middle of disk] A B C D
?
Again, as the virtual addresses are the same in both copies of a block, the copies are identical. Correct?
If so, the failure scenario is reduced to confusing the location with the location of the identical copy, and in that case the error does not matter because we got the correct data nonetheless, except in the rare case that this retrieved copy is corrupted. The latter could be caught by reading all copies one more time when all copies are exhausted, without the need to detect the location error, as it is anyway a good idea to try reading again before reporting an error to the filesystem user. Whether or not this is done is a separate issue.
Extents exist entirely within the virtual address space in btrfs.
The virtual-to-physical mapping is handled by the combination of chunk tree, block_group items, and dev_extent trees (red and orange boxes in the diagrams).
Note that these diagrams are schematic in nature, and do not completely describe all details of the on-disk structure. e.g. neither block groups nor metadata pages are represented on the diagram, only the items they contain.
Does "virtual address" mean that one part of the address identifies the stripe,
The mapping is more arbitrary. The virtual address space is divided into "chunks", which are contiguous regions of virtual address space. Each chunk has an associated RAID profile and a list of stripes (device_id, device_offset pairs) giving the physical location of the chunk(s) on the device(s).
The interpretation of a virtual address within the chunk's virtual address range is specific to the RAID profile. In simple cases like dup, single, and raid1, the physical address on device N is virtual - chunk_start + dev_offset[N]. In more complicated cases like raid5, the calculation takes into account the positions of parity blocks (which are also handled by the RAID profile), and distributes the data blocks in stripes over the devices.
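As an illustration of that simple case only, here is a minimal sketch (hypothetical types and made-up numbers, not the actual kernel code) of resolving a virtual address inside a dup chunk to a physical location for each stripe:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified chunk descriptor for the dup/single/raid1 case. */
struct stripe {
    uint64_t devid;       /* which device this copy lives on */
    uint64_t dev_offset;  /* physical start of the dev_extent on that device */
};

struct chunk {
    uint64_t chunk_start; /* start of the chunk in the virtual address space */
    uint64_t length;      /* chunk length */
    int num_stripes;
    struct stripe stripes[2];
};

/* physical = virtual - chunk_start + dev_offset[N] for mirror N */
static uint64_t map_to_physical(const struct chunk *c, uint64_t virt, int mirror)
{
    return virt - c->chunk_start + c->stripes[mirror].dev_offset;
}

int main(void)
{
    /* A dup chunk: both copies on device 1, dev_extents 1 GiB apart (illustrative). */
    struct chunk c = {
        .chunk_start = 0x40000000ULL,
        .length      = 0x40000000ULL,
        .num_stripes = 2,
        .stripes = {
            { .devid = 1, .dev_offset = 0x00100000ULL },
            { .devid = 1, .dev_offset = 0x40100000ULL },
        },
    };
    uint64_t virt = 0x40200000ULL; /* a metadata block inside the chunk */

    for (int i = 0; i < c.num_stripes; i++)
        printf("copy %d: devid %llu, physical 0x%llx\n", i,
               (unsigned long long)c.stripes[i].devid,
               (unsigned long long)map_to_physical(&c, virt, i));
    return 0;
}
```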
Your first picture for dup is more correct:
Disk 1: [start of disk] [1M unused] A A B B C C D D [unallocated space] [end of disk]
Each of "A" "B" is a chunk's dev_extent. dev_extents can be up to 1GB long, in which case there is a minimum 1GB distance between copies of any particular data block in device LBA space.
it is anyway a good idea to try reading again before reporting an error to the filesystem user.
That's up to how the user configured the devices. Like most filesystems, btrfs assumes all desired retries have been done by the underlying block layer. Note that in mirroring/parity cases, retries are not desirable--btrfs will automatically repair damaged data from mirror copies, so there's no need to try very hard to read each individual copy (certainly not two minutes of retry attempts per device, as consumer drive firmware does).
Thanks for this glimpse at the inner workings. It makes sense that there are more layers to support the needed flexibility.
in mirroring/parity cases, retries are not desirable
I meant to retry after all mirror copies failed. Before that, one would probably not want any retries, so that one reaches a working copy as quickly as possible.
certainly not two minutes of retry attempts per device, as consumer drive firmware does
Does btrfs read asynchronously, e.g. one thread per drive, to be able to switch to a mirror more quickly in this situation?
I meant to retry after all mirror copies failed.
Interesting idea, though it would be a narrow use case (useful only in a low-probability scenario, sandwiched between high-probability scenarios where it's not useful).
Normally "all mirrors failed" means we immediately pull the host out of the service pool for maintenance because it can no longer run its application, and we want the application to know it's down as soon as possible so we don't want delayed errors. It's not a stable state we'd try to keep running in production--drives don't normally get better once they start to fail, but they do get a lot slower, which will cause applications to be too unresponsive to be useful.
Once the hardware is out of production, we might increase retries in the block or device layers and try to scrape off data.
Does btrfs read asynchronously, e.g. one thread per drive, to be able to switch to a mirror more quickly in this situation?
Currently it reads one mirror, and if that fails (either an error from the underlying device, or a csum/transid verification failure) it reads the other mirrors in turn. It doesn't issue reads on both copies at once because that would be extremely wasteful in the normal non-error case.
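A minimal sketch of that "try one mirror, fall back to the others" strategy, under the caveat that read_mirror() and verify_csum_and_transid() are hypothetical stand-ins for the real device I/O and verification paths, not btrfs APIs:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static int read_mirror(uint64_t logical, int mirror_num, void *buf)
{
    /* Stub: a real implementation would map (logical, mirror_num) to a
     * device and physical offset via the chunk tree and issue the read. */
    (void)logical; (void)mirror_num;
    memset(buf, 0, 4096);
    return 0;
}

static bool verify_csum_and_transid(const void *buf, uint64_t expected_gen)
{
    /* Stub: a real implementation would check the header checksum, fsid,
     * bytenr and generation (transid) against the expected values. */
    (void)buf; (void)expected_gen;
    return true;
}

/* Read a metadata block, trying each mirror in turn until one verifies. */
static int read_metadata_block(uint64_t logical, int num_mirrors,
                               uint64_t expected_gen, void *buf)
{
    for (int mirror = 1; mirror <= num_mirrors; mirror++) {
        if (read_mirror(logical, mirror, buf) != 0)
            continue;                /* device-level read error: try the next copy */
        if (verify_csum_and_transid(buf, expected_gen))
            return 0;                /* good copy found (repair of bad copies omitted) */
    }
    return -1;                       /* all mirrors failed or were corrupted */
}

int main(void)
{
    uint8_t buf[4096];
    /* Example: 2 mirrors (e.g. dup or raid1), expecting transid 12345. */
    return read_metadata_block(0x40200000ULL, 2, 12345, buf) ? 1 : 0;
}
```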
In the error case, when there are no backups and all devices are failing, it's a better plan to first transfer the device contents to healthy hardware (with extreme retries) and then run btrfs on the new hardware. Running a filesystem on failing hardware never ends well, and you want to minimize that as much as possible.