
Safe against power loss?

Open Massimo-B opened this issue 5 months ago • 46 comments

Just a side question about all the low level operations like:

  • bees
  • btrfs balance
  • btrfs-balance-least-used
  • btrfs-extent-same

How safe are these operations against power loss? I don't mean data loss, but is there always a risk of filesystem corruption? Or do I need an "Uninterruptible Power Supply"?

Massimo-B avatar Jul 21 '25 05:07 Massimo-B

btrfs uses a transactional update model (except for raid5 and raid6), so a crash rolls back to the last completed transaction.

If the firmware in the underlying devices is correct, all of the above are safe against power loss.

If the firmware in the underlying devices is not correct, the filesystem can be destroyed by any write operation combined with a power failure.

It's important to have correct firmware. This includes the firmware in USB bridges and SATA HBA interfaces.

Devices that have incorrect firmware are easy to spot, because btrfs becomes unrecoverably corrupted after only a few power failures. Chances are that if you have a device with bad firmware, you know it already.

If you have a raid1 of one device with good firmware, and one device with bad firmware, btrfs can correct the bad device as long as the good device is working. If you see parent transid verify failed messages during a boot after a crash, you should reconfigure (disable write caching) or replace (with a different model) the device with the errors. These messages indicate that one drive in the filesystem is reporting that writes were completed when they were not.
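A quick way to check for and react to this condition, sketched with placeholder device names:

```shell
# Signs of a drive that acknowledged writes it never completed
dmesg | grep -E 'parent transid verify failed|bad tree block'

# Disable the volatile write cache on the suspect drive (SATA example).
# Note: -W0 does not persist across power cycles; reapply via udev or a boot script.
hdparm -W0 /dev/sdX
```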

bees

Bees will repeat some work on an unclean shutdown (same as killing the process with SIGKILL).

btrfs balance btrfs-balance-least-used

These are the same operation. Balance may have to delete a partially filled block group during the mount after a crash, which may take some time.

btrfs-extent-same

The extent-same ioctl performs a data flush on both files just before comparing them to ensure that no matter which reflink is used, both will point to the same data. It is not guaranteed whether the src or dst physical block will be referenced by dst after the crash (i.e. the dedupe may or may not happen).

Mounting with the default noflushoncommit can cause data written after the last completed transaction to be inconsistent if it is partially flushed with fsync(). Mounting with flushoncommit eliminates this problem.
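For reference, flushoncommit is just a mount option; a sketch with placeholder device and mount point:

```shell
# One-off, on an already-mounted filesystem
mount -o remount,flushoncommit /mnt/data

# Persistently, via an /etc/fstab entry:
# UUID=<fs-uuid>  /mnt/data  btrfs  flushoncommit,compress=zstd  0 0
```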

Zygo avatar Jul 21 '25 06:07 Zygo

Thank you for the explanation.

Mounting with the default noflushoncommit can cause data written after the last completed transaction to be inconsistent if it is partially flushed with fsync().

Resulting in data loss or in filesystem corruption?

Massimo-B avatar Jul 21 '25 08:07 Massimo-B

Resulting in data loss or in filesystem corruption?

I think this means "inconsistent data". Metadata itself should be fine, thus no "corruption" of the filesystem itself. But a partially written file, e.g. a database file, may have corrupted contents.

I always disable write caching in the hardware, and I haven't seen btrfs corruption since I started doing this. OTOH, btrfs may just have been fixed in one code path or another since then. Before I disabled the hardware write cache, I had btrfs corruption every now and then, with different disks.

@Zygo said that some disk firmwares may become incredibly slow with write caching turned off; I think that mostly affects some SSDs. For the hardware I am using, I haven't come across such disks yet. In fact, the system feels smoother with the hardware write cache turned off.

# /etc/udev/rules.d/99-local.rules
# hdparm flags: -q quiet, -W0 disable write cache, -B254 APM level,
# -S241 standby timeout, -M254 acoustic management
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{device/model}!="SD/MMC", RUN+="/sbin/hdparm -qW0qB254S241M254 /dev/%k"

kakra avatar Jul 21 '25 11:07 kakra

[noflushoncommit] Resulting in data loss or in filesystem corruption?

Silent data loss and out-of-order writing. The metadata will be consistent with the data as written, so you can't detect the lost data with scrub. After a crash, it will be as if some writes didn't happen, while others did, in no particular order. Overwritten data can come back, there can be holes in the middle of files, etc.

noflushoncommit gives btrfs all the worst parts of ext4's data=writeback, except no confidential information disclosure. i.e. ext4 gives you whatever data was in the disk blocks before the crash, including data from previously deleted files, while btrfs gives you a zero-filled hole or an older version of the specific data that was overwritten. (You might be able to get an information disclosure from nodatacow but that's a different topic...)

The metadata is intact in this situation. The transaction commit cycle works OK. The problem is that not all of the dirty pages in the page cache that were written before the transaction commit are included in the transaction, because noflushoncommit allows that. Those pages are deferred to the next transaction...or the one after that, or the one after that. I've measured deferrals running several hours in bad cases.

flushoncommit disallows write reordering around a transaction commit, so the transaction includes all writes up to the start of the commit.

Mounting with the default noflushoncommit can cause data written after the last completed transaction to be inconsistent if it is partially flushed with fsync().

I said that, and it's not quite correct.

  • fsync creates an inconsistency between all data written with fsync() and without fsync(). The use of flushoncommit or noflushoncommit doesn't matter. It is fsync() which creates the inconsistency, by effectively reordering writes to fsync() files ahead of writes to non-fsync() files.
    • The inconsistency can be prevented by zeroing the log tree before mount, but this solves the inconsistency by forcing all writes to be non-fsync(), i.e. wiping out all fsync() data after the last committed transaction. Consistent is not necessarily better!
    • The inconsistency can also be prevented by mounting with the sync option, but that option has disastrously bad performance. It solves the inconsistency by forcing fsync() after every write, so there's no non-fsync data to be inconsistent with.
  • noflushoncommit makes all data written without fsync() possibly inconsistent after a crash (in addition to inconsistency with fsync() data), i.e. writes are lost on both sides of the last transaction.
  • flushoncommit without fsync() is the most consistent, though it only updates the disk once per commit. i.e. up to the last 30 seconds of writes will be lost after a crash, assuming reasonable load and default commit options. The order of the writes doesn't change, so there will be some cutoff time T where all writes before T are committed while all writes after T are not.
  • fsync() in this context includes anything that uses the log tree: fdatasync, the dedupe and clone ioctls, rename, and close relatives of those functions.

It's very easy to use flushoncommit. It's somewhat harder to avoid fsync(). Tools like eatmydata can help by preventing fsync calls, but it's very difficult to avoid rename, and bees generates a lot of dedupe calls.
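As a sketch of the eatmydata approach (paths are placeholders): it LD_PRELOADs a library that turns fsync()/fdatasync() into no-ops for the wrapped process, so only transaction commits hit the disk:

```shell
# Copy a tree without any fsync() traffic from rsync itself
eatmydata rsync -a /source/ /mnt/backup/
```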

For a backup server using btrfs send or rsync, flushoncommit alone provides adequate guarantees, as these tools don't use fsync() and they are designed to handle what btrfs does with clone and rename.

The real problems occur when you have applications that fsync() some of their data and don't fsync() other data. Applications which do that are usually hardened against data inconsistency since it's much worse on other filesystems.

@Zygo said that some disk firmwares may become incredibly slow with write caching turned off

There's a few cases:

  • If the device is a CMR spinner that doesn't do NCQ, then disabling write cache has a negligible effect on performance unless you're doing a lot of fsync(). This is because these drives are already very slow even with write cache enabled. Disabling the write cache makes them only a little slower. Most of the drives where I've had to disable write cache fall into this category--desktop, gaming, and low-end NAS drives.
  • Some WD spinning drives (particularly the new and large ones) have write caching that can't be disabled in firmware (there are two levels, only one can be turned off). Hope the firmware works on those! The good news is that a failure at the lower write cache level can brick the drive, so the vendor has an incentive to avoid warranty claims and improve firmware quality. The bad news is that disabling the upper write cache level on these drives might make the drive much slower, with no improvement of reliability.
  • SSDs and even NVMes often have volatile write caches in DRAM. Don't assume that solid state = working firmware. Any change in firmware configuration for a SSD can have unpredictable effects. Sometimes one of these effects might be a performance decrease.
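On NVMe, the volatile write cache is a standard optional feature (Feature Identifier 0x06) that nvme-cli can query and, if the drive permits, disable; a sketch with a placeholder device:

```shell
# Query the Volatile Write Cache feature
nvme get-feature /dev/nvme0 -f 0x06

# Try to disable it (not all drives support this, and it may not persist
# across power cycles)
nvme set-feature /dev/nvme0 -f 0x06 -v 0
```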

Zygo avatar Jul 21 '25 17:07 Zygo

fsync() in this context includes anything that uses the log tree: fdatasync, the dedupe and clone ioctls, rename, and close relatives of those functions.

How does notreelog,flushoncommit + fsync/dedupe/rename/... behave in comparison?

manxorist avatar Jul 22 '25 09:07 manxorist

How does notreelog,flushoncommit + fsync/dedupe/rename/... behave in comparison?

notreelog turns almost all fsync() operations into syncfs() operations, i.e. it makes almost every fsync into a full transaction commit. That would avoid inconsistency, at the expense of performance, but not as much performance as mounting with -o sync (which follows every write with a fsync() operation). notreelog,sync doesn't provide much additional benefit and is even slower than sync.

Note that I haven't tested notreelog, so I don't have any real data about how effective it is. I do notice that in some cases the notreelog option is evaluated but the result is ignored in fs/btrfs/tree-log.c (hence the qualifier almost in the above). I wouldn't rely on it without running it through some thorough testing first. It has been less than 30 days since the last known tree-log bug was fixed for fairly common workloads with default configuration, so the risk of encountering a corner case in an unusually configured workload is high.

Zygo avatar Jul 22 '25 16:07 Zygo

Ok, let me summarize. flushoncommit provides a bit more safety, while not reducing performance as much as sync. notreelog provides additional safety, also without affecting performance as much as sync. notreelog,sync would provide the most safety, with the biggest performance impact.

So just adding flushoncommit could be a workaround on weak hardware?

Within a few weeks I ran again into a "bad tree block" failure on an NVMe USB bridge due to a power loss. Of course the device was running some balancing or bees process at that time. https://lore.kernel.org/linux-btrfs/[email protected]/T/

# lsusb |grep Storage
Bus 002 Device 002: ID 0bda:9220 Realtek Semiconductor Corp. Ugreen Storage Device

# fdisk -l /dev/sdb |grep model
Disk model:  SN850X 4000GB 

# hdparm -W /dev/sdb
/dev/sdb:
 write-caching = not supported

The machine also has more btrfs on local NVMe and HDD. Those drives were not damaged yet by power losses. So I guess the USB bridge is not optimal.

Realtek RTL9220 is reported to be suboptimal for btrfs: not supporting discard/trim, buggy write cache, etc. True? Would some of these perform better: ASM2362, JMS578, JMS583? Would flushoncommit be a workaround on this RTL chipset?

Massimo-B avatar Jul 26 '25 10:07 Massimo-B

So just adding flushoncommit could be a workaround on weak hardware?

There is no btrfs option that can help with weak hardware, i.e. hardware with firmware that performs incorrect write ordering. You can only fix that by upgrading the firmware or switching to a different model (one without a write-order bug).

There is one btrfs option that can make things worse: nobarrier prevents btrfs from issuing write order commands to the drive. This is useful if you don't need the filesystem to survive a reboot, e.g. for some use case where you'd use tmpfs, but you need compression, reflinks, or snapshots, and you don't want it to be slowed down by superblock writes. nobarrier won't survive a reboot because it tells the drive it can write in any order it wants to, even if that would destroy the filesystem.

The effect of weak hardware on btrfs is similar to nobarrier: the hardware reorders writes, even when btrfs tells it not to. The end result is the same: filesystem destroyed, because the device failed to write metadata page updates, but did not fail to update the superblocks--a forbidden rearrangement of write order.

All the flushoncommit etc. options do is change what file data btrfs includes in tree updates (and which trees store the updates). None of those options will allow a metadata tree to be incomplete or incorrect. They only select which data is included in the tree update.

..."bad tree block" failure...

bad tree block comes from the hardware damaging the tree by omitting or corrupting writes. Also parent transid verify failed, bad tree level, page csum failure, and wrong metadata uuid. They are usually indicators of hardware failure including firmware bugs (although there are some exceptions).

Realtek RTL9220 is reported to be suboptimal for btrfs: not supporting discard/trim, buggy write cache, etc. True? Would some of these perform better: ASM2362, JMS578, JMS583?

Impossible to tell. It will depend on the chipset and what firmware it's running--which is determined by the vendor that made the board, not the chip. Working chipsets can be compromised by embedded RAM failure, board design errors, and power supply quality.

You can use btrfs to determine which chipsets work. Put one drive on each chipset, and create a btrfs raid1 of the two drives. If the chips have several ports or port multipliers, use mdadm to create a raid0 array of all the drives on the same controller, then make a btrfs raid1 of the two mdadm raid0 devices.

That way, if controller 1 fails, btrfs restores the corrupted data from the copy on controller 2, and btrfs tells you which controllers are bad and which are good. Replace bad controllers with different models whenever btrfs reports an error. When btrfs stops reporting errors, you have found two working controllers.
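The arrangement above might look like this, assuming four drives behind each of two controllers (all device names and the mount point are placeholders):

```shell
# One raid0 per controller, so btrfs sees each controller as a single device
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sd[abcd]
mdadm --create /dev/md1 --level=0 --raid-devices=4 /dev/sd[efgh]

# btrfs raid1 across the two arrays, for both data and metadata
mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
mount /dev/md0 /mnt/test

# After each test cycle, per-device error counters show the bad side
btrfs device stats /mnt/test
```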

Would flushoncommit be a workaround on this RTL chipset?

No, because flushoncommit has no effect on metadata integrity. The only fix for a bad bridge chipset is to switch to a good chipset.

Zygo avatar Jul 27 '25 03:07 Zygo

Why did I use the bridge for a while without issues and only had these issues on power loss incidents? If wrong write order is the bug, then the bridge should continuously lead to corruption? Doesn't that mean the device has a bad buffering on power loss, or is that up to the NVMe itself?

Nice approach with the RAID set. Why do the RAID1 with btrfs and not mdadm? Why do I need to use all ports of a multiplier as RAID0, and not only use one of the ports, or do a RAID1 over all disks on all chipsets A and B on all ports, like A1, A2, B1, B2? So for testing I could use my internal, approved and safe disk + my USB test target as RAID1? But I still need a reproducible test leading to the corruption; for now this is only a power loss.

Massimo-B avatar Jul 27 '25 10:07 Massimo-B

If wrong write order is the bug, then the bridge should continuously lead to corruption?

Because reordering is only an issue when some later writes are lost; such reordering doesn't change the semantics as long as they are all completed.

lilydjwg avatar Jul 27 '25 11:07 lilydjwg

Why do the RAID1 with btrfs and not mdadm?

mdadm raid1:

  • won't correct errors on the corrupted drive/bridge side
  • will copy errors from the bad side to the good side 50% of the time
  • won't inform you which side is the bad one

With mdadm you'll see corrupted data if you check for it, but have no way to know which device caused the problem. btrfs raid1 reports the failing side, and if an intact copy of the data is available, corrects the failing side from the good side.
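A scrub makes that behavior visible: it reads every copy, reports errors per device, and repairs a bad copy from the good one (the mount point is a placeholder):

```shell
# -B: stay in the foreground, -d: print statistics for each device separately
btrfs scrub start -Bd /mnt
```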

This means you can recover if you build a system using (at most) one bad device, without having to mkfs and restore from backups every time you identify a bad device. Indeed, this setup is a reasonable build for production use, so even when you're running vetted hardware you can keep verifying its correctness for the entire time you use it. This continuous verification is important for detecting hardware failures such as a failing power supply, which can make good hardware behave badly.

Why do I need to use all ports of a multiplier as RAID0, and not only use one of the ports, or do a RAID1 over all disks on all chipsets A and B on all ports, like A1, A2, B1, B2?

Port multipliers put multiple drives behind a single SATA interface. If one of the drives fails, it can disrupt all other drives on the same interface, leading to a multiple drive failure event that btrfs raid1 can't recover from.

If there's only one drive on a port multiplier then it doesn't matter, since the failure domain contains only one device. If you have two drives on a port multiplier, btrfs must consider them to be a single device, which means aggregating the drives together at the block layer. You can do that with non-redundant mdadm raid profiles or LVM LVs.

Put another way: btrfs raid1 can recover from one device failure at the btrfs level, so you must arrange the devices you're testing so that btrfs sees only two devices.

If you have 3 or 4 devices to test at once, you can arrange them as 3 or 4 btrfs devices with raid1c3 or raid1c4, respectively; however, this is not very economical to run as a production setup.

If you're running a test setup and you only care about detection, not recovery, you can put all the devices in btrfs raid0. This will force btrfs to use all the devices, and it can detect and report failing devices, but it won't be able to recover the data. It's something you'd do if you had e.g. 4 chipset suppliers, and you wanted to know which ones to buy 100 units of.

Zygo avatar Jul 27 '25 20:07 Zygo

Doesn't that mean the device has a bad buffering on power loss, or is that up to the NVMe itself?

Either could cause the issue by itself, and they could both be bad.

You'd have to get two (or more) bridge models and two drive models, and test all the combinations. If the drive is bad, no bridge can help with it. If the drive is OK but the bridge is bad, you can make the drive work by swapping out the bridge.

Zygo avatar Jul 27 '25 21:07 Zygo

mdadm raid1: * won't correct errors on the corrupted drive/bridge side * will copy errors from the bad side to the good side 50% of the time

Is that because only btrfs knows the correct write order and detects misordering? That would mean only btrfs RAID1 is recommended, since mdadm can't detect this kind of error.

This means you can recover if you build a system using (at most) one bad device

Why does recovering only work with at most 1 bad device? If raid0 on 4 devices can detect 3 bad devices, why can't raid1 on 4 devices with 3 bad ones recover from the 1 good device?

Getting a write order bug corrupting the filesystem would only happen accidentally when unplugging power (unplugging USB) while writing? And I would need to repeat it several times to be sure a device is good? Because only an interrupted incomplete write action would reveal the wrong order, it can't be detected at once while writing?

Massimo-B avatar Jul 28 '25 08:07 Massimo-B

Why does recovering only work with at most 1 bad device? If raid0 on 4 devices can detect 3 bad devices, why can't raid1 on 4 devices with 3 bad ones recover from the 1 good device?

btrfs RAID1 stores only 2 mirrors of the data and distributes them across the 4 devices. Thus, only one device may go bad before you lose data. But there are RAID1c3 and RAID1c4, which mirror data to 3 or 4 devices respectively.

kakra avatar Jul 28 '25 08:07 kakra

Why is btrfs more affected by these bugs than other filesystems?

Massimo-B avatar Jul 28 '25 11:07 Massimo-B

Why is btrfs more affected by these bugs than other filesystems?

Because btrfs screams when an error is detected, and it can detect errors via checksums and other means. Ext4 tries hard to recover from corrupted metadata and gives out corrupted data instead of erroring out. (You are asked to run fsck when ext4 can't recover by itself.)

Quote from a friend: with ext4 on an SD card, every time the system booted up, several files were lost; it worked until one day libc.so.6 disappeared.

lilydjwg avatar Jul 28 '25 11:07 lilydjwg

Ok, because in order to get a refund for all the purchased faulty bridges: the support says, if it works with exFAT, then it's not broken :)

Massimo-B avatar Jul 28 '25 11:07 Massimo-B

Is that because only btrfs knows the correct write order and detects misordering?

btrfs detects lost writes and corrupted data in general, and does not rely on devices for this detection. btrfs can detect device errors after they occur.

mdadm detects only those IO errors that are explicitly reported by the block device, and only at the time when the errors occur.

That would mean only btrfs RAID1 is recommended, since mdadm can't detect this kind of error.

That is correct in the general case. Whenever a straight choice between btrfs raid1 and mdadm raid1 or hardware raid1 exists, always use btrfs raid1.

This means you can recover if you build a system using (at most) one bad device

Why does recovering only work with at most 1 bad device? If raid0 on 4 devices can detect 3 bad devices, why can't raid1 on 4 devices with 3 bad ones recover from the 1 good device?

If the bridge chip supports multiple downstream devices, and the bridge chip is bad, it will cause corruption on all connected devices at the same time. btrfs can only recover from one failure at a time with raid1. In this arrangement, multiple btrfs devices fail and the filesystem cannot be recovered.

  • bad-bridge - btrfs-device-1(sda), btrfs-device-2(sdb), btrfs-device-3(sdc), btrfs-device-4(sdd)
  • good-bridge - btrfs-device-5(sde), btrfs-device-6(sdf), btrfs-device-7(sdg), btrfs-device-8(sdh)

In the above setup, when bad-bridge fails, there are simultaneous failures on btrfs devices 1 through 4. This is not recoverable, even with raid1c4.

Using raid0 allows all the downstream devices to be used. If the bridge chip fails, the corruption still happens on all connected devices at the same time; however, btrfs sees all of them as a single device, so only one device has failed from btrfs's point of view. If you have another bridge chip with another raid0 on it, btrfs can recover from that.

  • btrfs-device-1 - bad-bridge - md-raid0(sda, sdb, sdc, sdd)
  • btrfs-device-2 - good-bridge - md-raid0(sde, sdf, sdg, sdh)

In the above setup, when bad-bridge fails, only btrfs device 1 fails. btrfs restores the lost data from btrfs-device-2.

Note: You'd only want to use these setups if you can't provide one bridge per device. When you have one bridge per device, their failures are isolated and independent. When there are multiple devices behind a single bridge, the device failures aren't isolated from a bridge failure. All the devices depend on a common bridge component for correct operation, and if there's a port multiplier, all devices sharing the port can interfere with each other.

Getting a write order bug corrupting the filesystem would only happen accidentally when unplugging power (unplugging USB) while writing?

They can also happen during bus timeouts or resets which in turn can be triggered by UNC error handling.

And I would need to repeat it several times to be sure a device is good? Because only an interrupted incomplete write action would reveal the wrong order, it can't be detected at once while writing?

Yes. Most devices that have these bugs will fail in under 10 tries, so it's only necessary to set up a test like:

  • continuously update a huge tree of many files
  • wait for write IO to saturate the device
  • kill power
  • repeat 10x

If the filesystem is still mountable after 10 tries, the hardware is probably good. Bad hardware fails very quickly.
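A minimal write-load generator for such a test could look like this (the mount point is a placeholder, and the power cut has to come from outside, e.g. a switched outlet, since the point is to interrupt the hardware mid-write):

```shell
#!/bin/sh
# Continuously rewrite a tree of many small files to keep the device saturated
mkdir -p /mnt/test/dir0 /mnt/test/dir1 /mnt/test/dir2 /mnt/test/dir3
while true; do
    for i in $(seq 1 1000); do
        dd if=/dev/urandom of="/mnt/test/dir$((i % 4))/file$i" \
           bs=64k count=4 2>/dev/null
    done
done
# ...kill power at the wall while this runs, then power up and try to mount.
```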

Why is btrfs more affected by these bugs than other filesystems?

It isn't. ext4 in legacy-free mode has similar vulnerabilities in its extent trees. ZFS has even stricter requirements than btrfs for devices that are to be used for its SLOG.

Legacy filesystems can survive these failures because they never provided data integrity guarantees in the first place. Data corruption after a crash is part of the filesystem spec. Metadata is stored in fixed locations and there are no conflicting sources of reference information, so all a recovery tool has to do is delete data until the filesystem is consistent again.

Databases like postgresql do have data integrity guarantees, and multiple sources of reference information, and they can be unrecoverably corrupted. Anything that relies on fsync to work can't run properly on these devices, because it's not possible to implement fsync on a device that does not respect write order constraints.

btrfs doesn't overwrite anything but the superblocks, so there are no fixed locations for metadata. btrfs relies on trees of pointers to tell it where all of the metadata is. If a drive fails to write a tree out completely, then the filesystem can only be recovered by brute force data recovery methods: searching the entire drive for filesystem metadata pages, figuring out which ones hold current data, filling in missing pages by inference, and assembling them back into a coherent whole. This is much more complex than "delete data until the filesystem is consistent again."

Also, once btrfs has been informed by the drive that new data has been written, btrfs will overwrite or discard old metadata that is no longer valid. So you can't even find old metadata by searching for it--if the drive lied about writing the metadata, then the metadata is gone.

Ok, because in order to get a refund for all the purchased faulty bridges: the support says, if it works with exFAT, then it's not broken :)

When purchasing at the low end, you're getting a discount because there's no certification of correctness by the vendor. Buy 5 board models, get 4 that work, order more of the 4 models that work, and put the other one in a desk drawer, never to be used in production. The cost of the devices that end up in the drawer is the price paid to obtain quality.

Because there's currently no certification requirement, and because low-end hardware is the storage industry's scheme to transfer the burden of e-waste disposal to its customers, it's not currently possible to reliably order a working model from spec, or claim warranty on models that don't work. Consumers are left to discover which devices work by themselves.

Zygo avatar Jul 28 '25 16:07 Zygo

If the bridge chip supports multiple downstream devices, and the bridge chip is bad, it will cause corruption on all connected devices at the same time. btrfs can only recover from one failure at a time with raid1.

But I meant having at least 1 good device + 4 on the bad bridge, making a raid1 with 5 disks. You mean if all 4 fail simultaneously, raid1 can't recover from the 1 good device?

Consumers are left to discover which devices work by themselves.

Is there a trivial "how to reproduce" on exFAT that shows wrong write order on power loss? I guess not, without doing the raid test with sophisticated filesystems that ensure data integrity.

Massimo-B avatar Jul 29 '25 08:07 Massimo-B

btrfs detects lost writes and corrupted data in general,

How could I test a new device, if not creating a raid? Just doing the power drop several times during write process and see if next RW mount has no errors?

Massimo-B avatar Jul 29 '25 08:07 Massimo-B

Especially cheap USB sticks may be optimized to ensure integrity of the data blocks only at the typical location of the FAT, and to do proper wear leveling only in that area of the storage. This is often exploited by fake-capacity clones: the FAT table looks fine, and directories look fine too, but actual file data is stored in some sort of ring buffer, and you lose data just by writing data. FAT doesn't detect this: it has no way of ensuring file content integrity, like most other filesystems (many Unix filesystems included).

You are seeing more problems with btrfs just because it is able to detect these problems. Additionally, btrfs is more complex at writing data which makes write ordering issues more prominent, but that's not part of the inherent problem.

But I meant having at least 1 good device + 4 on the bad bridge, making a raid1 with 5 disks. You mean if all 4 fail simultaneously, raid1 can't recover from the 1 good device?

On a btrfs raid1 with 5 disks, data is ensured to be mirrored to exactly two different disks of the set. Killing more than 1 disk will kill your data, no recovery possible. This is why @Zygo recommended combining the disks behind the same bridge into one virtual disk via MD first, and then attaching that to btrfs. That way you can ensure that one side of the mirror always ends up on the single good disk.

kakra avatar Jul 29 '25 08:07 kakra

My fault, I thought raid1 on 5 disks has 5 copies, but only has 2. Is that special about btrfs raid1? Ok, let's say raid1c3 with 1 known-good and 2 maybe-bad, or raid1c4 with 1 good and 3 bad would work to test 2 or 3 devices?

Massimo-B avatar Jul 29 '25 10:07 Massimo-B

My fault, I thought raid1 on 5 disks has 5 copies, but only has 2. Is that special about btrfs raid1?

Yes, btrfs raid1 always keeps exactly 2 copies, so it cuts total capacity in half, while a hardware RAID1 implementation provides only the capacity of the smallest disk (one copy per disk). So in capacity terms, btrfs raid1 acts somewhere between raid5 and raid1 compared to a standard RAID controller.

My fault, I thought raid1 on 5 disks has 5 copies, but only has 2. Is that special about btrfs raid1? Ok, let's say raid1c3 with 1 known-good and 2 maybe-bad, or raid1c4 with 1 good and 3 bad would work to test 2 or 3 devices?

n mirrors allow for (n-1) devices to fail before you lose data. Thus, c3 allows two missing devices, c4 allows three missing devices.

This also means: raid1c3 cuts capacity 1:3, raid1c4 cuts capacity 1:4.
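So for the bridge-testing scenario, the profile has to match the number of suspect devices; a sketch with placeholder device names:

```shell
# 1 known-good + 2 suspect devices: raid1c3 puts one copy on every device
mkfs.btrfs -d raid1c3 -m raid1c3 /dev/good /dev/bad1 /dev/bad2

# 1 known-good + 3 suspect devices: raid1c4
mkfs.btrfs -d raid1c4 -m raid1c4 /dev/good /dev/bad1 /dev/bad2 /dev/bad3
```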

kakra avatar Jul 29 '25 13:07 kakra

My fault, I thought raid1 on 5 disks has 5 copies, but only has 2. Is that special about btrfs raid1?

The original paper on RAID, which defines the level numbers, only considers 2-disk cases for raid1. The rest of the paper is about methods to reduce the extravagant expense of having two complete copies, and arrangements that are more expensive are not considered. Also note that the paper does not consider raid0, JBOD, raid10, or raid6; those terms arose later.

So "RAID level 1", or "RAID1", has exactly 2 copies on exactly 2 devices.

Linux mdadm is a notable exception to this rule--its "level 1" replicates a copy across every device, no matter how many are provided. mdadm is the tool most people encounter first, so its quirks have distorted the language.

btrfs is also a notable exception, but it goes the other way: strictly 2 copies, but unlimited devices in something like a JBOD arrangement. btrfs doesn't do RAID at the device level, so the terminology of btrfs doesn't translate to device-oriented implementations (e.g. most implementations can't efficiently use 3 different-sized devices with 2 copies).

Technically, neither mdadm nor btrfs should be called raid1, but we are stuck with the names now.

Zygo avatar Jul 29 '25 16:07 Zygo

btw. I always have a LUKS device between the hardware and btrfs, does that matter for these tests? I'm currently testing with a single USB device, doing a large btrfs send/receive and just unplugging USB during that operation, then seeing if btrfs mounts again without errors.

I have done that 5 times now and was not able to reproduce it on the known-buggy bridge. To be honest, the recent crash was a power loss of the complete machine, not only a USB disconnection. Maybe the bug is only triggered that way.

Massimo-B avatar Aug 01 '25 09:08 Massimo-B

Did you disable cache flushing, either via mount option or hdparm? I disabled write caching for that matter and never had a problem again.

kakra avatar Aug 01 '25 13:08 kakra

hdparm -W says write-caching = not supported. The only difference was, when crashing, I had the option nodiscard set. Now for testing I just do:

# cryptsetup luksOpen --allow-discards --persistent /dev/sdb3 mobiledata_crypt

# mount -o compress=zstd:9,subvol=/ /dev/mapper/mobiledata_crypt /mnt/dummy

# mount |grep dummy
/dev/mapper/mobiledata_crypt on /mnt/dummy type btrfs (rw,relatime,compress=zstd:9,ssd,space_cache=v2,subvolid=5,subvol=/)

First I'd like to be able to reproduce the error on my previously broken bridge, before comparing against a firmware upgrade or new bridges, to be sure I actually solve the issue I think I have.

Massimo-B avatar Aug 01 '25 14:08 Massimo-B

hdparm -W says write-caching = not supported.

Yeah, this is probably because your USB-SATA bridge doesn't know how to handle the query ("not supported" doesn't mean "disabled", neither does it mean "unused", it just means "I don't know what you asked for" - much the same way as "supported" doesn't mean "enabled" or "disabled"). You could attach the disk directly to SATA once, then disable write caching via smartctl and save it to the device permanently.

kakra avatar Aug 01 '25 16:08 kakra

You could attach the disk directly to SATA once, then disable write caching via smartctl and save it to the device permanently.

It's a PCIe NVMe. So you mean, write caching setting is permanently stored on the NVMe?

How could I test a new device, if not creating a raid? Just doing the power drop several times during write process and see if next RW mount has no errors?

Is this a valid, equivalent test for detecting faulty bridges, compared to your proposal using a RAID1: using a single data device and unplugging USB, then remounting and seeing if it mounts RW? Or do I need a scrub to detect a corruption in this scenario?

Massimo-B avatar Aug 07 '25 13:08 Massimo-B

It's a PCIe NVMe. So you mean, write caching setting is permanently stored on the NVMe?

smartctl has an option to store the current settings permanently on the device. Whether it works for your disk, I don't know. Reboot and see if it reverted to defaults or kept the setting, then put it back into the USB adapter. Doing this through the adapter directly probably won't work, because the adapter doesn't seem to understand all the relevant protocols.
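For ATA disks, the persistent variant meant here is the SCT write-cache setting; a sketch with a placeholder device (whether the drive honors the ",p" persistence flag has to be verified after a power cycle):

```shell
# Disable the write cache via SCT Feature Control; ",p" asks for persistence
smartctl -s wcache-sct,off,p /dev/sda

# Check the current SCT write-cache state
smartctl -g wcache-sct /dev/sda
```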

kakra avatar Aug 07 '25 13:08 kakra