# Data loss due to resilver after split brain
## System information
| Type | Version/Name |
|---|---|
| Distribution Name | Debian |
| Distribution Version | 11 |
| Kernel Version | 5.10.0-21 |
| Architecture | amd64 |
| OpenZFS Version | 2.0.3-9 |
## Describe the problem you're observing
In a pool consisting of only one mirror of two drives, after a split brain situation, a resilver can overwrite the "more recent" copy.
My pool consists of a mirror of two disks.
Yesterday, disk 1 had too many I/O errors and was offlined by ZFS.
Today, disk 2 failed (disappeared from the ATA bus), suspending the pool.
I rebooted the server.
The pool came back up in degraded state with only disk 1 online (since disk 2 wasn't recognized by the system anymore).
I replugged disk 2. It was recognized again.
I onlined disk 2 again. Resilvering started copying data from disk 1 to disk 2 without further warning. All data that had been written to disk 2 that day was gone.
## Describe how to reproduce the problem
```shell
fallocate -l 256M /tmp/disk1
fallocate -l 256M /tmp/disk2
zpool create test mirror /tmp/disk[12]
date > /test/file1
zpool offline test /tmp/disk1
date > /test/file2
zpool export test
mkdir /tmp/offline
mv /tmp/disk2 /tmp/offline/
zpool import -d /tmp/disk1 test
mv /tmp/offline/disk2 /tmp/
zpool online test /tmp/disk2
ls /test
```
Although disk2 had file2 on it, it's gone now.
## Remarks
I don't have much knowledge about ZFS internals, but just looking at the output of `zdb -l`, this probably could have been detected by `zpool online` in two different ways:
- The TXG of disk2 is greater than that of disk1. This could probably change, however, if I do enough stuff to disk1 while disk2 is still gone.
- The vdev_tree on disk2 lists disk1 as offline, but a device with the same guid is online in the pool when onlining disk2.
I'm not even expecting some magic to happen to somehow merge the two disks. I just expect a (possible) split-brain scenario to be detected and a big fat warning to appear before resilvering over data.
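The first check above (a higher TXG on the returning disk) could be sketched as a pre-online sanity test. This is only an illustration: the `label_txg` helper is invented, and the embedded sample labels merely mimic what `zdb -l <device>` prints; on a live system you would pipe real `zdb` output in instead.

```shell
# Hypothetical pre-online check: compare the label TXGs of the returning
# disk and the disk already online. The heredocs below stand in for the
# nvlist that `zdb -l <device>` prints.
label_txg() { awk '/^[[:space:]]*txg:/ { print $2; exit }'; }

txg_online=$(label_txg <<'EOF'
    version: 5000
    name: 'test'
    txg: 123
EOF
)
txg_returning=$(label_txg <<'EOF'
    version: 5000
    name: 'test'
    txg: 456
EOF
)

if [ "$txg_returning" -gt "$txg_online" ]; then
    echo "WARNING: returning disk has newer TXG ($txg_returning > $txg_online):"
    echo "possible split brain -- refusing to resilver without confirmation"
fi
```

As noted above, a TXG comparison alone is not reliable (enough writes to disk1 while disk2 is away could push its TXG past disk2's), so this could only ever be a heuristic.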
Good find!
I'd second that.
There should be protection against such a scenario, and I would have expected it to already exist.
Tested on zfs-2.1.9-pve1; the issue exists there as well.
> The vdev_tree on disk2 lists disk1 as offline, but a device with the same guid is online in the pool when onlining disk2.
This would be my favorite. Detecting this within a single mirror is fairly simple, but when we catch this case, we should also catch other configurations and write tests for them.
Could you write up some possible test scenarios and possible solutions? Maybe we can put these together in a pull request with a nice solution. In the end, we could have a big fat warning before resilvering ;-)
As I said, I'm not familiar with ZFS internals. After playing around a bit more, I'm guessing that the "offline" flag is only set by `zpool offline` and might not be set when the device simply disappears.
What I can say is that in a pool with two mirrors, the same trick executed on just one mirror doesn't work: the import with the disk that was offlined fails. Maybe that's because its most recent uberblock (`zdb -lu`) doesn't match that of the other disks, which were online at all times. I wasn't able to offline one disk from each mirror at exactly the same moment (`zpool offline` accepts multiple devices but offlines them sequentially, so they end up with different uberblocks as well and the pool won't import cleanly either), but I guess it would cause the same problem as with a single mirror.
In a mirror with more than 2 drives, the same problem can occur, too.
And you could probably create similar test cases with a 2-disk raidz, a 3- or 4-disk raidz2, and a 4- to 6-disk raidz3, but why would anybody use those?
Device-mapper could probably help with virtually pulling the plug on multiple devices at the same instant (as opposed to using `zpool offline`, which sets the offline flag). Maybe I'll get around to playing with that next week.
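One possible shape for that device-mapper experiment, as a rough, untested sketch (requires root; the `pull2` name is arbitrary): wrap a loop device in a linear dm target, then swap its table for an `error` target, so the device dies instantly without `zpool offline` ever writing an offline flag.

```shell
# Rough sketch (root required): simulate a hard unplug via device-mapper.
fallocate -l 256M /tmp/disk1
fallocate -l 256M /tmp/disk2
loop1=$(losetup -f --show /tmp/disk1)
loop2=$(losetup -f --show /tmp/disk2)
sectors=$(blockdev --getsz "$loop2")

# Pass-through linear target in front of the second disk.
dmsetup create pull2 --table "0 $sectors linear $loop2 0"
zpool create test mirror "$loop1" /dev/mapper/pull2

# "Pull the plug": every I/O to pull2 now fails immediately.
dmsetup suspend pull2
dmsetup reload  pull2 --table "0 $sectors error"
dmsetup resume  pull2

# ...write to the pool, then restore the linear table the same way
# (suspend/reload/resume) to see how ZFS treats the returning disk.
```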
During pool import, ZFS just scans all available disks for the configuration with the biggest TXG number. If the only disk with the newer data is not available, how can ZFS guess that it was ever there or that it will come back? And after the pool has been imported read/write, what can it do with the "newer" disks? There is no good choice at that point, because by then the "old" disk has already received some writes and its TXGs no longer match the original ones on the "new" disk. If you are concerned about this scenario, you should add more disks to the configuration; then you would need to lose many more disks at the same time to reproduce this issue, and it is unlikely that the pool would be importable in some old state after that.
BTW: It may not be what Linux people prefer, but in TrueNAS SCALE we explicitly delay boot and pool import until ALL hardware detection is complete. That gives us more confidence that we really import the most up-to-date pool configuration.
> If the only disk with the newer data is not available, then how can ZFS guess that it was there or if it is going to be?
At that point, nothing is wrong with ZFS' behavior. It imported the pool just fine with the single disk that was available, and that's what I would expect.
> And after the pool is already imported read/write, what can it do with the "newer" disks? There is no good choice there, because by the time the "old" disk already received some writes and its TXGs don't match the original ones on the "new" disk.
As I said, I don't expect ZFS to do "magic" and fix everything. It'd just be nice if it could detect and warn about this situation.
And if there really is no way to detect such a situation (or to add on-disk metadata that makes it possible), it would at least be nice to document the possibility of data loss from a seemingly harmless `zpool online` operation (and probably others?).
I have zero knowledge of ZFS internals, just my $0.02, but: ZFS likely reacts to status events from the drive and marks a device reporting bad health or data as offline. Any opportunity to mark a device has likely already passed by then.
ZFS could approach this in reverse: the device(s) still online can be immediately marked 'clean' in a post-split scenario, so that any device re-entering an array or mirror without a 'clean' flag is interpreted as 'dirty'. (Question: under exactly what conditions can a device be marked 'clean'?)
ZFS can then integrate the 'dirty' device into the array of 'clean' devices.
If this fails and we somehow land in a split-brain scenario, the devices marked 'clean' (i.e. "having more recent data") stand out from any currently active devices lacking the flag, i.e. the 'dirty' ones.
This is how I might handle this scenario. Tools and behaviour also need to be able to recover when the user knows the opposite of what the metadata says: e.g. ZFS reports drive X in array A as good (maybe the user took the drive, stuck it in another identical system, got it working, pulled out its mirror, and this drive got the clean flag), but the user knows drive Y is good and X is old, despite what the data says. This is the exception, perhaps; most data loss occurs due to user intervention...
A single flag can be used in a ternary fashion: clean, none, or dirty. (Or: fresh, none, stale.)
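A three-state flag could drive the online decision roughly like this. A sketch only: the `resilver_policy` function and its messages are invented here, and nothing like it exists in OpenZFS today.

```shell
# $1 = flag on the disk(s) already online, $2 = flag on the returning disk.
# States: clean (known most recent), none (no flag yet), dirty (known stale).
resilver_policy() {
    case "$1/$2" in
        clean/*) echo "resilver returning disk from online disk" ;;
        */clean) echo "WARN: returning disk may be newer; ask the operator" ;;
        *)       echo "ambiguous state; refuse automatic resilver" ;;
    esac
}

resilver_policy clean dirty    # normal case: the online copy is authoritative
resilver_policy dirty clean    # split brain: the returning disk holds newer data
```

The interesting row is the second one: a 'clean' disk arriving in a pool whose online members are not 'clean' is exactly the split-brain signature described above.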
Following your sequence of events above, what might happen?
- Device A fails and is offlined; device B is marked 'clean'.
- User goes "okay, shit, time to check the SMART values on A and maybe get a new drive/comp/etc".
- Device B disappears.
- Reboot; user starts getting nervous.
- Maybe some other SAS/SATA bus BIOS shenanigans, user re-configures stuff, general chaos.
- System comes up with device A online; ZFS recognizes the array whose member is A, but A lacks the 'clean' flag.
- ZFS zpool tools or the boot log inform the user of the current state.
- User yanks B, reconnects B; ZFS goes "booyah, B came back".
- ZFS can now act automatically: since B has the 'clean' flag, and pending health checks on B, A can be resilvered.
At what point does the user, or ZFS, decide that new data written to A while the user was getting the system back up should be lost in a resilver from B? According to the above, ZFS decides automatically to overwrite A; but perhaps presenting a snapshot of A to the array, for later user intervention to recover the "A newer than B" content, is viable.