
If I remove a drive, bcachefs thinks it's still there, both in bcachefs fs usage and when writing new data.

Open b-r-o-w-n opened this issue 4 years ago • 6 comments

kernel: 5.7.0-g6288f1b60
bcachefs tool version: v0.1-227-g21ade39

The goal of this test is to simulate a two-drive mirror and see how to recover when one of the drives fails.

I have 2 drives, /dev/sda and /dev/sdb.

/usr/src/bcachefs-tools/bcachefs format --group hdd /dev/sd[ab] --replicas=2 --data_replicas_required=2 --metadata_replicas_required=2
mount -t bcachefs -o noatime /dev/sdb:/dev/sda /mnt/bcachefs1

copy some data to the fs

/usr/src/bcachefs-tools/bcachefs fs usage /mnt/bcachefs1/
Filesystem 93838812-d13d-42bd-9a44-c9a55a3a83e9:
Size: 644144683008
Used: 539501056
Online reserved: 7168

Data type Required/total Devices
btree: 1/2 [sda sdb] 2097152
data: 1/2 [sda sdb] 1024

hdd (device 0): sda readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 512 1 261632 <--- 1 512 sector/block
cached: 0 0 0
erasure coded: 0 0 0
available: 494840578048 1887667
capacity: 500107837440 1907760

hdd (device 1): sdb readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 512 1 261632 <--- 1 512 sector/block
cached: 0 0 0
erasure coded: 0 0 0
available: 197780832256 754474
capacity: 200049426432 763128

Pull /dev/sdb

/usr/src/bcachefs-tools/bcachefs fs usage /mnt/bcachefs1/
Filesystem 93838812-d13d-42bd-9a44-c9a55a3a83e9:
Size: 644144683008
Used: 539501056
Online reserved: 7168

Data type Required/total Devices
btree: 1/2 [sda sdb] 2097152
data: 1/2 [sda sdb] 1024

hdd (device 0): sda readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 512 1 261632
cached: 0 0 0
erasure coded: 0 0 0
available: 494840578048 1887667
capacity: 500107837440 1907760

hdd (device 1): sdb readwrite <---- still thinks it's here... a ghost drive now.
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 512 1 261632
cached: 0 0 0
erasure coded: 0 0 0
available: 197780832256 754474
capacity: 200049426432 763128

[ 774.380992] ata6: SATA link down (SStatus 0 SControl 300)
[ 780.109789] ata6: SATA link down (SStatus 0 SControl 300)
[ 785.741859] ata6: SATA link down (SStatus 0 SControl 300)
[ 785.741868] ata6.00: disabled
[ 785.741895] ata6.00: detaching (SCSI 5:0:0:0)
[ 785.742656] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
[ 785.742729] sd 5:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00
[ 785.742732] sd 5:0:0:0: [sdb] Stopping disk
[ 785.742744] sd 5:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00

add a new drive

[ 1409.578001] ata5: link is slow to respond, please be patient (ready=0)
[ 1409.734013] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1409.742356] ata5.00: configured for UDMA/133
[ 1411.764958] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 1411.768085] ata6.00: ATA-7: WDC WD1500ADFD-00NLR5, 21.07QR5, max UDMA/133
[ 1411.768090] ata6.00: 293046768 sectors, multi 0: LBA48 NCQ (depth 32), AA
[ 1411.772024] ata6.00: configured for UDMA/133
[ 1411.772182] scsi 5:0:0:0: Direct-Access ATA WDC WD1500ADFD-0 7QR5 PQ: 0 ANSI: 5
[ 1411.772577] sd 5:0:0:0: [sdd] 293046768 512-byte logical blocks: (150 GB/140 GiB)
[ 1411.772591] sd 5:0:0:0: [sdd] Write Protect is off
[ 1411.772595] sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[ 1411.772614] sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1411.780223] sd 5:0:0:0: [sdd] Attached SCSI disk

fs usage still sees the empty new drive as present... and full of data....

/usr/src/bcachefs-tools/bcachefs fs usage /mnt/bcachefs1/
Filesystem 93838812-d13d-42bd-9a44-c9a55a3a83e9:
Size: 644144683008
Used: 539501056
Online reserved: 7168

Data type Required/total Devices
btree: 1/2 [sda sdb] 2097152
data: 1/2 [sda sdb] 1024

hdd (device 0): sda readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 512 1 261632
cached: 0 0 0
erasure coded: 0 0 0
available: 494840578048 1887667
capacity: 500107837440 1907760

hdd (device 1): sdb readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 512 1 261632 <----------- bogus!
cached: 0 0 0
erasure coded: 0 0 0
available: 197780832256 754474
capacity: 200049426432 763128

add the new drive to the array

/usr/src/bcachefs-tools/bcachefs device add /mnt/bcachefs1 /dev/sdd

/usr/src/bcachefs-tools/bcachefs fs usage /mnt/bcachefs1/
Filesystem 93838812-d13d-42bd-9a44-c9a55a3a83e9:
Size: 782181198848
Used: 1076896256
Online reserved: 7168

Data type Required/total Devices
btree: 1/2 [sda sdb] 2097152
data: 1/2 [sda sdb] 1024

hdd (device 0): sda readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 512 1 261632
cached: 0 0 0
erasure coded: 0 0 0
available: 494840578048 1887667
capacity: 500107837440 1907760

hdd (device 1): sdb readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 512 1 261632
cached: 0 0 0
erasure coded: 0 0 0
available: 197780832256 754474
capacity: 200049426432 763128

(no label) (device 2): sdd readwrite
data buckets fragmented
sb: 135168 1 389120
journal: 536870912 1024 0
btree: 0 0 0
data: 0 0 0 <----- empty
cached: 0 0 0
erasure coded: 0 0 0
available: 148019085312 282324
capacity: 150039691264 286178

I have 2 real drives and one ghost drive. The new drive has no data on it...

now let's replicate the data

/usr/src/bcachefs-tools/bcachefs data rereplicate /mnt/bcachefs1
0% complete: current position none
Done

/usr/src/bcachefs-tools/bcachefs fs usage /mnt/bcachefs1/
Filesystem 93838812-d13d-42bd-9a44-c9a55a3a83e9:
Size: 782181198848
Used: 1076896256
Online reserved: 7168

Data type Required/total Devices
btree: 1/2 [sda sdb] 2097152
data: 1/2 [sda sdb] 1024

hdd (device 0): sda readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 512 1 261632
cached: 0 0 0
erasure coded: 0 0 0
available: 494840578048 1887667
capacity: 500107837440 1907760

hdd (device 1): sdb readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 512 1 261632
cached: 0 0 0
erasure coded: 0 0 0
available: 197780832256 754474
capacity: 200049426432 763128

(no label) (device 2): sdd readwrite
data buckets fragmented
sb: 135168 1 389120
journal: 536870912 1024 0
btree: 0 0 0
data: 0 0 0 <-------- no data replicated... because the ghost drive is ok!
cached: 0 0 0
erasure coded: 0 0 0
available: 148019085312 282324
capacity: 150039691264 286178

Let's try an experiment: write more data. What should this replicate to?

/usr/src/bcachefs-tools/bcachefs fs usage /mnt/bcachefs1/
Filesystem 93838812-d13d-42bd-9a44-c9a55a3a83e9:
Size: 782181198848
Used: 1075855360
Online reserved: 14336

Data type Required/total Devices
btree: 1/1 [sda] 1048576
data: 1/1 [sda] 512
data: 1/2 [sda sdb] 1024

hdd (device 0): sda readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 1048576 4 0
data: 1024 2 523264 <---- more data here....
cached: 0 0 0
erasure coded: 0 0 0
available: 494840578048 1887667
capacity: 500107837440 1907760

hdd (device 1): sdb readwrite
data buckets fragmented
sb: 135168 1 126976
journal: 268435456 1024 0
btree: 0 0 0
data: 512 1 261632 <--- this cannot change... it's not there...
cached: 0 0 0
erasure coded: 0 0 0
available: 197781880832 754478
capacity: 200049426432 763128

(no label) (device 2): sdd readwrite
data buckets fragmented
sb: 135168 1 389120
journal: 536870912 1024 0
btree: 0 0 0
data: 0 0 0 <----- this is bad.... we are not creating two copies...
cached: 0 0 0
erasure coded: 0 0 0
available: 148019085312 282324
capacity: 150039691264 286178
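
The single-copy writes are easiest to spot in the "Data type / Required/total / Devices" rows of the last fs usage output. As a rough sketch (not part of bcachefs-tools), the Python below scans such rows and flags any that have fewer replicas than wanted or that list a device known to be gone. It assumes the second number in "1/2" is the replica count for that row, and the names `find_suspect_rows`, `wanted_replicas` and `missing_devices` are made up for illustration.

```python
import re

# Matches one replica-summary row, e.g. "data: 1/2 [sda sdb] 1024".
# Assumption: the number after the slash is how many replicas the row has.
ROW = re.compile(r"^(\w+):\s+(\d+)/(\d+)\s+\[([^\]]*)\]\s+(\d+)$")


def find_suspect_rows(lines, wanted_replicas, missing_devices):
    """Return (row, reason) pairs for rows that look under-replicated."""
    suspects = []
    for line in lines:
        row = line.strip()
        m = ROW.match(row)
        if not m:
            continue  # not a replica-summary row
        total = int(m.group(3))
        devices = m.group(4).split()
        if total < wanted_replicas:
            suspects.append((row, "fewer replicas than wanted"))
        elif any(dev in missing_devices for dev in devices):
            suspects.append((row, "replica on a missing device"))
    return suspects


# Rows copied from the fs usage output after writing more data.
rows = [
    "btree: 1/1 [sda] 1048576",
    "data: 1/1 [sda] 512",
    "data: 1/2 [sda sdb] 1024",
]

for row, reason in find_suspect_rows(rows, wanted_replicas=2, missing_devices={"sdb"}):
    print(f"{row}  <- {reason}")
```

With the rows above, all three are flagged: the two 1/1 rows because only sda holds them, and the 1/2 row because its second copy lives on the pulled sdb.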

and we got i/o errors when trying to write to /dev/sdb

[ 1565.372383] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for superblock write: I/O
[ 1738.763400] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for superblock write: I/O
[ 1738.764456] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 1738.764458] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 1738.781505] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for superblock write: I/O
[ 1738.798399] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for superblock write: I/O
[ 1738.799327] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 1738.799330] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 1738.813349] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for superblock write: I/O
[ 2091.788248] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for btree write: I/O
[ 2091.789071] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for superblock write: I/O
[ 2091.997492] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for btree write: I/O
[ 2092.724420] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for btree write: I/O
[ 2092.761851] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 2092.761856] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 2093.913824] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 2093.913828] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 2098.606206] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for data write: I/O
[ 2098.608523] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for superblock write: I/O
[ 2098.631960] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 2098.631965] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 2098.748331] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for btree write: I/O
[ 2099.673832] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O
[ 2099.673836] bcachefs (93838812-d13d-42bd-9a44-c9a55a3a83e9): IO error on sdb for journal write: I/O

which is expected... it's not there...

b-r-o-w-n, Aug 04 '20 22:08

oh this is a really cool test, thanks for pointing this out!

So what we're currently lacking is a way for the filesystem to get notified when that device gets yanked, as you noticed. You should be able to manually set it to failed with bcachefs device set-state.

It's debatable whether we should be setting the device to failed from within the kernel, as flaky SATA connections are definitely a thing... we'd also need to handle the device coming back.

It also appears that part of the per-device IO error path is missing: when we get too many IO errors from a device, we should be setting it to failed. I'll get on fixing that.
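
To make the "too many IO errors" idea concrete, here is a small Python sketch of one possible policy: count errors per device inside a sliding time window and report when the count crosses a limit. This is only an illustration, not how the bcachefs kernel code works or will work; `ErrorWindow`, `max_errors` and `window_seconds` are hypothetical names, and the threshold values are arbitrary.

```python
from collections import deque


class ErrorWindow:
    """Sliding-window IO error counter for one device (illustrative only)."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors          # hypothetical tolerated error count
        self.window_seconds = window_seconds  # hypothetical window length
        self.timestamps = deque()

    def record_error(self, now):
        """Record an error at time `now` (seconds).

        Returns True when more than `max_errors` errors fall inside the
        window, i.e. when this policy would mark the device failed.
        """
        self.timestamps.append(now)
        # Forget errors that are older than the window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_errors


# Timestamps of the first five sdb errors from the kernel log in this issue.
sdb = ErrorWindow(max_errors=3, window_seconds=1.0)
for t in [1565.372383, 1738.763400, 1738.764456, 1738.764458, 1738.781505]:
    if sdb.record_error(t):
        print(f"sdb would be marked failed at {t}")
```

The window is what keeps a single stray error (like the one at 1565.372383) from counting against the device minutes later, which matters if flaky SATA links should be tolerated.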

koverstreet, Aug 13 '20 17:08

Maybe if SATA is dropping out too often it's best to set it to failed.

A heuristic that can detect that could be considered, and maybe also an "ejected" state that bypasses the heuristic and lets you remove the device momentarily for inspection (like replacing a cable).
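
A sketch of how such an "ejected" state might interact with a drop-counting heuristic, written in Python for brevity. Everything here is hypothetical: bcachefs has no "ejected" state as far as this thread shows, and `Device`, `max_drops`, `eject`, `link_down` and `link_up` are invented names.

```python
from enum import Enum


class DeviceState(Enum):
    RW = "rw"
    EJECTED = "ejected"  # hypothetical: operator removed the device on purpose
    FAILED = "failed"


class Device:
    def __init__(self, name, max_drops):
        self.name = name
        self.max_drops = max_drops  # hypothetical tolerated number of link drops
        self.drops = 0
        self.state = DeviceState.RW

    def eject(self):
        """Operator announces a planned removal; later drops are not counted."""
        if self.state is DeviceState.RW:
            self.state = DeviceState.EJECTED

    def link_down(self):
        """Called when the link drops; only unplanned drops are counted."""
        if self.state is not DeviceState.RW:
            return  # ejected or already failed: nothing to count
        self.drops += 1
        if self.drops > self.max_drops:
            self.state = DeviceState.FAILED

    def link_up(self):
        """Called when the link returns; an ejected device goes back to rw."""
        if self.state is DeviceState.EJECTED:
            self.state = DeviceState.RW


flaky = Device("sdb", max_drops=2)
for _ in range(3):
    flaky.link_down()
print(flaky.state)  # DeviceState.FAILED

recabled = Device("sdb", max_drops=2)
recabled.eject()
for _ in range(5):
    recabled.link_down()
recabled.link_up()
print(recabled.state, recabled.drops)  # DeviceState.RW 0
```

The point of the separate state is that a deliberate removal never feeds the failure counter, so swapping a cable cannot push a healthy device into failed.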

YellowOnion, Mar 24 '22 00:03

Hi,

I responded to his problem via e-mail. The response never made it here.

My suggestion was to look at what btrfs does. Why reinvent the wheel?

I decided to take a look myself.

Btrfs does detect when a drive is removed. It ends up setting the Device size to zero. (see btrfs device usage ...)

When you remove a drive and no longer have enough media to support the raid level, nothing much happens until you try to write data to the fs. This triggers an internal fault, since it cannot create (in this example) the second copy.
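
The behaviour described above, where a write fails once too few devices remain for the requested number of copies, can be sketched in a few lines of Python. This is a toy model of the idea only, not btrfs or bcachefs code; `choose_replica_targets` and `NotEnoughDevices` are made-up names.

```python
class NotEnoughDevices(Exception):
    """Raised when a write cannot get the required number of copies."""


def choose_replica_targets(devices, replicas_required):
    """Pick devices for a write, or fail if too few are online.

    `devices` maps a device name to True (online) or False (missing).
    """
    online = [name for name, is_online in devices.items() if is_online]
    if len(online) < replicas_required:
        raise NotEnoughDevices(
            f"need {replicas_required} copies, only {len(online)} device(s) online"
        )
    return online[:replicas_required]


# Both mirror members present: the write gets two targets.
print(choose_replica_targets({"sda": True, "sdb": True}, 2))

# sdb pulled: the write is refused instead of silently storing one copy.
try:
    choose_replica_targets({"sda": True, "sdb": False}, 2)
except NotEnoughDevices as err:
    print("write failed:", err)

# New drive added: writes go to the two devices that are really there.
print(choose_replica_targets({"sda": True, "sdb": False, "sdd": True}, 2))
```

In the test in this issue the equivalent of the `is_online` flag never changed for sdb, which is why new writes ended up with a single copy on sda.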

So, you might be able to use what they do to save some work.

b-r-o-w-n, Mar 24 '22 19:03