Not able to mount with -o degraded when a disk is missing after hardware failure
I have an 8-disk array, and after one of the disks died suddenly, I'm no longer able to mount it since /dev/sdh no longer exists:
❯ sudo bcachefs mount -v -o degraded,errors=remount-ro /dev/sda:/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg /mnt/storage
DEBUG - bcachefs::commands::mount: Walking udev db!
INFO - bcachefs::commands::mount: mounting with params: device: /dev/sda:/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg, target: /mnt/storage, options: degraded,errors=remount-ro
DEBUG - bcachefs::commands::mount: parsing mount options: degraded,errors=remount-ro
INFO - bcachefs::commands::mount: mounting filesystem
ERROR - bcachefs::commands::mount: Fatal error: Invalid argument
And in dmesg:
[ 3569.290085] bcachefs: bch2_fs_open() bch_fs_open err opening /dev/sda: insufficient_devices_to_start
If I try to mount it with -o very_degraded it gives the same output. Using mount.bcachefs or mount -t bcachefs gives the same result, as does mounting by UUID=55cfeccc-d8b2-4813-b1a4-9ff9212962e7.
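For reference, those other invocations look roughly like this (same options, device list, and UUID as above), and all of them fail the same way:
❯ sudo mount.bcachefs -o degraded,errors=remount-ro /dev/sda:/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg /mnt/storage
❯ sudo mount -t bcachefs -o degraded,errors=remount-ro /dev/sda:/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg /mnt/storage
❯ sudo bcachefs mount -o degraded,errors=remount-ro UUID=55cfeccc-d8b2-4813-b1a4-9ff9212962e7 /mnt/storage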
I saw that you can remove a disk by ID, so I also tried:
❯ sudo bcachefs device remove 4
Filesystem path required when specifying device by id
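Presumably the missing piece is the filesystem path as an extra argument, i.e. something like:
❯ sudo bcachefs device remove 4 /mnt/storage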
So it seems that would only work if I could mount the array first, which is exactly the problem.
❯ sudo bcachefs show-super /dev/sda
Device: ST14000NM001G-2K
External UUID: 55cfeccc-d8b2-4813-b1a4-9ff9212962e7
Internal UUID: fb8e9660-7eb6-45b4-a62a-bfbe3458974e
Magic number: c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index: 0
Label:
Version: 1.7: mi_btree_bitmap
Version upgrade complete: 1.7: mi_btree_bitmap
Oldest version on disk: 1.3: rebalance_work
Created: Sun Jan 21 14:27:32 2024
Sequence number: 669
Time of last write: Sun Jun 30 03:01:25 2024
Superblock size: 10.6 KiB/1.00 MiB
Clean: 0
Devices: 8
Sections: members_v1,replicas_v0,disk_groups,clean,journal_seq_blacklist,journal_v2,counters,members_v2,errors,ext,downgrade
Features: journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features: alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done
Options:
block_size: 4.00 KiB
btree_node_size: 256 KiB
errors: continue [fix_safe] panic ro
metadata_replicas: 2
data_replicas: 1
metadata_replicas_required: 1
data_replicas_required: 1
encoded_extent_max: 64.0 KiB
metadata_checksum: none [crc32c] crc64 xxhash
data_checksum: none [crc32c] crc64 xxhash
compression: none
background_compression: none
str_hash: crc32c crc64 [siphash]
metadata_target: none
foreground_target: none
background_target: none
promote_target: none
erasure_code: 0
inodes_32bit: 1
shard_inode_numbers: 1
inodes_use_key_cache: 1
gc_reserve_percent: 8
gc_reserve_bytes: 0 B
root_reserve_percent: 0
wide_macs: 0
acl: 1
usrquota: 0
grpquota: 0
prjquota: 0
journal_flush_delay: 1000
journal_flush_disabled: 0
journal_reclaim_delay: 100
journal_transaction_names: 1
version_upgrade: [compatible] incompatible none
nocow: 0
members_v2 (size 1104):
Device: 0
Label: hd1 (1)
UUID: 4334f09b-b198-4957-8e13-0bdfc3eb8c42
Size: 12.7 TiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 512 KiB
First bucket: 0
Buckets: 26703872
Last mount: Fri Jun 21 08:10:49 2024
Last superblock write: 669
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user
Btree allocated bitmap blocksize: 512 MiB
Btree allocated bitmap: 0000000000000000000000000000000111111111111111111111111111111111
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 1
Label: hd2 (2)
UUID: 97050b01-c590-479e-a3d8-f7a1c1337c54
Size: 2.73 TiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 512 KiB
First bucket: 0
Buckets: 5723176
Last mount: Fri Jun 21 08:10:49 2024
Last superblock write: 669
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user
Btree allocated bitmap blocksize: 64.0 MiB
Btree allocated bitmap: 0000000011111111111111111111111111111111111111111111111111111111
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 2
Label: hd3 (3)
UUID: 49f40e0e-fa15-4015-be19-49584b2cf1e4
Size: 2.73 TiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 512 KiB
First bucket: 0
Buckets: 5723176
Last mount: Fri Jun 21 08:10:49 2024
Last superblock write: 669
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user
Btree allocated bitmap blocksize: 64.0 MiB
Btree allocated bitmap: 0000000011111111111111111111111111111111111111111111111111111111
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 3
Label: hd4 (4)
UUID: 95a6b8ee-8de9-438b-afb0-3f08b3d2d253
Size: 2.73 TiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 512 KiB
First bucket: 0
Buckets: 5723176
Last mount: Fri Jun 21 08:10:49 2024
Last superblock write: 669
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user
Btree allocated bitmap blocksize: 64.0 MiB
Btree allocated bitmap: 0000000011111111111111111111111111111111111111111111111111111111
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 4
Label: hd5 (5)
UUID: c5eacabf-9858-4897-b4b3-d7aa6bd16b45
Size: 12.7 TiB
read errors: 24
write errors: 49
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 1.00 MiB
First bucket: 0
Buckets: 13351936
Last mount: Fri Jun 21 08:10:49 2024
Last superblock write: 669
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user
Btree allocated bitmap blocksize: 512 MiB
Btree allocated bitmap: 0000000000000000000000000000000111111111111111111111111111111111
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 5
Label: hd6 (6)
UUID: 2eeb7acc-7e82-4a16-a18c-f32ca665346c
Size: 7.28 TiB
read errors: 0
write errors: 1
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 1.00 MiB
First bucket: 0
Buckets: 7630885
Last mount: Fri Jun 21 08:10:49 2024
Last superblock write: 669
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user
Btree allocated bitmap blocksize: 128 MiB
Btree allocated bitmap: 1111111111111111111111111111111111111111111111111111111111111111
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 6
Label: hd7 (7)
UUID: 42798165-be77-49a2-a5c9-b4c998a740c4
Size: 12.7 TiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 1.00 MiB
First bucket: 0
Buckets: 13351936
Last mount: Fri Jun 21 08:10:49 2024
Last superblock write: 669
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user
Btree allocated bitmap blocksize: 256 MiB
Btree allocated bitmap: 0000000000000000000000001111111111111111111111111111111111111111
Durability: 1
Discard: 0
Freespace initialized: 1
Device: 7
Label: hd8 (8)
UUID: 92c555b5-284b-4dba-8424-640ecb812315
Size: 7.28 TiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 1.00 MiB
First bucket: 0
Buckets: 7630885
Last mount: Fri Jun 21 08:10:49 2024
Last superblock write: 669
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user
Btree allocated bitmap blocksize: 64.0 MiB
Btree allocated bitmap: 0011111111111111111111111111111111111111111111111111111111111111
Durability: 1
Discard: 0
Freespace initialized: 1
errors (size 8):
Some extra info:
❯ uname -r
6.9.7-arch1-1
❯ bcachefs version
1.9.1
❯ paru -Qs bcachefs
local/bcachefs-tools 3:1.9.2-1
Alright, so I decided to downgrade bcachefs-tools to 1.7.0 and give it another try, and lo and behold, it worked! So this seems like it might just be a tools bug in 1.9.x.
The command I ended up running:
sudo bcachefs mount UUID=55cfeccc-d8b2-4813-b1a4-9ff9212962e7 /mnt/storage \
-o fsck,fix_errors,very_degraded,nochanges,read_only,opts=ro,errors=ro
It took several hours, and spat out a lot of
btree trans held srcu lock (delaying memory reclaim) by more than 31 seconds
warnings, but otherwise seems to have had no further trouble mounting the 7 remaining disks.
I'm backing up the important stuff before I try anything else, and obviously right now there is a lot of this in dmesg (definitely not unexpected at this point):
[111019.630521] bcachefs (55cfeccc-d8b2-4813-b1a4-9ff9212962e7 inum 335745235 offset 1835008): no device to read from
[111019.717764] bcachefs (sde inum 134673851 offset 40): data read error: I/O
[111019.717841] bcachefs (55cfeccc-d8b2-4813-b1a4-9ff9212962e7 inum 134673851 offset 20480): read error 3 from btree lookup
Having the same issue. Trying the downgrade solution now.
This is the second issue I've run into as a result of a newer bcachefs-tools package (on-disk format 1.9 this time) operating on arrays running the released kernel version (on-disk format 1.7 this time), so I think there are some non-trivial edge cases getting missed due to version shear. Perhaps the kernel-version-matched bcachefs-tools should be made the primary package, with the latest tools version as a secondary or dev package. Note that I'm also on Arch, like the reporter.
I wasn't able to resolve my issue by downgrading bcachefs-tools (it hung during the backpointer checks on tools 1.7), but I was able to recover the array by upgrading the kernel to the current linux-next-git rather than the regular linux package (6.10.something). So it's definitely a version-shear issue or a bug in kernel 6.10, but one that seems to have been resolved at HEAD. The only bad news is that if people lose an array right now, they might be hosed until the next kernel release (unless they follow similar steps).
Clarification: installing linux-next-git and adding a bunch of swap space allowed me to complete a clean fsck run, not mount the filesystem. Getting the array mounted once clean did require downgrading to bcachefs-tools version 1.7.0.
Now that I've gotten everything fully back up, it seems I hit two issues:
a) Trying to mount with tools version 1.9.0 and fewer than the full set of disks failed with "invalid argument" even with -o degraded (insufficient devices to start, per the kernel logs). My intuition is that this is related to the new superblock-scanning code being active even when an explicit colon-separated device list is given on the command line. Downgrading to tools 1.7.0 allowed me to mount the filesystem with -o degraded by manually specifying the available devices.
b) Confounding factor: the filesystem was not clean due to an OOM during the previous device-remove operation. Running fsck on a 12 TB filesystem with a missing disk needed about 58 GB of virtual memory at peak, which was the root cause of the failed fsck runs. Because fsck runs automatically on mount when the filesystem is unclean (true for all of my attempts), mounting with tools 1.7.0 looked like a failure too: the kernel ran out of memory and crashed or locked up the host, prompting reboots. Getting a clean fsck first by adding more swap space (see the sketch below) allowed me to mount via tools 1.7.0 like the original reporter.
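For anyone else who hits the memory wall, adding the swap was just the usual temporary-swap-file dance. The path and the 64G size below are only what I picked (size it to your fsck's peak; mine was ~58 GB), and some filesystems need dd instead of fallocate for swap files:
# create and enable a temporary swap file on a filesystem with free space
$ sudo fallocate -l 64G /var/tmp/bcachefs-fsck.swap
$ sudo chmod 600 /var/tmp/bcachefs-fsck.swap
$ sudo mkswap /var/tmp/bcachefs-fsck.swap
$ sudo swapon /var/tmp/bcachefs-fsck.swap
# run the fsck (or the mount with -o fsck,fix_errors), then clean up
$ sudo swapoff /var/tmp/bcachefs-fsck.swap
$ sudo rm /var/tmp/bcachefs-fsck.swap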
It's unclear if upgrading to linux-next-git was necessary. I think adding the additional swap space so fsck could complete was the primary fix. However, the error messages from both the kernel and bcachefs-tools code were better, which helped. So thanks for improving those :) That said, I looked at the changelog, and there were several malloc and deadlock related changes between mainline and next that might have unblocked the fsck runs. Like I said: unclear. Regardless, all is well again with my filesystems, so this should be my last update unless there is anything I can provide to help debug or confirm a fix.
1.9.1, and possibly 1.9.0 as well, had a bug in the mount helper that resulted in mount options not getting passed through. Can you check that? Either build a newer version of -tools, or mount without the helper (-i).
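For example, with mount(8)'s -i/--internal-only flag, which skips the mount.bcachefs helper, that would be roughly (with your device list):
$ sudo mount -i -t bcachefs -o degraded,errors=remount-ro /dev/sda:/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg /mnt/storage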
I'm not able to do testing in the next day or two because I'll be out of cell service in the woods, but will try to confirm that when I'm back.
Could you add a mention of the -i flag to bcachefs mount --help, though? It sounds like it can be relevant in stressful situations, and I wasn't even aware that flag existed, despite looking for exactly that kind of option when I was initially debugging the issue.
Just for the sake of appreciation: this has been my only negative experience with bcachefs other than some deadlocks in the first kernel version. Your work on this is much appreciated and I'm proud to have been a supporter for several years.
So FWIW, this seems to be working for me now (although maybe I had a different issue):
bcachefs-tools 3:1.13.0-1
6.12.4-arch1-1
In a lab env I did:
$ losetup 2 disks...
$ echo abcd | bcachefs format --uuid=3d02cfd4-968a-4fe4-a2a0-fe84614485f6 --promote_target=ssd --foreground_target=ssd --replicas=2 --metadata_replicas_required=2 --data_replicas_required=2 --compression=zstd --encrypted --label=ssd.ssd0 /dev/loop2 --label=ssd.ssd1 /dev/loop3 --force
$ sudo losetup -d /dev/loop2
$ sudo bcachefs mount -o degraded,fsck,fix_errors --key_location=fail UUID=3d02cfd4-968a-4fe4-a2a0-fe84614485f6 /mnt/test
$ ls /mnt/test
However, I had actually done the same test a couple of months ago. Before the above re-formatting, I tested the same procedure with the old volumes and wasn't able to mount them, with either bcachefs mount or mount -t bcachefs -i. Dmesg showed logs like:
[Sat Feb 1 14:35:01 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): starting version 1.13: inode_has_child_snapshots opts=metadata_replicas=2,data_replicas=2,metadata_replicas_required=2,data_replicas_required=2,compression=zstd,foreground_target=ssd,promote_target=ssd,very_degraded,fsck,fix_errors=yes
[Sat Feb 1 14:35:01 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): initializing new filesystem
[Sat Feb 1 14:35:01 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): insufficient writeable journal devices available: have 1, need 2
rw journal devs: loop3
[Sat Feb 1 14:35:01 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): going read-write
[Sat Feb 1 14:35:01 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): bch2_dev_usage_init(): error erofs_journal_err
[Sat Feb 1 14:35:01 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): bch2_fs_initialize(): error erofs_journal_err
[Sat Feb 1 14:35:01 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): bch2_fs_start(): error starting filesystem erofs_journal_err
[Sat Feb 1 14:35:01 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): shutdown complete, journal seq 0
[Sat Feb 1 14:35:02 2025] bcachefs: bch2_fs_get_tree() error: erofs_journal_err
[Sat Feb 1 14:35:29 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): starting version 1.13: inode_has_child_snapshots opts=metadata_replicas=2,data_replicas=2,metadata_replicas_required=2,data_replicas_required=2,compression=zstd,foreground_target=ssd,promote_target=ssd,degraded,fsck,fix_errors=yes
[Sat Feb 1 14:35:29 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): initializing new filesystem
[Sat Feb 1 14:35:29 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): insufficient writeable journal devices available: have 1, need 2
rw journal devs: loop3
[Sat Feb 1 14:35:29 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): going read-write
[Sat Feb 1 14:35:29 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): bch2_dev_usage_init(): error erofs_journal_err
[Sat Feb 1 14:35:29 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): bch2_fs_initialize(): error erofs_journal_err
[Sat Feb 1 14:35:29 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): bch2_fs_start(): error starting filesystem erofs_journal_err
[Sat Feb 1 14:35:29 2025] bcachefs (3d02cfd4-968a-4fe4-a2a0-fe84614485f6): shutdown complete, journal seq 0
[Sat Feb 1 14:35:29 2025] bcachefs: bch2_fs_get_tree() error: erofs_journal_err
Oh yes, the option-passing bug - this was fixed long ago.