Allocator getting stuck (write workload with small foreground device and metadata_replicas=2)
This is a new filesystem; it got stuck during the initial rsync bringing data in.
Kernel is bcachefs/master (9b4ab159abcd84cf0c25ee851dda8c40baffecc8), merged with ca91b9500108d4cf083a635c2e11c884d5dd20ea from Linus's tree.
bcachefs format --encrypted --metadata_replicas=2 --metadata_checksum xxhash --data_checksum xxhash --compression=lz4 --background_compression=zstd --fs_label=Home --discard --label=hdd /dev/Consolidated/NewHome --label=ssd /dev/Consolidated/HomeCache --foreground_target=ssd --background_target=hdd --promote_target=ssd
(Note metadata_replicas=2.)
bcachefs show-super output is currently:
Device: (unknown device)
External UUID: 2c9b10da-e32f-44ab-b303-d2cd1005acf2
Internal UUID: ee332d9f-f329-4f7f-8037-91cea1c3bdda
Magic number: c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index: 0
Label: Home
Version: 1.25: extent_flags
Incompatible features allowed: 1.25: extent_flags
Incompatible features in use: 0.0: (unknown version)
Version upgrade complete: 1.25: extent_flags
Oldest version on disk: 1.25: extent_flags
Created: Wed Apr 30 09:15:57 2025
Sequence number: 57
Time of last write: Wed Apr 30 15:21:03 2025
Superblock size: 2.30 KiB/1.00 MiB
Clean: 0
Devices: 2
Sections: members_v1,crypt,replicas_v0,disk_groups,journal_v2,counters,members_v2,errors,ext,downgrade
Features: lz4,zstd,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes,incompat_version_field
Compat features: extents_above_btree_updates_done,bformat_overflow_done
Options:
block_size: 4.00 KiB
btree_node_size: 256 KiB
errors: continue [fix_safe] panic ro
write_error_timeout: 30
metadata_replicas: 2
data_replicas: 1
metadata_replicas_required: 1
data_replicas_required: 1
encoded_extent_max: 64.0 KiB
metadata_checksum: none crc32c crc64 [xxhash]
data_checksum: none crc32c crc64 [xxhash]
checksum_err_retry_nr: 3
compression: lz4
background_compression: zstd
str_hash: crc32c crc64 [siphash]
metadata_target: none
foreground_target: ssd
background_target: hdd
promote_target: ssd
erasure_code: 0
casefold: 0
inodes_32bit: 1
shard_inode_numbers_bits: 4
inodes_use_key_cache: 1
gc_reserve_percent: 8
gc_reserve_bytes: 0 B
root_reserve_percent: 0
wide_macs: 0
promote_whole_extents: 1
acl: 1
usrquota: 0
grpquota: 0
prjquota: 0
degraded: [ask] yes very no
journal_flush_delay: 1000
journal_flush_disabled: 0
journal_reclaim_delay: 100
journal_transaction_names: 1
allocator_stuck_timeout: 30
version_upgrade: [compatible] incompatible none
nocow: 0
layout:
Type: 0
Superblock max size: 1.00 MiB
Nr superblocks: 3
Offsets: 8, 2056, 7813500928
members_v2 (size 304):
Device: 0
Label: hdd (0)
UUID: 6b61a3e6-529b-4e09-a7c1-84c7670f64b0
Size: 3.64 TiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 2.00 MiB
First bucket: 0
Buckets: 1907594
Last mount: Wed Apr 30 09:23:07 2025
Last superblock write: 57
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user
Btree allocated bitmap blocksize: 64.0 MiB
Btree allocated bitmap: 0000000000001000000000000000010000000000000001001000000100110111
Durability: 1
Discard: 1
Freespace initialized: 1
Resize on mount: 0
Device: 1
Label: ssd (1)
UUID: dff9852b-6fcc-44fb-b0b8-73b278cbffad
Size: 128 GiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 2.00 MiB
First bucket: 0
Buckets: 65536
Last mount: Wed Apr 30 09:23:07 2025
Last superblock write: 57
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user,cached
Btree allocated bitmap blocksize: 4.00 MiB
Btree allocated bitmap: 0000000000100000000000000000000100000000100000000100000000011111
Durability: 1
Discard: 1
Freespace initialized: 1
Resize on mount: 0
errors (size 8):
Usage (from bcachefs fs usage) is currently:
Filesystem: 2c9b10da-e32f-44ab-b303-d2cd1005acf2
Size: 3.46 TiB
Used: 1.68 TiB
Online reserved: 357 MiB
Data type Required/total Durability Devices
btree: 1/2 2 [dm-22 dm-21] 18.9 GiB
user: 1/1 1 [dm-22] 1.54 TiB
user: 1/1 1 [dm-21] 117 GiB
Compression:
type compressed uncompressed average extent size
lz4 82.8 GiB 200 GiB 51.3 KiB
zstd 42.7 GiB 98.7 GiB 52.3 KiB
incompressible 1.54 TiB 1.54 TiB 58.0 KiB
Btree usage:
extents: 8.54 GiB
inodes: 2.83 GiB
dirents: 1.09 GiB
alloc: 249 MiB
subvolumes: 512 KiB
snapshots: 512 KiB
lru: 1.00 MiB
freespace: 512 KiB
need_discard: 3.00 MiB
backpointers: 5.25 GiB
bucket_gens: 2.00 MiB
snapshot_trees: 512 KiB
deleted_inodes: 512 KiB
logged_ops: 512 KiB
rebalance_work: 472 MiB
accounting: 484 MiB
Pending rebalance work:
200 GiB
hdd (device 0): dm-22 rw
data buckets fragmented
free: 1.99 TiB 1041517
sb: 3.00 MiB 3 3.00 MiB
journal: 8.00 GiB 4096
btree: 9.45 GiB 4855 35.8 MiB
user: 1.54 TiB 809999 208 MiB
cached: 0 B 0
parity: 0 B 0
stripe: 0 B 0
need_gc_gens: 0 B 0
need_discard: 92.0 GiB 47124
unstriped: 0 B 0
capacity: 3.64 TiB 1907594
ssd (device 1): dm-21 rw
data buckets fragmented
free: 52.0 MiB 26
sb: 3.00 MiB 3 3.00 MiB
journal: 1.00 GiB 512
btree: 9.45 GiB 4855 35.8 MiB
user: 117 GiB 60132
cached: 0 B 0
parity: 0 B 0
stripe: 0 B 0
need_gc_gens: 0 B 0
need_discard: 16.0 MiB 8
unstriped: 0 B 0
capacity: 128 GiB 65536
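Note what the two device tables say: the ssd's 65536 buckets are completely accounted for, with only 26 left free, while the hdd still has over a million. The bucket arithmetic for the ssd checks out exactly (plain shell, numbers copied from the table above):

$ echo '26 + 3 + 512 + 4855 + 60132 + 8' | bc
65536

So foreground writes and new btree nodes, which with metadata_replicas=2 must also land on the ssd, are competing for the last 26 x 2 MiB buckets.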
Here is dmesg; at the end is the backtrace that rsync is blocked on.
This got things unstuck:
echo 1 > /sys/fs/bcachefs/…/options/metadata_replicas
bcachefs data job drop_extra_replicas /mnt/…
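Presumably, once rebalance has drained the pending work off the ssd, the original redundancy can be restored by inverting those two steps (same sysfs knob and data-job interface as above; rereplicate is the job that brings existing data back up to the configured replica count):

echo 2 > /sys/fs/bcachefs/…/options/metadata_replicas
bcachefs data job rereplicate /mnt/…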
Ideally there would be a reserve so that there is always enough btree space to keep moving data to the background device.
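Until such a reserve exists, a userspace stopgap along these lines might help: poll the ssd's free-bucket count and point foreground_target at the hdd before the device fills, so btree allocations are never starved. This is only a sketch, not a bcachefs feature; the awk assumes the fs usage layout shown above, the 128-bucket threshold is arbitrary, and the mountpoint is elided as elsewhere in this report:

opts=/sys/fs/bcachefs/2c9b10da-e32f-44ab-b303-d2cd1005acf2/options
while sleep 10; do
    # free buckets on the ssd: last column of its "free:" row
    free=$(bcachefs fs usage -h /mnt/… | awk '/^ssd \(device/ { d=1 }
                                              d && $1 == "free:" { print $NF; exit }')
    if [ "$free" -lt 128 ]; then
        echo hdd > "$opts"/foreground_target   # stop filling the ssd
    else
        echo ssd > "$opts"/foreground_target
    fi
done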
So this is a two-device filesystem, replicas=2, with mismatched device sizes? This is a known bug: the capacity calculations don't take this into account.
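For the record, the shape of the problem in this report's numbers (my arithmetic, not tool output): anything that must be replicated to every device, here the btree, can only grow by min(free buckets per device), while the pooled numbers suggest terabytes of room:

$ printf '%s\n' 1041517 26 | sort -n | head -1
26

That is 26 x 2 MiB = 52 MiB of real btree headroom, against roughly 1.8 TiB of pooled free space.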