bcachefs
bcachefs copied to clipboard
bcachefs as root file system is hanging indefinitely, unknown cause
For context: I'm installing Arch on a tiny little laptop. I have only just gotten the thing booting into the distro (haven't even created a user account yet) and for some reason my root partition (that I formatted with bcachefs) appears to be causing issues.
Specifically reading files from the device appears okay, however writing anything to it appears to hang forever, this makes diagnosing the issue difficult. I'm not ruling out a hardware issue (laptop is a pile of garbage on the best of days) but I have confirmed that 'badblocks' doesn't see anything wrong with the underlying block device.
I was able to extract a dmesg log to a usb drive which I hope helps, it appears to be complaining about 'bch-copygc' but I really don't know how to interpret these errors so I'll just include the log in it's entirety: kernel.log
I'm unsure what other information would be of value, if there is any other info that would be of value I can try to collect it.
kernel version would be a starting point
Linux version: 6.9.7-arch1-1
This is one of the errors I saw in the log:
[ 368.345299] INFO: task bch-copygc/dm-0:298 blocked for more than 122 seconds.
[ 368.348641] Not tainted 6.9.7-arch1-1 #1
[ 368.352015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 368.353103] task:bch-copygc/dm-0 state:D stack:0 pid:298 tgid:298 ppid:2 flags:0x00004000
[ 368.353115] Call Trace:
[ 368.353118] <TASK>
[ 368.353122] __schedule+0x3c7/0x1510
[ 368.353136] schedule+0x27/0xf0
[ 368.353142] __closure_sync+0x7e/0x140
[ 368.353150] __bch2_write+0x136b/0x1660 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.353316] ? six_relock_ip+0x38/0x80 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.353459] ? local_clock_noinstr+0xd/0xd0
[ 368.353465] ? __kmalloc+0x1a7/0x440
[ 368.353475] ? local_clock_noinstr+0xd/0xd0
[ 368.353480] ? local_clock+0x15/0x30
[ 368.353487] ? bch2_moving_ctxt_do_pending_writes+0x11a/0x220 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.353633] bch2_moving_ctxt_do_pending_writes+0x11a/0x220 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.353780] ? bch2_btree_path_traverse_one+0x958/0xcf0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.353908] bch2_data_update_init+0x68b/0x1420 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.354052] ? bch2_move_extent+0x3da/0xed0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.354197] bch2_move_extent+0x3da/0xed0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.354347] ? bch2_evacuate_bucket+0x9d4/0xc00 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.354492] bch2_evacuate_bucket+0x9d4/0xc00 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.354644] ? bch2_copygc+0x210/0x880 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.354816] bch2_copygc+0x210/0x880 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.354972] bch2_copygc_thread+0x152/0x3d0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.355105] ? bch2_copygc_thread+0xcf/0x3d0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.355238] ? __pfx_bch2_copygc_thread+0x10/0x10 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[ 368.355367] kthread+0xd2/0x100
[ 368.355374] ? __pfx_kthread+0x10/0x10
[ 368.355380] ret_from_fork+0x34/0x50
[ 368.355386] ? __pfx_kthread+0x10/0x10
[ 368.355391] ret_from_fork_asm+0x1a/0x30
[ 368.355398] </TASK>
I'll see if I can get it to mount the debugfs and get more info.
That looks like copygc blocking on the allocator - doh.
Try 6.10; it dumps a bunch of info when the allocator get stucks.
Probably related:
[ 247.065254] INFO: task kworker/u65:0:114 blocked for more than 122 seconds.
[ 247.065359] Not tainted 6.10.8 #1-NixOS
[ 247.065416] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 247.065505] task:kworker/u65:0 state:D stack:0 pid:114 tgid:114 ppid:2 flags:0x00004000
[ 247.065518] Workqueue: events_unbound __bch2_read_endio [bcachefs]
[ 247.065672] Call Trace:
[ 247.065676] <TASK>
[ 247.065687] __schedule+0x3fa/0x1560
[ 247.065711] ? __pfx_bch2_six_check_for_deadlock+0x10/0x10 [bcachefs]
[ 247.065831] schedule+0x27/0xf0
[ 247.065842] six_lock_slowpath.isra.0+0x207/0x4c0 [bcachefs]
[ 247.065987] bch2_btree_node_get+0x38a/0x4b0 [bcachefs]
[ 247.066097] ? __bch2_rbio_narrow_crcs+0x145/0x3c0 [bcachefs]
[ 247.066269] bch2_btree_path_traverse_one+0x759/0xd20 [bcachefs]
[ 247.066424] ? __bch2_rbio_narrow_crcs+0x145/0x3c0 [bcachefs]
[ 247.066635] ? bch2_btree_path_traverse_one+0xe5/0xd20 [bcachefs]
[ 247.066795] ? srso_return_thunk+0x5/0x5f
[ 247.066806] ? __bch2_btree_path_make_mut+0x42/0x2c0 [bcachefs]
[ 247.066974] bch2_btree_iter_peek_slot+0xef/0x750 [bcachefs]
[ 247.067133] ? srso_return_thunk+0x5/0x5f
[ 247.067142] ? btree_path_get_locks+0x67/0x210 [bcachefs]
[ 247.067317] ? srso_return_thunk+0x5/0x5f
[ 247.067327] ? bch2_path_get+0x385/0x3a0 [bcachefs]
[ 247.067485] ? __bch2_rbio_narrow_crcs+0x13d/0x3c0 [bcachefs]
[ 247.067673] ? __lruvec_stat_mod_folio+0x81/0xa0
[ 247.067692] __bch2_rbio_narrow_crcs+0x145/0x3c0 [bcachefs]
[ 247.067885] ? __bch2_time_stats_update+0xec/0x320 [bcachefs]
[ 247.068074] bch2_rbio_narrow_crcs+0x3c/0xc0 [bcachefs]
[ 247.068234] __bch2_read_endio+0x8c2/0x990 [bcachefs]
[ 247.068415] ? srso_return_thunk+0x5/0x5f
[ 247.068422] ? ttwu_queue_wakelist+0xd8/0x100
[ 247.068433] ? srso_return_thunk+0x5/0x5f
[ 247.068438] ? try_to_wake_up+0x208/0x660
[ 247.068450] ? process_one_work+0x192/0x3b0
[ 247.068459] ? __pfx___bch2_read_endio+0x10/0x10 [bcachefs]
[ 247.068585] process_one_work+0x192/0x3b0
[ 247.068597] worker_thread+0x244/0x350
[ 247.068606] ? __pfx_worker_thread+0x10/0x10
[ 247.068615] kthread+0xd0/0x100
[ 247.068623] ? __pfx_kthread+0x10/0x10
[ 247.068630] ret_from_fork+0x34/0x50
[ 247.068638] ? __pfx_kthread+0x10/0x10
[ 247.068645] ret_from_fork_asm+0x1a/0x30
[ 247.068663] </TASK>
[ 247.068683] INFO: task kworker/u65:7:325 blocked for more than 122 seconds.
[ 247.068759] Not tainted 6.10.8 #1-NixOS
[ 247.068812] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 247.068891] task:kworker/u65:7 state:D stack:0 pid:325 tgid:325 ppid:2 flags:0x00004000
[ 247.068902] Workqueue: events_unbound __bch2_read_endio [bcachefs]
[ 247.069032] Call Trace:
[ 247.069036] <TASK>
[ 247.069042] __schedule+0x3fa/0x1560
[ 247.069058] ? __pfx_bch2_six_check_for_deadlock+0x10/0x10 [bcachefs]
[ 247.069164] schedule+0x27/0xf0
[ 247.069173] six_lock_slowpath.isra.0+0x207/0x4c0 [bcachefs]
[ 247.069315] __bch2_btree_node_lock_write+0x93/0x120 [bcachefs]
[ 247.069483] bch2_trans_lock_write+0x20c/0x270 [bcachefs]
[ 247.069653] __bch2_trans_commit+0x6f2/0x1720 [bcachefs]
[ 247.069834] bch2_rbio_narrow_crcs+0x7b/0xc0 [bcachefs]
[ 247.070024] __bch2_read_endio+0x8c2/0x990 [bcachefs]
[ 247.070227] ? srso_return_thunk+0x5/0x5f
[ 247.070236] ? ttwu_queue_wakelist+0xd8/0x100
[ 247.070248] ? srso_return_thunk+0x5/0x5f
[ 247.070256] ? try_to_wake_up+0x208/0x660
[ 247.070270] ? process_one_work+0x192/0x3b0
[ 247.070280] ? __pfx___bch2_read_endio+0x10/0x10 [bcachefs]
[ 247.070468] process_one_work+0x192/0x3b0
[ 247.070483] worker_thread+0x244/0x350
[ 247.070495] ? __pfx_worker_thread+0x10/0x10
[ 247.070506] kthread+0xd0/0x100
[ 247.070515] ? __pfx_kthread+0x10/0x10
[ 247.070525] ret_from_fork+0x34/0x50
[ 247.070533] ? __pfx_kthread+0x10/0x10
[ 247.070542] ret_from_fork_asm+0x1a/0x30
[ 247.070564] </TASK>
[ 247.070590] INFO: task qemu-img:1220 blocked for more than 122 seconds.
[ 247.070696] Not tainted 6.10.8 #1-NixOS
[ 247.070769] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 247.070886] task:qemu-img state:D stack:0 pid:1220 tgid:1220 ppid:1153 flags:0x00000002
[ 247.070898] Call Trace:
[ 247.070903] <TASK>
[ 247.070911] __schedule+0x3fa/0x1560
[ 247.070922] ? bch2_btree_iter_peek_upto+0x824/0xd30 [bcachefs]
[ 247.071095] schedule+0x27/0xf0
[ 247.071106] io_schedule+0x46/0x70
[ 247.071117] folio_wait_bit_common+0x13f/0x340
[ 247.071134] ? __pfx_wake_page_function+0x10/0x10
[ 247.071150] __filemap_get_folio+0x1fb/0x2d0
[ 247.071164] bch2_seek_pagecache_hole+0x50/0x1d0 [bcachefs]
[ 247.071365] bch2_seek_hole+0x290/0x390 [bcachefs]
[ 247.071566] bch2_llseek+0x1a/0x70 [bcachefs]
[ 247.071751] ksys_lseek+0x76/0xc0
[ 247.071765] do_syscall_64+0xb7/0x210
[ 247.071777] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 247.071788] RIP: 0033:0x7f34e028076b
[ 247.071827] RSP: 002b:00007f34df74ac08 EFLAGS: 00000246 ORIG_RAX: 0000000000000008
[ 247.071837] RAX: ffffffffffffffda RBX: 000000008414a000 RCX: 00007f34e028076b
[ 247.071843] RDX: 0000000000000004 RSI: 000000008414a000 RDI: 0000000000000006
[ 247.071849] RBP: 000055ac3df8ccd0 R08: 00007f34df74aec0 R09: 00007f34df74aca0
[ 247.071855] R10: 0000003200000000 R11: 0000000000000246 R12: 000000007ffffe00
[ 247.071861] R13: 00007f34df74aec0 R14: 00007f34df74aca0 R15: 000055ac3df941c0
[ 247.071879] </TASK>
✘ ⚡ root@nixos-host ~ bcachefs fs usage
Filesystem: c4b20f5d-75f5-4156-adda-7b42af4ad67b
Size: 1763095273984
Used: 507091441664
Online reserved: 102400
Data type Required/total Durability Devices
reserved: 1/0 [] 10198462464
btree: 1/1 1 [nvme0n1p2] 2645032960
user: 1/1 1 [nvme0n1p2] 492616028160
(no label) (device 0): nvme0n1p2 rw
data buckets fragmented
free: 1410106130432 5379128
sb: 3149824 13 258048
journal: 1469054976 5604
btree: 2645032960 10090
user: 492616028160 1915679 9567727616
cached: 0 0
parity: 0 0
stripe: 0 0
need_gc_gens: 0 0
need_discard: 524288 2
capacity: 1916407906304 7310516
✘ ⚡ root@nixos-host ~ bcachefs show-super /dev/nvme0n1p2
Device: (unknown device)
External UUID: c4b20f5d-75f5-4156-adda-7b42af4ad67b
Internal UUID: 0b5cbf3a-3c4c-4162-b73c-c7b11432ed91
Magic number: c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index: 0
Label: (none)
Version: 1.7: mi_btree_bitmap
Version upgrade complete: 1.7: mi_btree_bitmap
Oldest version on disk: 1.7: mi_btree_bitmap
Created: Sat Sep 7 15:44:03 2024
Sequence number: 135
Time of last write: Wed Sep 11 06:32:20 2024
Superblock size: 4.71 KiB/1.00 MiB
Clean: 0
Devices: 1
Sections: members_v1,crypt,replicas_v0,clean,journal_seq_blacklist,journal_v2,counters,members_v2,errors,ext,downgrade
Features: journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data,new_va
rint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features: alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done
Options:
block_size: 4.00 KiB
btree_node_size: 256 KiB
errors: continue [fix_safe] panic ro
metadata_replicas: 1
data_replicas: 1
metadata_replicas_required: 1
data_replicas_required: 1
encoded_extent_max: 64.0 KiB
metadata_checksum: none [crc32c] crc64 xxhash
data_checksum: none [crc32c] crc64 xxhash
compression: none
background_compression: none
str_hash: crc32c crc64 [siphash]
metadata_target: none
foreground_target: none
background_target: none
promote_target: none
erasure_code: 0
inodes_32bit: 1
shard_inode_numbers: 1
inodes_use_key_cache: 1
gc_reserve_percent: 8
gc_reserve_bytes: 0 B
root_reserve_percent: 0
wide_macs: 0
promote_whole_extents: 1
acl: 1
usrquota: 0
grpquota: 0
prjquota: 0
journal_flush_delay: 1000
journal_flush_disabled: 0
journal_reclaim_delay: 100
journal_transaction_names: 1
allocator_stuck_timeout: 30
version_upgrade: [compatible] incompatible none
nocow: 0
members_v2 (size 160):
Device: 0
Label: (none)
UUID: 62fc50fe-dbd4-4e9b-8bc8-d51a6bc2e54a
Size: 1.74 TiB
read errors: 0
write errors: 0
checksum errors: 0
seqread iops: 0
seqwrite iops: 0
randread iops: 0
randwrite iops: 0
Bucket size: 256 KiB
First bucket: 0
Buckets: 7310516
Last mount: Wed Sep 11 06:32:11 2024
Last superblock write: 135
State: rw
Data allowed: journal,btree,user
Has data: journal,btree,user
Btree allocated bitmap blocksize: 16.0 MiB
Btree allocated bitmap: 0000000000000000000000100000000000000000000000001110000100000111
Durability: 1
Discard: 0
Freespace initialized: 1
Stuck already twice when I run:
qemu-img convert -O qcow2 vm-101-disk-1.raw image-converted.qcow2
Standard NixOS kernel with ACS patch.
EDIT: Not just twice but probably everytime. fsck don't help.
EDIT 2: logs from 6.11.0-rc6: log2.txt
Did this get fixed?