bcachefs icon indicating copy to clipboard operation
bcachefs copied to clipboard

bcachefs as root file system is hanging indefinitely, unknown cause

Open koshell opened this issue 1 year ago • 6 comments

For context: I'm installing Arch on a tiny little laptop. I have only just gotten the thing booting into the distro (haven't even created a user account yet) and for some reason my root partition (that I formatted with bcachefs) appears to be causing issues.

Specifically reading files from the device appears okay, however writing anything to it appears to hang forever, this makes diagnosing the issue difficult. I'm not ruling out a hardware issue (laptop is a pile of garbage on the best of days) but I have confirmed that 'badblocks' doesn't see anything wrong with the underlying block device.

I was able to extract a dmesg log to a usb drive which I hope helps, it appears to be complaining about 'bch-copygc' but I really don't know how to interpret these errors so I'll just include the log in it's entirety: kernel.log

koshell avatar Jul 18 '24 03:07 koshell

I'm unsure what other information would be of value, if there is any other info that would be of value I can try to collect it.

koshell avatar Jul 20 '24 22:07 koshell

kernel version would be a starting point

koverstreet avatar Jul 21 '24 00:07 koverstreet

Linux version: 6.9.7-arch1-1

koshell avatar Jul 21 '24 03:07 koshell

This is one of the errors I saw in the log:

[  368.345299] INFO: task bch-copygc/dm-0:298 blocked for more than 122 seconds.
[  368.348641]       Not tainted 6.9.7-arch1-1 #1
[  368.352015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  368.353103] task:bch-copygc/dm-0 state:D stack:0     pid:298   tgid:298   ppid:2      flags:0x00004000
[  368.353115] Call Trace:
[  368.353118]  <TASK>
[  368.353122]  __schedule+0x3c7/0x1510
[  368.353136]  schedule+0x27/0xf0
[  368.353142]  __closure_sync+0x7e/0x140
[  368.353150]  __bch2_write+0x136b/0x1660 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.353316]  ? six_relock_ip+0x38/0x80 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.353459]  ? local_clock_noinstr+0xd/0xd0
[  368.353465]  ? __kmalloc+0x1a7/0x440
[  368.353475]  ? local_clock_noinstr+0xd/0xd0
[  368.353480]  ? local_clock+0x15/0x30
[  368.353487]  ? bch2_moving_ctxt_do_pending_writes+0x11a/0x220 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.353633]  bch2_moving_ctxt_do_pending_writes+0x11a/0x220 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.353780]  ? bch2_btree_path_traverse_one+0x958/0xcf0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.353908]  bch2_data_update_init+0x68b/0x1420 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354052]  ? bch2_move_extent+0x3da/0xed0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354197]  bch2_move_extent+0x3da/0xed0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354347]  ? bch2_evacuate_bucket+0x9d4/0xc00 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354492]  bch2_evacuate_bucket+0x9d4/0xc00 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354644]  ? bch2_copygc+0x210/0x880 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354816]  bch2_copygc+0x210/0x880 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.354972]  bch2_copygc_thread+0x152/0x3d0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.355105]  ? bch2_copygc_thread+0xcf/0x3d0 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.355238]  ? __pfx_bch2_copygc_thread+0x10/0x10 [bcachefs aa0637810d467c8e6cf072acf6a70476543ba202]
[  368.355367]  kthread+0xd2/0x100
[  368.355374]  ? __pfx_kthread+0x10/0x10
[  368.355380]  ret_from_fork+0x34/0x50
[  368.355386]  ? __pfx_kthread+0x10/0x10
[  368.355391]  ret_from_fork_asm+0x1a/0x30
[  368.355398]  </TASK>

I'll see if I can get it to mount the debugfs and get more info.

koshell avatar Jul 21 '24 03:07 koshell

That looks like copygc blocking on the allocator - doh.

Try 6.10; it dumps a bunch of info when the allocator get stucks.

koverstreet avatar Aug 09 '24 07:08 koverstreet

Probably related:

[  247.065254] INFO: task kworker/u65:0:114 blocked for more than 122 seconds.
[  247.065359]       Not tainted 6.10.8 #1-NixOS
[  247.065416] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.065505] task:kworker/u65:0   state:D stack:0     pid:114   tgid:114   ppid:2      flags:0x00004000
[  247.065518] Workqueue: events_unbound __bch2_read_endio [bcachefs]
[  247.065672] Call Trace:
[  247.065676]  <TASK>
[  247.065687]  __schedule+0x3fa/0x1560
[  247.065711]  ? __pfx_bch2_six_check_for_deadlock+0x10/0x10 [bcachefs]
[  247.065831]  schedule+0x27/0xf0
[  247.065842]  six_lock_slowpath.isra.0+0x207/0x4c0 [bcachefs]
[  247.065987]  bch2_btree_node_get+0x38a/0x4b0 [bcachefs]
[  247.066097]  ? __bch2_rbio_narrow_crcs+0x145/0x3c0 [bcachefs]
[  247.066269]  bch2_btree_path_traverse_one+0x759/0xd20 [bcachefs]
[  247.066424]  ? __bch2_rbio_narrow_crcs+0x145/0x3c0 [bcachefs]
[  247.066635]  ? bch2_btree_path_traverse_one+0xe5/0xd20 [bcachefs]
[  247.066795]  ? srso_return_thunk+0x5/0x5f
[  247.066806]  ? __bch2_btree_path_make_mut+0x42/0x2c0 [bcachefs]
[  247.066974]  bch2_btree_iter_peek_slot+0xef/0x750 [bcachefs]
[  247.067133]  ? srso_return_thunk+0x5/0x5f
[  247.067142]  ? btree_path_get_locks+0x67/0x210 [bcachefs]
[  247.067317]  ? srso_return_thunk+0x5/0x5f
[  247.067327]  ? bch2_path_get+0x385/0x3a0 [bcachefs]
[  247.067485]  ? __bch2_rbio_narrow_crcs+0x13d/0x3c0 [bcachefs]
[  247.067673]  ? __lruvec_stat_mod_folio+0x81/0xa0
[  247.067692]  __bch2_rbio_narrow_crcs+0x145/0x3c0 [bcachefs]
[  247.067885]  ? __bch2_time_stats_update+0xec/0x320 [bcachefs]
[  247.068074]  bch2_rbio_narrow_crcs+0x3c/0xc0 [bcachefs]
[  247.068234]  __bch2_read_endio+0x8c2/0x990 [bcachefs]
[  247.068415]  ? srso_return_thunk+0x5/0x5f
[  247.068422]  ? ttwu_queue_wakelist+0xd8/0x100
[  247.068433]  ? srso_return_thunk+0x5/0x5f
[  247.068438]  ? try_to_wake_up+0x208/0x660
[  247.068450]  ? process_one_work+0x192/0x3b0
[  247.068459]  ? __pfx___bch2_read_endio+0x10/0x10 [bcachefs]
[  247.068585]  process_one_work+0x192/0x3b0
[  247.068597]  worker_thread+0x244/0x350
[  247.068606]  ? __pfx_worker_thread+0x10/0x10
[  247.068615]  kthread+0xd0/0x100
[  247.068623]  ? __pfx_kthread+0x10/0x10
[  247.068630]  ret_from_fork+0x34/0x50
[  247.068638]  ? __pfx_kthread+0x10/0x10
[  247.068645]  ret_from_fork_asm+0x1a/0x30
[  247.068663]  </TASK>
[  247.068683] INFO: task kworker/u65:7:325 blocked for more than 122 seconds.
[  247.068759]       Not tainted 6.10.8 #1-NixOS
[  247.068812] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.068891] task:kworker/u65:7   state:D stack:0     pid:325   tgid:325   ppid:2      flags:0x00004000
[  247.068902] Workqueue: events_unbound __bch2_read_endio [bcachefs]
[  247.069032] Call Trace:
[  247.069036]  <TASK>
[  247.069042]  __schedule+0x3fa/0x1560
[  247.069058]  ? __pfx_bch2_six_check_for_deadlock+0x10/0x10 [bcachefs]
[  247.069164]  schedule+0x27/0xf0
[  247.069173]  six_lock_slowpath.isra.0+0x207/0x4c0 [bcachefs]
[  247.069315]  __bch2_btree_node_lock_write+0x93/0x120 [bcachefs]
[  247.069483]  bch2_trans_lock_write+0x20c/0x270 [bcachefs]
[  247.069653]  __bch2_trans_commit+0x6f2/0x1720 [bcachefs]
[  247.069834]  bch2_rbio_narrow_crcs+0x7b/0xc0 [bcachefs]
[  247.070024]  __bch2_read_endio+0x8c2/0x990 [bcachefs]
[  247.070227]  ? srso_return_thunk+0x5/0x5f
[  247.070236]  ? ttwu_queue_wakelist+0xd8/0x100
[  247.070248]  ? srso_return_thunk+0x5/0x5f
[  247.070256]  ? try_to_wake_up+0x208/0x660
[  247.070270]  ? process_one_work+0x192/0x3b0
[  247.070280]  ? __pfx___bch2_read_endio+0x10/0x10 [bcachefs]
[  247.070468]  process_one_work+0x192/0x3b0
[  247.070483]  worker_thread+0x244/0x350
[  247.070495]  ? __pfx_worker_thread+0x10/0x10
[  247.070506]  kthread+0xd0/0x100
[  247.070515]  ? __pfx_kthread+0x10/0x10
[  247.070525]  ret_from_fork+0x34/0x50
[  247.070533]  ? __pfx_kthread+0x10/0x10
[  247.070542]  ret_from_fork_asm+0x1a/0x30
[  247.070564]  </TASK>
[  247.070590] INFO: task qemu-img:1220 blocked for more than 122 seconds.
[  247.070696]       Not tainted 6.10.8 #1-NixOS
[  247.070769] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.070886] task:qemu-img        state:D stack:0     pid:1220  tgid:1220  ppid:1153   flags:0x00000002
[  247.070898] Call Trace:
[  247.070903]  <TASK>
[  247.070911]  __schedule+0x3fa/0x1560
[  247.070922]  ? bch2_btree_iter_peek_upto+0x824/0xd30 [bcachefs]
[  247.071095]  schedule+0x27/0xf0
[  247.071106]  io_schedule+0x46/0x70
[  247.071117]  folio_wait_bit_common+0x13f/0x340
[  247.071134]  ? __pfx_wake_page_function+0x10/0x10
[  247.071150]  __filemap_get_folio+0x1fb/0x2d0
[  247.071164]  bch2_seek_pagecache_hole+0x50/0x1d0 [bcachefs]
[  247.071365]  bch2_seek_hole+0x290/0x390 [bcachefs]
[  247.071566]  bch2_llseek+0x1a/0x70 [bcachefs]
[  247.071751]  ksys_lseek+0x76/0xc0
[  247.071765]  do_syscall_64+0xb7/0x210
[  247.071777]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  247.071788] RIP: 0033:0x7f34e028076b
[  247.071827] RSP: 002b:00007f34df74ac08 EFLAGS: 00000246 ORIG_RAX: 0000000000000008
[  247.071837] RAX: ffffffffffffffda RBX: 000000008414a000 RCX: 00007f34e028076b
[  247.071843] RDX: 0000000000000004 RSI: 000000008414a000 RDI: 0000000000000006
[  247.071849] RBP: 000055ac3df8ccd0 R08: 00007f34df74aec0 R09: 00007f34df74aca0
[  247.071855] R10: 0000003200000000 R11: 0000000000000246 R12: 000000007ffffe00
[  247.071861] R13: 00007f34df74aec0 R14: 00007f34df74aca0 R15: 000055ac3df941c0
[  247.071879]  </TASK>
 ✘ ⚡ root@nixos-host  ~  bcachefs fs usage
Filesystem: c4b20f5d-75f5-4156-adda-7b42af4ad67b
Size:                  1763095273984
Used:                   507091441664
Online reserved:              102400

Data type       Required/total  Durability    Devices
reserved:       1/0             [] 10198462464
btree:          1/1             1             [nvme0n1p2]       2645032960
user:           1/1             1             [nvme0n1p2]     492616028160

(no label) (device 0):     nvme0n1p2              rw
                                data         buckets    fragmented
  free:                1410106130432         5379128
  sb:                        3149824              13        258048
  journal:                1469054976            5604
  btree:                  2645032960           10090
  user:                 492616028160         1915679    9567727616
  cached:                          0               0
  parity:                          0               0
  stripe:                          0               0
  need_gc_gens:                    0               0
  need_discard:               524288               2
  capacity:            1916407906304         7310516
 ✘ ⚡ root@nixos-host  ~  bcachefs show-super /dev/nvme0n1p2
Device:                                     (unknown device)
External UUID:                             c4b20f5d-75f5-4156-adda-7b42af4ad67b
Internal UUID:                             0b5cbf3a-3c4c-4162-b73c-c7b11432ed91
Magic number:                              c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index:                              0
Label:                                     (none)
Version:                                   1.7: mi_btree_bitmap
Version upgrade complete:                  1.7: mi_btree_bitmap
Oldest version on disk:                    1.7: mi_btree_bitmap
Created:                                   Sat Sep  7 15:44:03 2024
Sequence number:                           135
Time of last write:                        Wed Sep 11 06:32:20 2024
Superblock size:                           4.71 KiB/1.00 MiB
Clean:                                     0
Devices:                                   1
Sections:                                  members_v1,crypt,replicas_v0,clean,journal_seq_blacklist,journal_v2,counters,members_v2,errors,ext,downgrade
Features:                                  journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data,new_va
rint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:                           alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options:
  block_size:                              4.00 KiB
  btree_node_size:                         256 KiB
  errors:                                  continue [fix_safe] panic ro
  metadata_replicas:                       1
  data_replicas:                           1
  metadata_replicas_required:              1
  data_replicas_required:                  1
  encoded_extent_max:                      64.0 KiB
  metadata_checksum:                       none [crc32c] crc64 xxhash
  data_checksum:                           none [crc32c] crc64 xxhash
  compression:                             none
  background_compression:                  none
  str_hash:                                crc32c crc64 [siphash]
  metadata_target:                         none
  foreground_target:                       none
  background_target:                       none
  promote_target:                          none
  erasure_code:                            0
  inodes_32bit:                            1
  shard_inode_numbers:                     1
  inodes_use_key_cache:                    1
  gc_reserve_percent:                      8
  gc_reserve_bytes:                        0 B
  root_reserve_percent:                    0
  wide_macs:                               0
  promote_whole_extents:                   1
  acl:                                     1
  usrquota:                                0
  grpquota:                                0
  prjquota:                                0
  journal_flush_delay:                     1000
  journal_flush_disabled:                  0
  journal_reclaim_delay:                   100
  journal_transaction_names:               1
  allocator_stuck_timeout:                 30
  version_upgrade:                         [compatible] incompatible none
  nocow:                                   0

members_v2 (size 160):
Device:                                    0
  Label:                                   (none)
  UUID:                                    62fc50fe-dbd4-4e9b-8bc8-d51a6bc2e54a
  Size:                                    1.74 TiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             256 KiB
  First bucket:                            0
  Buckets:                                 7310516
  Last mount:                              Wed Sep 11 06:32:11 2024
  Last superblock write:                   135
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                journal,btree,user
  Btree allocated bitmap blocksize:        16.0 MiB
  Btree allocated bitmap:                  0000000000000000000000100000000000000000000000001110000100000111
  Durability:                              1
  Discard:                                 0
  Freespace initialized:                   1

Stuck already twice when I run: qemu-img convert -O qcow2 vm-101-disk-1.raw image-converted.qcow2

Standard NixOS kernel with ACS patch.

EDIT: Not just twice but probably everytime. fsck don't help.

EDIT 2: logs from 6.11.0-rc6: log2.txt

pio2398 avatar Sep 11 '24 06:09 pio2398

Did this get fixed?

koverstreet avatar Aug 02 '25 00:08 koverstreet