bcachefs
bcachefs copied to clipboard
watchdog: BUG: soft lockup - CPU#2 stuck for 45s! [kworker/2:1:3036
[ 2488.235847] watchdog: BUG: soft lockup - CPU#2 stuck for 45s! [kworker/2:1:3036]
[ 2488.243461] Kernel panic - not syncing: softlockup: hung tasks
[ 2488.249350] CPU: 2 PID: 3036 Comm: kworker/2:1 Tainted: G L 5.15.0-g01b9c0b2147b+ #19
[ 2488.258568] Hardware name: System manufacturer System Product Name/P9X79 LE, BIOS 4801 07/24/2014
[ 2488.267520] Workqueue: events_long bch2_subvolume_wait_for_pagecache_and_delete
[ 2488.274936] Call Trace:
[ 2488.277456] <IRQ>
[ 2488.279533] dump_stack_lvl+0x34/0x44
[ 2488.283272] panic+0xe8/0x2ae
[ 2488.286306] ? ret_from_fork+0x2d/0x30
[ 2488.290120] watchdog_timer_fn.cold+0xc/0x16
[ 2488.294479] ? lockup_detector_update_enable+0x50/0x50
[ 2488.299716] __hrtimer_run_queues+0x125/0x270
[ 2488.304135] hrtimer_interrupt+0xf7/0x210
[ 2488.308226] __sysvec_apic_timer_interrupt+0x57/0xd0
[ 2488.313280] sysvec_apic_timer_interrupt+0x65/0x90
[ 2488.318185] </IRQ>
[ 2488.320347] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 2488.325576] RIP: 0010:queued_spin_lock_slowpath+0x3b/0x190
[ 2488.331149] Code: 5a f0 0f ba 2a 08 0f 92 c1 8b 02 0f b6 c9 c1 e1 08 30 e4 09 c8 a9 00 01 ff ff 0f 85 ed 00 00 00 85 c0 74 0e 8b 02 84 c0 74 08
kernel: 5.15.0-g01b9c0b2147b+ tools: v0.1-478-gbf7924f+
I was doing this before panic
/usr/src/bcachefs-tools/bcachefs subvolume delete /mnt/bcachefs/t/test /mnt/bcachefs/t/.snap/*
More info. I'm doing this:
copy in some files: copy file: time rsync -a -c /mnt/data_disk/linux-5.15.1/ /mnt/bcachefs/t/test/linux --delete --stats
While this is in progress take lots of snapshpts for count in {1..100} do time /usr/src/bcachefs-tools/bcachefs subvolume snapshot -r /mnt/bcachefs/t/test /mnt/bcachefs/t/.snap/$count sleep 1 done
I got 81 snapshots..... No faults.
Then delete the snapshots /usr/src/bcachefs-tools/bcachefs subvolume delete /mnt/bcachefs/t/test /mnt/bcachefs/t/.snap/*
and we get....
[ 1289.988507] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1289.988516] rcu: 7-....: (20999 ticks this GP) idle=05f/1/0x4000000000000000 softirq=13388/13388 fqs=5249
[ 1289.988526] (t=21000 jiffies g=45397 q=217)
[ 1289.988530] NMI backtrace for cpu 7
[ 1289.988532] CPU: 7 PID: 103 Comm: kworker/7:1 Not tainted 5.15.0-g01b9c0b2147b+ #19
[ 1289.988537] Hardware name: System manufacturer System Product Name/P9X79 LE, BIOS 4801 07/24/2014
[ 1289.988540] Workqueue: events_long bch2_subvolume_wait_for_pagecache_and_delete
[ 1289.988552] Call Trace:
[ 1289.988557] <IRQ>
[ 1289.988560] dump_stack_lvl+0x34/0x44
[ 1289.988567] nmi_cpu_backtrace.cold+0x30/0x70
[ 1289.988571] ? lapic_can_unplug_cpu+0x70/0x70
[ 1289.988579] nmi_trigger_cpumask_backtrace+0x80/0x90
[ 1289.988586] rcu_dump_cpu_stacks+0xb1/0xdf
[ 1289.988593] rcu_sched_clock_irq.cold+0xe8/0x220
[ 1289.988598] ? timekeeping_advance+0x373/0x550
[ 1289.988607] update_process_times+0x87/0xc0
[ 1289.988612] tick_sched_handle+0x1d/0x50
[ 1289.988620] tick_sched_timer+0x75/0xa0
[ 1289.988626] ? can_stop_idle_tick+0x90/0x90
[ 1289.988633] __hrtimer_run_queues+0x125/0x270
[ 1289.988639] hrtimer_interrupt+0xf7/0x210
[ 1289.988644] __sysvec_apic_timer_interrupt+0x57/0xd0
[ 1289.988650] sysvec_apic_timer_interrupt+0x65/0x90
[ 1289.988658] </IRQ>
[ 1289.988660] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1289.988668] RIP: 0010:queued_spin_lock_slowpath+0x3b/0x190
[ 1289.988678] Code: 5a f0 0f ba 2a 08 0f 92 c1 8b 02 0f b6 c9 c1 e1 08 30 e4 09 c8 a9 00 01 ff ff 0f 85 ed 00 00 00 85 c0 74 0e 8b 02 84 c0 74 08
One big comment.... I just reviewed the kernel configurations settings as detailed in the new testing doc (bug_report.md
I have been testing with : Voluntary Kernel Preemption (Desktop)" instead of Preemptible Kernel (Low-Latency Desktop)
This means that CONFIG_PREEMPT was not set in the past. This may account some of the issues I had.
PREEMPT
should help with debugging a soft-lock, since you can still run userspace apps to inspect the kernel stack etc.
It shouldn't fix any issues though, as most people will be using voluntary once this is upstreamed.
Still present on: 5.16.0-g020a223efab9+ bcachefs tool version v0.1-495-g59abea5+
[Wed Mar 30 15:21:35 2022] rcu: INFO: rcu_preempt self-detected stall on CPU [Wed Mar 30 15:21:35 2022] rcu: 1-....: (20999 ticks this GP) idle=33b/1/0x4000000000000000 softirq=26583/26583 fqs=5249 [Wed Mar 30 15:21:35 2022] (t=21000 jiffies g=170081 q=171) [Wed Mar 30 15:21:35 2022] NMI backtrace for cpu 1 [Wed Mar 30 15:21:35 2022] CPU: 1 PID: 86 Comm: kworker/1:1 Not tainted 5.16.0-g020a223efab9+ #28 [Wed Mar 30 15:21:35 2022] Hardware name: System manufacturer System Product Name/P9X79 LE, BIOS 4801 07/24/2014 [Wed Mar 30 15:21:35 2022] Workqueue: events_long bch2_subvolume_wait_for_pagecache_and_delete [Wed Mar 30 15:21:35 2022] Call Trace: [Wed Mar 30 15:21:35 2022] <IRQ> [Wed Mar 30 15:21:35 2022] dump_stack_lvl+0x34/0x44 [Wed Mar 30 15:21:35 2022] nmi_cpu_backtrace.cold+0x30/0x70 [Wed Mar 30 15:21:35 2022] ? lapic_can_unplug_cpu+0x80/0x80 [Wed Mar 30 15:21:35 2022] nmi_trigger_cpumask_backtrace+0xcb/0xe0 [Wed Mar 30 15:21:35 2022] rcu_dump_cpu_stacks+0xc8/0xf0 [Wed Mar 30 15:21:35 2022] rcu_sched_clock_irq.cold+0x60/0x2f2 [Wed Mar 30 15:21:35 2022] ? update_load_avg+0x5f/0x550 [Wed Mar 30 15:21:35 2022] ? timekeeping_update+0xab/0x280 [Wed Mar 30 15:21:35 2022] ? _raw_spin_unlock_irqrestore+0x16/0x30 [Wed Mar 30 15:21:35 2022] update_process_times+0x8e/0xc0 [Wed Mar 30 15:21:35 2022] tick_sched_handle+0x1d/0x50 [Wed Mar 30 15:21:35 2022] tick_sched_timer+0x7f/0xa0 [Wed Mar 30 15:21:35 2022] ? can_stop_idle_tick+0xd0/0xd0 [Wed Mar 30 15:21:35 2022] __hrtimer_run_queues+0x125/0x2c0 [Wed Mar 30 15:21:35 2022] hrtimer_interrupt+0x101/0x210 [Wed Mar 30 15:21:35 2022] __sysvec_apic_timer_interrupt+0x7a/0x160 [Wed Mar 30 15:21:35 2022] sysvec_apic_timer_interrupt+0x89/0xc0 [Wed Mar 30 15:21:35 2022] </IRQ> [Wed Mar 30 15:21:35 2022] <TASK> [Wed Mar 30 15:21:35 2022] asm_sysvec_apic_timer_interrupt+0x12/0x20 [Wed Mar 30 15:21:35 2022] RIP: 0010:queued_spin_lock_slowpath+0x43/0x1d0 [Wed Mar 30 15:21:35 2022] Code: 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 a9 00 01 ff ff 0f 85 fc 00 00 00 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 8b 03 <84> c0 75 f8 b8 01 00 00 00 66 89 03 5b 5d 41 5c c3 8b 37 b8 00 02 [Wed Mar 30 15:21:35 2022] RSP: 0018:ffffa1ae8446bc00 EFLAGS: 00000202 [Wed Mar 30 15:21:35 2022] RAX: 0000000000000101 RBX: ffff99a240a5b500 RCX: 0000000000000000 [Wed Mar 30 15:21:35 2022] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff99a240a5b500 [Wed Mar 30 15:21:35 2022] RBP: ffff99a27b18f278 R08: ffff99a240054ab0 R09: ffff99a240a2d13c [Wed Mar 30 15:21:35 2022] R10: 0000000000000018 R11: 0000000000000018 R12: ffffffff81e7d280 [Wed Mar 30 15:21:35 2022] R13: ffff99a27b18f300 R14: ffff99a27bb95540 R15: ffff99a27bb95598 [Wed Mar 30 15:21:35 2022] evict+0x6e/0x160 [Wed Mar 30 15:21:35 2022] __dentry_kill+0xd1/0x170 [Wed Mar 30 15:21:35 2022] d_prune_aliases+0x46/0x110 [Wed Mar 30 15:21:35 2022] bch2_evict_subvolume_inodes+0x97/0x270 [Wed Mar 30 15:21:35 2022] ? load_balance+0x177/0xbb0 [Wed Mar 30 15:21:35 2022] bch2_subvolume_wait_for_pagecache_and_delete+0x84/0x1c0 [Wed Mar 30 15:21:35 2022] ? refresh_cpu_vm_stats+0x249/0x290 [Wed Mar 30 15:21:35 2022] ? update_load_avg+0x5f/0x550 [Wed Mar 30 15:21:35 2022] ? update_load_avg+0x5f/0x550 [Wed Mar 30 15:21:35 2022] ? cpuacct_charge+0x56/0x90 [Wed Mar 30 15:21:35 2022] ? xfd_validate_state+0x19/0x70 [Wed Mar 30 15:21:35 2022] ? _raw_spin_unlock+0xd/0x20 [Wed Mar 30 15:21:35 2022] ? finish_task_switch.isra.0+0xb2/0x260 [Wed Mar 30 15:21:35 2022] process_one_work+0x1d6/0x3a0 [Wed Mar 30 15:21:35 2022] worker_thread+0x48/0x3d0 [Wed Mar 30 15:21:35 2022] ? rescuer_thread+0x390/0x390 [Wed Mar 30 15:21:35 2022] kthread+0x166/0x190 [Wed Mar 30 15:21:35 2022] ? set_kthread_struct+0x40/0x40 [Wed Mar 30 15:21:35 2022] ret_from_fork+0x22/0x30 [Wed Mar 30 15:21:35 2022] </TASK> test1 ~ # [ 1716.210112] watchdog: BUG: soft lockup - CPU#1 stuck for 45s! [kworker/1:1:86] [ 1716.217517] Kernel panic - not syncing: softlockup: hung tasks [ 1716.223388] CPU: 1 PID: 86 Comm: kworker/1:1 Tainted: G L 5.16.0-g020a223efab9+ #28 [ 1716.232388] Hardware name: System manufacturer System Product Name/P9X79 LE, BIOS 4801 07/24/2014 [ 1716.241297] Workqueue: events_long bch2_subvolume_wait_for_pagecache_and_delete [ 1716.248668] Call Trace: [ 1716.251165] <IRQ> [ 1716.253235] dump_stack_lvl+0x34/0x44 [ 1716.256954] panic+0xef/0x2b5 [ 1716.259980] watchdog_timer_fn.cold+0xc/0x16 [ 1716.264305] ? softlockup_fn+0x60/0x60 [ 1716.268107] __hrtimer_run_queues+0x125/0x2c0 [ 1716.272520] hrtimer_interrupt+0x101/0x210 [ 1716.276672] __sysvec_apic_timer_interrupt+0x7a/0x160 [ 1716.281793] sysvec_apic_timer_interrupt+0x89/0xc0 [ 1716.286650] </IRQ> [ 1716.288802] <TASK> [ 1716.290952] asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 1716.296146] RIP: 0010:queued_spin_lock_slowpath+0x41/0x1d0 [ 1716.301686] Code: 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 a9 00 01 ff ff 0f 85 fc 00 00 00 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 <8b> 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 5b 5d 41 5c c3 8b 37 b8 [ 1716.320528] RSP: 0018:ffffa1ae8446bc00 EFLAGS: 00000202 [ 1716.325813] RAX: 0000000000000101 RBX: ffff99a240a5b500 RCX: 0000000000000000 [ 1716.333002] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff99a240a5b500 [ 1716.340186] RBP: ffff99a27b18f278 R08: ffff99a240054ab0 R09: ffff99a240a2d13c [ 1716.347370] R10: 0000000000000018 R11: 0000000000000018 R12: ffffffff81e7d280 [ 1716.354555] R13: ffff99a27b18f300 R14: ffff99a27bb95540 R15: ffff99a27bb95598 [ 1716.361743] evict+0x6e/0x160 [ 1716.364768] __dentry_kill+0xd1/0x170 [ 1716.368488] d_prune_aliases+0x46/0x110 [ 1716.372379] bch2_evict_subvolume_inodes+0x97/0x270 [ 1716.377311] ? load_balance+0x177/0xbb0 [ 1716.381201] bch2_subvolume_wait_for_pagecache_and_delete+0x84/0x1c0 [ 1716.387610] ? refresh_cpu_vm_stats+0x249/0x290 [ 1716.392200] ? update_load_avg+0x5f/0x550 [ 1716.396264] ? update_load_avg+0x5f/0x550 [ 1716.400328] ? cpuacct_charge+0x56/0x90 [ 1716.404220] ? xfd_validate_state+0x19/0x70 [ 1716.408459] ? _raw_spin_unlock+0xd/0x20 [ 1716.412435] ? finish_task_switch.isra.0+0xb2/0x260 [ 1716.417378] process_one_work+0x1d6/0x3a0 [ 1716.421441] worker_thread+0x48/0x3d0 [ 1716.425157] ? rescuer_thread+0x390/0x390 [ 1716.429222] kthread+0x166/0x190 [ 1716.432514] ? set_kthread_struct+0x40/0x40 [ 1716.436754] ret_from_fork+0x22/0x30 [ 1716.440394] </TASK> [ 1716.442644] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 1716.452881] ---[ end Kernel panic - not syncing: softlockup: hung tasks ]---
Yeah this is a known bug - I'm not going to be able to get to snapshots-related bugs for the next month or so, focusing on getting backpointers done right now