bcachefs
Frequent reboots broke the filesystem
It seems that frequent rebooting (via sudo reboot, because I am playing around with vfio and it frequently puts the card into a Linux-non-resettable state) managed to break a raid0 (native bcachefs raid) nvme bcachefs root.
The bcachefs setup is:
/dev/nvme0n1:/dev/nvme1n1p2
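For context, a multi-device bcachefs is addressed by joining the member devices with a colon; the corresponding fstab entry would look roughly like this (a sketch based on the device names above; mount options are assumed defaults):

```
# /etc/fstab (sketch) — colon-joined device list for the bcachefs raid0 root
/dev/nvme0n1:/dev/nvme1n1p2  /  bcachefs  defaults  0  0
```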
Can't tell exactly how many files are missing.
Pacman says the following:
[root@rig vytautas]# pacman -Q --check | grep -v '0 missing'
warning: filesystem: /etc/group (No such file or directory)
warning: filesystem: /etc/gshadow (No such file or directory)
filesystem: 118 total files, 2 missing files
warning: okteta: /usr/include/KastenControllers/Kasten/ModifiedBarControllerFactory (No such file or directory)
warning: okteta: /usr/include/KastenControllers/Kasten/QuitControllerFactory (No such file or directory)
warning: okteta: /usr/include/KastenControllers/Kasten/ReadOnlyBarControllerFactory (No such file or directory)
warning: okteta: /usr/include/KastenControllers/Kasten/ReadOnlyControllerFactory (No such file or directory)
warning: okteta: /usr/include/KastenControllers/Kasten/SelectControllerFactory (No such file or directory)
warning: okteta: /usr/include/KastenControllers/Kasten/SetRemoteControllerFactory (No such file or directory)
okteta: 1115 total files, 6 missing files
warning: python-cycler: /usr/share/licenses/python-cycler/ (No such file or directory)
warning: python-cycler: /usr/share/licenses/python-cycler/LICENSE (No such file or directory)
python-cycler: 18 total files, 2 missing files
error: could not open file /var/lib/pacman/local/spice-protocol-0.14.2-1/files: No such file or directory
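A small helper (hypothetical, not from the original report) could pull the affected package names out of output like the above, so they can be reinstalled once the filesystem is stable again:

```shell
#!/bin/sh
# broken_pkgs: read `pacman -Qk`-style summary lines on stdin and print
# the names of packages that have at least one missing file.
broken_pkgs() {
    grep ' total files, ' | grep -v ' 0 missing files' | cut -d: -f1
}

# Hypothetical usage (needs pacman, run as root):
#   pacman -Qk 2>/dev/null | broken_pkgs | xargs -r pacman -S --noconfirm
```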
bash: __prompt: command not found
Notably, yes, /etc/group and /etc/gshadow are broken; however, their metadata still exists:
-????????? ? ? ? ? ? group
-rw-r--r-- 1 root root 1.2K Dec 1 18:44 group-
-????????? ? ? ? ? ? gshadow
-rw------- 1 root root 1002 Dec 1 18:44 gshadow-
Attempting to delete those files yields an error:
[root@rig etc]# rm group
rm: cannot remove 'group': No such file or directory
Otherwise it would be an easy fix to restore from the group- and gshadow- backups.
Oddly, I did not change any groups during the last good boot, so the metadata/data was corrupted fairly randomly, which does not make me happy.
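Once fsck clears the broken dentries, those trailing-dash backup files that shadow-utils keeps should be restorable. A minimal sketch, assuming only that the group-/gshadow- copies are intact (the restore_backups helper is mine, not from the report):

```shell
#!/bin/sh
# restore_backups: copy the trailing-dash backup files that shadow-utils
# keeps (group-, gshadow-) back over the originals in the given etc dir.
restore_backups() {
    etc="$1"
    for f in group gshadow; do
        if [ -f "$etc/$f-" ]; then
            cp -a "$etc/$f-" "$etc/$f"
        fi
    done
}

# Intended usage once the broken dentries are deletable again:
#   restore_backups /etc
```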
Many random files are in this zombie state:
[root@rig ~]# pwd
/root
[root@rig ~]# ls bin
pcie_hot_reset.sh reset_amdgpu.sh unbind_fb
[root@rig ~]# cat bin/*
cat: bin/pcie_hot_reset.sh: No such file or directory
cat: bin/reset_amdgpu.sh: No such file or directory
#!/bin/bash
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind
VIRSH_GPU="pci_0000_0a_00_0"
VIRSH_GPU_AUDIO="pci_0000_0a_00_1"
virsh nodedev-detach "$VIRSH_GPU"
virsh nodedev-detach "$VIRSH_GPU_AUDIO"
## Load vfio.
modprobe vfio-pci
The systemd journal is gone as well.
I guess this issue report is borderline hot garbage, but I'd imagine bcachefs could have a reboot bug. I highly doubt I managed to brick the files by force-resetting the machine (which I didn't do until the machine no longer brought up the eth interface and was unreachable over lan). To get the machine to boot correctly, I used an archlinux usb stick with a newly compiled linux-bcachefs-git kernel build (which presumably just takes the very newest bcachefs git commit); there were 2 erroring files during mounting, which I didn't record.
To actually ask a question: why is the filesystem not able to handle "OK metadata, gone data"? And how is "No such file or directory" possible if the metadata is replicated across the two drives?
Does anyone care to comment?
I tried running bcachefs fsck: https://gist.github.com/Eximius/5f2a70b00d61a0f8d1fb8b11dd6068d7
It doesn't make sense that the metadata is this messed up, even if one of the nvme drives had a mid-btree-rebuild failure under a force reset.
As another comment, it would be nice if fsck showed more than just the basename of the file when it reaches a broken entry. :)
So... it (or something similar) happened again. After a graceful reboot (the gpu had frozen up), the machine was no longer reachable after booting up. Booting from a usb stick gave the following:
[root@archusb tmp]# mount /dev/nvme0n1:/dev/nvme1n1p2 -t bcachefs R/
mount: /tmp/R: can't read superblock on /dev/nvme0n1:/dev/nvme1n1p2.
[ 474.913837] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): duplicate journal entries on same device, exiting
[ 474.917014] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): Unable to continue, halting
[ 474.975664] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): nvme1n1p2 sector 975559 seq 4399496: journal checksum bad, exiting
[ 474.981790] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): Unable to continue, halting
[ 474.988344] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): Error in recovery: cannot allocate memory (1)
[ 474.994286] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): filesystem contains errors: please report this to the developers
So would you say the claim "The COW filesystem for Linux that won't eat your data" is destroyed?
That is still in future tense though :)
I have no idea what's happening here. I just had a minor issue again that needed a live usb. I didn't write it down, but it was "journal replication flag poorly set". Reading through the whole journal (of the on-ssd-raid linux) generated some checksum errors in the middle of the journal.
While the earlier corruption can, I guess, be attributed to force-resets, the very last failure seems to have happened because of a cpu freeze.
Does a bcachefs version mismatch between the live usb kernel and the on-disk kernel have an effect? (I.e., am I just creating the problem by mounting from the live usb?)
ssd-raid linux: Linux rig 5.10.5-arch1-1-bcachefs-git-307302-gfcf8a0889c12
live-usb linux: Linux 5.10.4-arch2-1-bcachefs-git-307298-g91e7a706fd4f
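For anyone comparing environments, the two versions can be checked with something like this (a sketch; the bcachefs version subcommand assumes bcachefs-tools is installed):

```shell
#!/bin/sh
# Print the running kernel release (which embeds the bcachefs-git commit
# on these builds) and, if available, the userspace tools version.
uname -r
command -v bcachefs >/dev/null 2>&1 && bcachefs version
```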
The last lines of the journal before the pc went AWOL:
Jan 15 23:57:01 rig kernel: Code: 0d a4 95 19 58 74 01 c3 e8 73 60 58 ff c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57>
Jan 15 23:57:01 rig kernel: RSP: 0018:ffffa2188052ce98 EFLAGS: 00000246
Jan 15 23:57:01 rig kernel: RAX: 0000000000000001 RBX: ffff90f98a848808 RCX: 0000000000000000
Jan 15 23:57:01 rig kernel: RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
Jan 15 23:57:01 rig kernel: RBP: ffff90f98a848f28 R08: 0000000000000000 R09: ffffa2188052cc60
Jan 15 23:57:01 rig kernel: R10: 00000000016828d8 R11: 00000000016829d0 R12: 00000000ffffffff
Jan 15 23:57:01 rig kernel: R13: ffff90f98a848808 R14: ffff90f98a848f28 R15: ffff91182eb1cf80
Jan 15 23:57:01 rig kernel: FS: 0000000000000000(0000) GS:ffff91182eb00000(0000) knlGS:0000000000000000
Jan 15 23:57:01 rig kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 15 23:57:01 rig kernel: CR2: 0000559836336868 CR3: 0000001ec8810000 CR4: 0000000000350ee0
Jan 15 23:57:01 rig kernel: Call Trace:
Jan 15 23:57:01 rig kernel: <IRQ>
Jan 15 23:57:01 rig kernel: iova_domain_flush+0x1a/0x30
Jan 15 23:57:01 rig kernel: fq_flush_timeout+0x2e/0xa0
Jan 15 23:57:01 rig kernel: ? fq_ring_free+0xf0/0xf0
Jan 15 23:57:01 rig kernel: ? fq_ring_free+0xf0/0xf0
Jan 15 23:57:01 rig kernel: call_timer_fn+0x29/0x130
Jan 15 23:57:01 rig kernel: __run_timers+0x1eb/0x270
Jan 15 23:57:01 rig kernel: run_timer_softirq+0x19/0x30
Jan 15 23:57:01 rig kernel: __do_softirq+0xc8/0x2b5
Jan 15 23:57:01 rig kernel: asm_call_irq_on_stack+0xf/0x20
Jan 15 23:57:01 rig kernel: </IRQ>
Jan 15 23:57:01 rig kernel: do_softirq_own_stack+0x37/0x40
Jan 15 23:57:01 rig kernel: irq_exit_rcu+0x9c/0xd0
Jan 15 23:57:01 rig kernel: sysvec_apic_timer_interrupt+0x36/0x80
Jan 15 23:57:01 rig kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Jan 15 23:57:01 rig kernel: RIP: 0010:cpuidle_enter_state+0xc0/0x360
Jan 15 23:57:01 rig kernel: Code: 3d b5 d8 40 58 e8 00 23 8c ff 49 89 c5 0f 1f 44 00 00 31 ff e8 11 30 8c ff 41 83 e7 01 0f 85 cc 01 00 00 fb 66 0f 1f 44 00>
Jan 15 23:57:01 rig kernel: RSP: 0018:ffffa218801bfea8 EFLAGS: 00000246
Jan 15 23:57:01 rig kernel: RAX: ffff91182eb2c180 RBX: ffff90f98b8c9c00 RCX: 000000000000001f
Jan 15 23:57:01 rig kernel: RDX: 0000000000000000 RSI: 0000000024a3dfd3 RDI: 0000000000000000
Jan 15 23:57:01 rig kernel: RBP: 0000000000000002 R08: 0000c2952db70241 R09: 0000c294fd8959a8
Jan 15 23:57:01 rig kernel: R10: 0000000000000344 R11: 0000000000000924 R12: ffffffffa8f48540
Jan 15 23:57:01 rig kernel: R13: 0000c2952db70241 R14: 0000000000000002 R15: 0000000000000000
Jan 15 23:57:01 rig kernel: ? cpuidle_enter_state+0xaf/0x360
Jan 15 23:57:01 rig kernel: cpuidle_enter+0x29/0x40
Jan 15 23:57:01 rig kernel: do_idle+0x1e3/0x280
Jan 15 23:57:01 rig kernel: cpu_startup_entry+0x19/0x20
Jan 15 23:57:01 rig kernel: secondary_startup_64_no_verify+0xc2/0xcb
Jan 15 23:57:01 rig kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x1001f96e0]
Jan 15 23:57:01 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:02 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:02 rig kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [libx2go-server-:1822429]
Jan 15 23:57:02 rig kernel: Modules linked in: tcp_diag inet_diag unix_diag xt_nat vhost_net vhost vhost_iotlb tap nf_conntrack_netlink nfnetlink xt_addrtyp>
Jan 15 23:57:02 rig kernel: eeepc_wmi r8169 cec asus_wmi snd_timer ccp syscopyarea realtek crct10dif_pclmul snd mdio_devres ghash_clmulni_intel sparse_keym>
Jan 15 23:57:02 rig kernel: CPU: 5 PID: 1822429 Comm: libx2go-server- Tainted: G OEL 5.10.5-arch1-1-bcachefs-git-307302-gfcf8a0889c12 #3
Jan 15 23:57:02 rig kernel: Hardware name: ASUS System Product Name/TUF GAMING B550M-PLUS (WI-FI), BIOS 0803 06/30/2020
Jan 15 23:57:02 rig kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x30
Jan 15 23:57:02 rig kernel: Code: 0d a4 95 19 58 74 01 c3 e8 73 60 58 ff c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57>
Jan 15 23:57:02 rig kernel: RSP: 0000:ffffa21880388e98 EFLAGS: 00000246
Jan 15 23:57:02 rig kernel: RAX: 0000000000000001 RBX: ffff90f98a789808 RCX: 0000000000000000
Jan 15 23:57:02 rig kernel: RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
Jan 15 23:57:02 rig kernel: RBP: ffff90f98a789f28 R08: 0000000000000000 R09: ffffa21880388c60
Jan 15 23:57:02 rig kernel: R10: 0000000001685648 R11: 0000000001685738 R12: 00000000ffffffff
Jan 15 23:57:02 rig kernel: R13: ffff90f98a789808 R14: ffff90f98a789f28 R15: ffff91182e95cf80
Jan 15 23:57:02 rig kernel: FS: 00007f0f0a2da740(0000) GS:ffff91182e940000(0000) knlGS:0000000000000000
Jan 15 23:57:02 rig kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 15 23:57:02 rig kernel: CR2: 0000559fbb91d098 CR3: 0000000e37312000 CR4: 0000000000350ee0
Jan 15 23:57:02 rig kernel: Call Trace:
Jan 15 23:57:02 rig kernel: <IRQ>
Jan 15 23:57:02 rig kernel: iova_domain_flush+0x1a/0x30
Jan 15 23:57:02 rig kernel: fq_flush_timeout+0x2e/0xa0
Jan 15 23:57:02 rig kernel: ? fq_ring_free+0xf0/0xf0
Jan 15 23:57:02 rig kernel: ? fq_ring_free+0xf0/0xf0
Jan 15 23:57:02 rig kernel: call_timer_fn+0x29/0x130
Jan 15 23:57:02 rig kernel: __run_timers+0x1eb/0x270
Jan 15 23:57:02 rig kernel: run_timer_softirq+0x19/0x30
Jan 15 23:57:02 rig kernel: __do_softirq+0xc8/0x2b5
Jan 15 23:57:02 rig kernel: asm_call_irq_on_stack+0xf/0x20
Jan 15 23:57:02 rig kernel: </IRQ>
Jan 15 23:57:02 rig kernel: do_softirq_own_stack+0x37/0x40
Jan 15 23:57:02 rig kernel: irq_exit_rcu+0x9c/0xd0
Jan 15 23:57:02 rig kernel: sysvec_apic_timer_interrupt+0x36/0x80
Jan 15 23:57:02 rig kernel: ? asm_sysvec_apic_timer_interrupt+0xa/0x20
Jan 15 23:57:02 rig kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Jan 15 23:57:02 rig kernel: RIP: 0033:0x7f0f0a720af1
Jan 15 23:57:02 rig kernel: Code: 05 00 00 0f b6 45 22 83 e0 b3 3c 22 0f 84 93 06 00 00 48 85 db 0f 85 05 01 00 00 0f b7 45 20 89 c2 66 81 e2 ff 01 66 83 fa>
Jan 15 23:57:02 rig kernel: RSP: 002b:00007ffe2e527230 EFLAGS: 00000297
Jan 15 23:57:02 rig kernel: RAX: 0000000000000609 RBX: 0000559fbb91d710 RCX: 0000000000004496
Jan 15 23:57:02 rig kernel: RDX: 0000000000000009 RSI: 0000559fbb914c88 RDI: 0000559fbb5442a0
Jan 15 23:57:02 rig kernel: RBP: 0000559fbb914c88 R08: 0000559fbb544a40 R09: 0000000000000001
Jan 15 23:57:02 rig kernel: R10: 0000559fbb914ba8 R11: 0000000000000040 R12: 0000000000000000
Jan 15 23:57:02 rig kernel: R13: 00007f0f0a88cc38 R14: 0000559fbb5442a0 R15: 0000559fbb914c50
...
Jan 15 23:57:06 rig kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Jan 15 23:57:06 rig kernel: rcu: 5-....: (1 GPs behind) idle=fc2/1/0x4000000000000000 softirq=7328538/7328539 fqs=5515 last_accelerate: fcbf/4344 dy>
Jan 15 23:57:06 rig kernel: rcu: 12-....: (1 GPs behind) idle=56a/1/0x4000000000000002 softirq=7574158/7574159 fqs=5515 last_accelerate: fcc9/4344 d>
Jan 15 23:57:06 rig kernel: (detected by 11, t=18002 jiffies, g=14072169, q=74468)
Jan 15 23:57:06 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:06 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:06 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:06 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:06 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:06 rig kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x1001f99b0]
Jan 15 23:57:07 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:07 rig kernel: AMD-Vi: Completion-Wait loop timed out
Ha! The virtualization locked up the machine again. After 2 force-resets and booting into the live usb, I have:
bcachefs: duplicate journal entries on same device, fixing
bcachefs: superblock not marked as containing replicas journal: 1/1 [1], fixing
Do you have an idea which steps are needed to reproduce it?
I have one nvme disk as the root fs and rebooted 15 times with no issues. Maybe it happens only with raid?
The journal replication issue is likely to occur when you have a raid0 in bcachefs.
bcachefs setup:
External UUID: 379774e3-3bc1-437f-81ca-970b92d7846b
Internal UUID: 3ad677e8-9ab0-4a84-b405-9786dc60b160
Label:
Version: 11
Created: Mon Nov 23 02:03:52 2020
Sequence number: 376
Block_size: 512
Btree node size: 256.0K
Error action: remount-ro
Clean: 0
Features: atomic_nlink,journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data
Metadata replicas: 2
Data replicas: 1
Metadata checksum type: crc32c (1)
Data checksum type: crc32c (1)
Compression type: none (0)
Foreground write target: none
Background write target: none
Promote target: none
String hash type: siphash (2)
32 bit inodes: 0
GC reserve percentage: 8%
Root reserve percentage: 0%
Devices: 2 live, 2 total
Sections: journal,members,replicas_v0,disk_groups,clean,journal_seq_blacklist
Superblock size: 11616
Members (size 120):
Device 0:
UUID: a125bf98-67e1-497b-88c6-b85a8bc84f81
Size: 931.5G
Bucket size: 512.0K
First bucket: 0
Buckets: 1907739
Last mount: Sat Jan 16 14:01:39 2021
State: readwrite
Group: nvme (0)
Data allowed: journal,btree,user
Has data: (none)
Replacement policy: lru
Discard: 0
Device 1:
UUID: 3fb436fa-e472-4177-912b-c201102f0b18
Size: 651.7G
Bucket size: 512.0K
First bucket: 0
Buckets: 1334656
Last mount: Sat Jan 16 14:01:39 2021
State: readwrite
Group: nvme (0)
Data allowed: journal,btree,user
Has data: (none)
Replacement policy: lru
Discard: 0
I don't know exactly what causes it, but the set of suspects is: AMD-Vi, vfio, pci freeze, cpu freeze, bcachefs, and the bcachefs version mismatch with the live usb.
I will track this. Maybe I can produce more useful bug information as things drag on.
I ran fsck on the raid, because I was still getting data checksum errors:
[root@archusb ~]# bcachefs fsck -fp /dev/nvme0n1 /dev/nvme1n1p2
recovering from clean shutdown, journal seq 6401305
journal read done, 0 keys in 1 entries, seq 6401306
starting mark and sweep
bucket 0:174677 gen 2 data type none has wrong data_type: got 0, should be 4, fixing
bucket 0:174677 gen 2 data type user has wrong dirty_sectors: got 0, should be 40, fixing
bucket 0:175203 gen 2 data type user has wrong data_type: got 4, should be 0, fixing
bucket 0:175203 gen 2 data type none has wrong dirty_sectors: got 384, should be 0, fixing
bucket 0:175519 gen 2 data type user has wrong dirty_sectors: got 141, should be 8, fixing
bucket 0:1346072 gen 1 data type user has wrong dirty_sectors: got 77, should be 205, fixing
bucket 0:1700423 gen 1 data type none has wrong data_type: got 0, should be 4, fixing
bucket 0:1700423 gen 1 data type user has wrong dirty_sectors: got 0, should be 8, fixing
bucket 1:945708 gen 1 data type user has wrong dirty_sectors: got 6, should be 134, fixing
bucket 1:945709 gen 1 data type user has wrong dirty_sectors: got 32, should be 160, fixing
bucket 1:1167945 gen 1 data type none has wrong data_type: got 0, should be 4, fixing
bucket 1:1167945 gen 1 data type user has wrong dirty_sectors: got 0, should be 24, fixing
bucket 1:1194422 gen 1 data type none has wrong data_type: got 0, should be 4, fixing
bucket 1:1194422 gen 1 data type user has wrong dirty_sectors: got 0, should be 8, fixing
bucket 1:1247964 gen 1 data type user has wrong dirty_sectors: got 101, should be 205, fixing
bucket 1:1270159 gen 1 data type none has wrong data_type: got 0, should be 4, fixing
bucket 1:1270159 gen 1 data type user has wrong dirty_sectors: got 0, should be 8, fixing
bucket 1:1270325 gen 1 data type user has wrong dirty_sectors: got 32, should be 56, fixing
bucket 1:1270385 gen 1 data type user has wrong dirty_sectors: got 24, should be 8, fixing
bucket 1:1270618 gen 1 data type user has wrong dirty_sectors: got 211, should be 144, fixing
fs has wrong user: 1/1 [1]: got 613486459, should be 613486800, fixing
fs has wrong user: 1/1 [0]: got 877929170, should be 877928829, fixing
starting fsck
dirent points to missing inode:
u64s 9 type dirent 456110:4023759923222321886 snap 0 len 0 ver 0: default.addnhosts -> 6846 type 8, fixing
hash table key at wrong offset: btree 3, 161933195682246499, hashed to 2348500190113702514 chain starts at 161933195682246499
u64s 8 type xattr 6816:161933195682246499 snap 0 len 0 ver 0: user.coredump.pid:1563, fixing
hash_redo_key err -17
Error in recovery: error in fsck (-17)
error opening /dev/nvme0n1: File exists
As one might expect, checksum errors persist after fsck:
[ 77.128108] bcachefs (nvme0n1 inum 1744830464 offset 968): data checksum error: expected 0:e4b4dbdb got 0:4a7d7ad0 (type 5)
[ 77.128130] __bch2_read_extent: 6 callbacks suppressed
[ 77.128131] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[ 77.128228] bcachefs (nvme0n1 inum 1744830464 offset 968): data checksum error: expected 0:e4b4dbdb got 0:4a7d7ad0 (type 5)
[ 77.128233] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[ 77.128366] bcachefs (nvme0n1 inum 1744830464 offset 11376): data checksum error: expected 0:1482484f got 0:1466fa56 (type 5)
[ 77.128372] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[ 77.128480] bcachefs (nvme0n1 inum 1744830464 offset 11376): data checksum error: expected 0:1482484f got 0:1466fa56 (type 5)
[ 77.128485] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[ 77.128595] bcachefs (nvme0n1 inum 1744830464 offset 944): data checksum error: expected 0:bc5ab7c1 got 0:e613a192 (type 5)
[ 77.128620] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[ 77.128715] bcachefs (nvme0n1 inum 1744830464 offset 944): data checksum error: expected 0:bc5ab7c1 got 0:e613a192 (type 5)
[ 77.128724] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[ 77.128843] bcachefs (nvme0n1 inum 1744830464 offset 10704): data checksum error: expected 0:7821596 got 0:bc4d42f (type 5)
[ 77.128852] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[ 77.128960] bcachefs (nvme0n1 inum 1744830464 offset 10704): data checksum error: expected 0:7821596 got 0:bc4d42f (type 5)
[ 77.128968] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[ 77.141696] bcachefs (nvme0n1 inum 1744830464 offset 968): data checksum error: expected 0:e4b4dbdb got 0:4a7d7ad0 (type 5)
[ 77.141723] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[ 77.141829] bcachefs (nvme0n1 inum 1744830464 offset 968): data checksum error: expected 0:e4b4dbdb got 0:4a7d7ad0 (type 5)
[ 77.141841] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
Is there a way to find out which file the inum is pointing to?
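On a mounted filesystem you can usually map an inode number back to its path(s) with find. A sketch (the helper name is mine; note that bcachefs may report subvolume-relative inode numbers, in which case this may not match):

```shell
#!/bin/sh
# paths_for_inum: print every path under $1 (staying on one filesystem
# via -xdev) whose inode number equals $2.
paths_for_inum() {
    find "$1" -xdev -inum "$2" 2>/dev/null
}

# Hypothetical usage for the inum from the dmesg lines above:
#   paths_for_inum / 1744830464
```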
Clean reboot after a navi gpu lockup. (I remove the pci device for the gpu and try a soft reset; the whole system stays responsive.)
[ 23.581527] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): error validating btree node at btree inodes level 0/1
u64s 12 type btree_ptr_v2 0:2132469 snap 0 len 0 ver 0: seq b8492c8a9e30821b sectors 512 written 0 min_key 0:2129689 ptr: 0:116561408 gen 3 ptr: 1:116529664 gen 4
node offset 0: bad magic
[ 23.581535] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): retrying read
[ 23.581777] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): error validating btree node at btree inodes level 0/1
u64s 12 type btree_ptr_v2 0:2132469 snap 0 len 0 ver 0: seq b8492c8a9e30821b sectors 512 written 0 min_key 0:2129689 ptr: 0:116561408 gen 3 ptr: 1:116529664 gen 4
node offset 0: bad magic
I am greeted with files missing:
$ virsh
zsh: Input/output error: virsh
Since the raid has 2 metadata replicas (on top of CoW), I'd call this "very bad".
It is a bit too painful having ordinary reboots kill the filesystem, so I will switch to something else. Also, showing at least the filenames of what is permanently removed during fix_errors is absolutely essential.