
Frequent reboots broke the filesystem

Eximius opened this issue 4 years ago · 11 comments

It seems that frequent rebooting (via sudo reboot, because I am playing around with VFIO and it frequently puts the card into a state Linux cannot reset) managed to break a raid0 (native bcachefs raid) NVMe bcachefs root.

The bcachefs setup is:

/dev/nvme0n1:/dev/nvme1n1p2

Can't tell exactly how many files are missing.

Pacman says the following:

[root@rig vytautas]# pacman -Q --check | grep -v '0 missing'
warning: filesystem: /etc/group (No such file or directory)
warning: filesystem: /etc/gshadow (No such file or directory)
filesystem: 118 total files, 2 missing files
warning: okteta: /usr/include/KastenControllers/Kasten/ModifiedBarControllerFactory (No such file or directory)
warning: okteta: /usr/include/KastenControllers/Kasten/QuitControllerFactory (No such file or directory)
warning: okteta: /usr/include/KastenControllers/Kasten/ReadOnlyBarControllerFactory (No such file or directory)
warning: okteta: /usr/include/KastenControllers/Kasten/ReadOnlyControllerFactory (No such file or directory)
warning: okteta: /usr/include/KastenControllers/Kasten/SelectControllerFactory (No such file or directory)
warning: okteta: /usr/include/KastenControllers/Kasten/SetRemoteControllerFactory (No such file or directory)
okteta: 1115 total files, 6 missing files
warning: python-cycler: /usr/share/licenses/python-cycler/ (No such file or directory)
warning: python-cycler: /usr/share/licenses/python-cycler/LICENSE (No such file or directory)
python-cycler: 18 total files, 2 missing files
error: could not open file /var/lib/pacman/local/spice-protocol-0.14.2-1/files: No such file or directory
bash: __prompt: command not found
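
For an exact total, something like this one-liner should work (a sketch, relying on the "N missing files" summary format pacman prints above):

pacman -Qk 2>/dev/null | awk -F'[ ,]+' '/missing/ {n += $(NF-2)} END {print n, "missing files total"}'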

Notably, yes, /etc/group and /etc/gshadow are screwed up; however, their metadata still exists:

-?????????   ? ?          ?       ?            ? group
-rw-r--r--   1 root       root 1.2K Dec  1 18:44 group-
-?????????   ? ?          ?       ?            ? gshadow
-rw-------   1 root       root 1002 Dec  1 18:44 gshadow-

Attempting to delete those files yields an error:

[root@rig etc]# rm group
rm: cannot remove 'group': No such file or directory

Otherwise it would be an easy fix to restore from the group- and gshadow- backups.
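
Had the dead directory entries been removable, the recovery itself would be trivial, along these lines:

[root@rig etc]# rm -f group gshadow      # this is the step that fails with "No such file or directory"
[root@rig etc]# cp -a group- group       # restore from the backups shadow-utils keeps
[root@rig etc]# cp -a gshadow- gshadow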

Oddly, I had not even changed the groups around the last good boot, so the metadata/data got corrupted fairly randomly, which does not make me happy.

Very many random files are in this zombie state:

[root@rig ~]# pwd
/root
bash: __prompt: command not found
[root@rig ~]# ls bin
pcie_hot_reset.sh  reset_amdgpu.sh  unbind_fb
bash: __prompt: command not found
[root@rig ~]# cat bin/*
cat: bin/pcie_hot_reset.sh: No such file or directory
cat: bin/reset_amdgpu.sh: No such file or directory
#!/bin/bash 

echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

VIRSH_GPU="pci_0000_0a_00_0"
VIRSH_GPU_AUDIO="pci_0000_0a_00_1"

virsh nodedev-detach "$VIRSH_GPU"
virsh nodedev-detach "$VIRSH_GPU_AUDIO"

## Load vfio.
modprobe vfio-pci
bash: __prompt: command not found

The journal is gone as well.

I guess this issue report is borderline hot garbage, but I'd imagine bcachefs could have a reboot bug. I highly doubt I managed to brick the files by force-resetting the machine (which I didn't do until the machine would no longer bring up the eth interface or be reachable over LAN). To get the machine to boot correctly, I used an Arch Linux USB stick with a freshly compiled linux-bcachefs-git kernel build (which presumably just takes the very newest bcachefs git commit); there were 2 erroring files during mounting, which I didn't record.

To actually ask a question: why is the filesystem not able to handle "OK metadata, gone data"? And how is "no such file or directory" possible if the metadata is replicated across the two drives?

Does anyone care to comment?

Eximius avatar Jan 07 '21 22:01 Eximius

I tried running bcachefs fsck: https://gist.github.com/Eximius/5f2a70b00d61a0f8d1fb8b11dd6068d7

It doesn't make sense that the metadata is this messed up, even if one of the NVMe drives had a mid-btree-rebuild failure under a force reset.

Eximius avatar Jan 07 '21 23:01 Eximius

As another comment, it would be nice if fsck showed more than just the basename of the file when it reaches a broken dirent. :)

Eximius avatar Jan 07 '21 23:01 Eximius

So... it (or something similar) happened again. After a graceful reboot (after the GPU froze up), the machine was no longer reachable once it booted up. Booting from a USB stick gave the following:

[root@archusb tmp]# mount /dev/nvme0n1:/dev/nvme1n1p2 -t bcachefs R/
mount: /tmp/R: can't read superblock on /dev/nvme0n1:/dev/nvme1n1p2.
[  474.913837] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): duplicate journal entries on same device, exiting
[  474.917014] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): Unable to continue, halting
[  474.975664] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): nvme1n1p2 sector 975559 seq 4399496: journal checksum bad, exiting
[  474.981790] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): Unable to continue, halting
[  474.988344] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): Error in recovery: cannot allocate memory (1)
[  474.994286] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): filesystem contains errors: please report this to the developers

Eximius avatar Jan 10 '21 16:01 Eximius

So would you say the claim "The COW filesystem for Linux that won't eat your data" is destroyed?

davidak avatar Jan 16 '21 10:01 davidak

That is still in the future tense, though :)

I have no idea what's happening here. I just had a minor issue again that needed a live USB. I didn't write it down, but it was something like "journal replication flag poorly set". Reading through the whole journal (of the Linux install on the SSD raid) generated some checksum errors in the middle of the journal.

While I guess the corruption can be attributed to the force resets, the very last failure seems to have happened because of a CPU freeze.

Does a bcachefs version mismatch between the live USB kernel and the on-disk kernel have an effect? (That is, am I just creating the problem by mounting from the live USB?)

ssd-raid linux: Linux rig 5.10.5-arch1-1-bcachefs-git-307302-gfcf8a0889c12
live-usb linux: Linux 5.10.4-arch2-1-bcachefs-git-307298-g91e7a706fd4f
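
To rule that out, a sketch: bcachefs show-super (from bcachefs-tools) dumps the on-disk format version, which can be compared from each environment:

[root@rig ~]# bcachefs show-super /dev/nvme0n1 | grep Version
Version:                        11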

The last lines of the journal before the PC went AWOL:

Jan 15 23:57:01 rig kernel: Code: 0d a4 95 19 58 74 01 c3 e8 73 60 58 ff c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57>
Jan 15 23:57:01 rig kernel: RSP: 0018:ffffa2188052ce98 EFLAGS: 00000246
Jan 15 23:57:01 rig kernel: RAX: 0000000000000001 RBX: ffff90f98a848808 RCX: 0000000000000000
Jan 15 23:57:01 rig kernel: RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
Jan 15 23:57:01 rig kernel: RBP: ffff90f98a848f28 R08: 0000000000000000 R09: ffffa2188052cc60
Jan 15 23:57:01 rig kernel: R10: 00000000016828d8 R11: 00000000016829d0 R12: 00000000ffffffff
Jan 15 23:57:01 rig kernel: R13: ffff90f98a848808 R14: ffff90f98a848f28 R15: ffff91182eb1cf80
Jan 15 23:57:01 rig kernel: FS:  0000000000000000(0000) GS:ffff91182eb00000(0000) knlGS:0000000000000000
Jan 15 23:57:01 rig kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 15 23:57:01 rig kernel: CR2: 0000559836336868 CR3: 0000001ec8810000 CR4: 0000000000350ee0
Jan 15 23:57:01 rig kernel: Call Trace:
Jan 15 23:57:01 rig kernel:  <IRQ>
Jan 15 23:57:01 rig kernel:  iova_domain_flush+0x1a/0x30
Jan 15 23:57:01 rig kernel:  fq_flush_timeout+0x2e/0xa0
Jan 15 23:57:01 rig kernel:  ? fq_ring_free+0xf0/0xf0
Jan 15 23:57:01 rig kernel:  ? fq_ring_free+0xf0/0xf0
Jan 15 23:57:01 rig kernel:  call_timer_fn+0x29/0x130
Jan 15 23:57:01 rig kernel:  __run_timers+0x1eb/0x270
Jan 15 23:57:01 rig kernel:  run_timer_softirq+0x19/0x30
Jan 15 23:57:01 rig kernel:  __do_softirq+0xc8/0x2b5
Jan 15 23:57:01 rig kernel:  asm_call_irq_on_stack+0xf/0x20
Jan 15 23:57:01 rig kernel:  </IRQ>
Jan 15 23:57:01 rig kernel:  do_softirq_own_stack+0x37/0x40
Jan 15 23:57:01 rig kernel:  irq_exit_rcu+0x9c/0xd0
Jan 15 23:57:01 rig kernel:  sysvec_apic_timer_interrupt+0x36/0x80
Jan 15 23:57:01 rig kernel:  asm_sysvec_apic_timer_interrupt+0x12/0x20
Jan 15 23:57:01 rig kernel: RIP: 0010:cpuidle_enter_state+0xc0/0x360
Jan 15 23:57:01 rig kernel: Code: 3d b5 d8 40 58 e8 00 23 8c ff 49 89 c5 0f 1f 44 00 00 31 ff e8 11 30 8c ff 41 83 e7 01 0f 85 cc 01 00 00 fb 66 0f 1f 44 00>
Jan 15 23:57:01 rig kernel: RSP: 0018:ffffa218801bfea8 EFLAGS: 00000246
Jan 15 23:57:01 rig kernel: RAX: ffff91182eb2c180 RBX: ffff90f98b8c9c00 RCX: 000000000000001f
Jan 15 23:57:01 rig kernel: RDX: 0000000000000000 RSI: 0000000024a3dfd3 RDI: 0000000000000000
Jan 15 23:57:01 rig kernel: RBP: 0000000000000002 R08: 0000c2952db70241 R09: 0000c294fd8959a8
Jan 15 23:57:01 rig kernel: R10: 0000000000000344 R11: 0000000000000924 R12: ffffffffa8f48540
Jan 15 23:57:01 rig kernel: R13: 0000c2952db70241 R14: 0000000000000002 R15: 0000000000000000
Jan 15 23:57:01 rig kernel:  ? cpuidle_enter_state+0xaf/0x360
Jan 15 23:57:01 rig kernel:  cpuidle_enter+0x29/0x40
Jan 15 23:57:01 rig kernel:  do_idle+0x1e3/0x280
Jan 15 23:57:01 rig kernel:  cpu_startup_entry+0x19/0x20
Jan 15 23:57:01 rig kernel:  secondary_startup_64_no_verify+0xc2/0xcb
Jan 15 23:57:01 rig kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x1001f96e0]
Jan 15 23:57:01 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:02 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:02 rig kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [libx2go-server-:1822429]
Jan 15 23:57:02 rig kernel: Modules linked in: tcp_diag inet_diag unix_diag xt_nat vhost_net vhost vhost_iotlb tap nf_conntrack_netlink nfnetlink xt_addrtyp>
Jan 15 23:57:02 rig kernel:  eeepc_wmi r8169 cec asus_wmi snd_timer ccp syscopyarea realtek crct10dif_pclmul snd mdio_devres ghash_clmulni_intel sparse_keym>
Jan 15 23:57:02 rig kernel: CPU: 5 PID: 1822429 Comm: libx2go-server- Tainted: G           OEL    5.10.5-arch1-1-bcachefs-git-307302-gfcf8a0889c12 #3
Jan 15 23:57:02 rig kernel: Hardware name: ASUS System Product Name/TUF GAMING B550M-PLUS (WI-FI), BIOS 0803 06/30/2020
Jan 15 23:57:02 rig kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x30
Jan 15 23:57:02 rig kernel: Code: 0d a4 95 19 58 74 01 c3 e8 73 60 58 ff c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57>
Jan 15 23:57:02 rig kernel: RSP: 0000:ffffa21880388e98 EFLAGS: 00000246
Jan 15 23:57:02 rig kernel: RAX: 0000000000000001 RBX: ffff90f98a789808 RCX: 0000000000000000
Jan 15 23:57:02 rig kernel: RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
Jan 15 23:57:02 rig kernel: RBP: ffff90f98a789f28 R08: 0000000000000000 R09: ffffa21880388c60
Jan 15 23:57:02 rig kernel: R10: 0000000001685648 R11: 0000000001685738 R12: 00000000ffffffff
Jan 15 23:57:02 rig kernel: R13: ffff90f98a789808 R14: ffff90f98a789f28 R15: ffff91182e95cf80
Jan 15 23:57:02 rig kernel: FS:  00007f0f0a2da740(0000) GS:ffff91182e940000(0000) knlGS:0000000000000000
Jan 15 23:57:02 rig kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 15 23:57:02 rig kernel: CR2: 0000559fbb91d098 CR3: 0000000e37312000 CR4: 0000000000350ee0
Jan 15 23:57:02 rig kernel: Call Trace:
Jan 15 23:57:02 rig kernel:  <IRQ>
Jan 15 23:57:02 rig kernel:  iova_domain_flush+0x1a/0x30
Jan 15 23:57:02 rig kernel:  fq_flush_timeout+0x2e/0xa0
Jan 15 23:57:02 rig kernel:  ? fq_ring_free+0xf0/0xf0
Jan 15 23:57:02 rig kernel:  ? fq_ring_free+0xf0/0xf0
Jan 15 23:57:02 rig kernel:  call_timer_fn+0x29/0x130
Jan 15 23:57:02 rig kernel:  __run_timers+0x1eb/0x270
Jan 15 23:57:02 rig kernel:  run_timer_softirq+0x19/0x30
Jan 15 23:57:02 rig kernel:  __do_softirq+0xc8/0x2b5
Jan 15 23:57:02 rig kernel:  asm_call_irq_on_stack+0xf/0x20
Jan 15 23:57:02 rig kernel:  </IRQ>
Jan 15 23:57:02 rig kernel:  do_softirq_own_stack+0x37/0x40
Jan 15 23:57:02 rig kernel:  irq_exit_rcu+0x9c/0xd0
Jan 15 23:57:02 rig kernel:  sysvec_apic_timer_interrupt+0x36/0x80
Jan 15 23:57:02 rig kernel:  ? asm_sysvec_apic_timer_interrupt+0xa/0x20
Jan 15 23:57:02 rig kernel:  asm_sysvec_apic_timer_interrupt+0x12/0x20
Jan 15 23:57:02 rig kernel: RIP: 0033:0x7f0f0a720af1
Jan 15 23:57:02 rig kernel: Code: 05 00 00 0f b6 45 22 83 e0 b3 3c 22 0f 84 93 06 00 00 48 85 db 0f 85 05 01 00 00 0f b7 45 20 89 c2 66 81 e2 ff 01 66 83 fa>
Jan 15 23:57:02 rig kernel: RSP: 002b:00007ffe2e527230 EFLAGS: 00000297
Jan 15 23:57:02 rig kernel: RAX: 0000000000000609 RBX: 0000559fbb91d710 RCX: 0000000000004496
Jan 15 23:57:02 rig kernel: RDX: 0000000000000009 RSI: 0000559fbb914c88 RDI: 0000559fbb5442a0
Jan 15 23:57:02 rig kernel: RBP: 0000559fbb914c88 R08: 0000559fbb544a40 R09: 0000000000000001
Jan 15 23:57:02 rig kernel: R10: 0000559fbb914ba8 R11: 0000000000000040 R12: 0000000000000000
Jan 15 23:57:02 rig kernel: R13: 00007f0f0a88cc38 R14: 0000559fbb5442a0 R15: 0000559fbb914c50
...
Jan 15 23:57:06 rig kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Jan 15 23:57:06 rig kernel: rcu:         5-....: (1 GPs behind) idle=fc2/1/0x4000000000000000 softirq=7328538/7328539 fqs=5515 last_accelerate: fcbf/4344 dy>
Jan 15 23:57:06 rig kernel: rcu:         12-....: (1 GPs behind) idle=56a/1/0x4000000000000002 softirq=7574158/7574159 fqs=5515 last_accelerate: fcc9/4344 d>
Jan 15 23:57:06 rig kernel:         (detected by 11, t=18002 jiffies, g=14072169, q=74468)
Jan 15 23:57:06 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:06 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:06 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:06 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:06 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:06 rig kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x1001f99b0]
Jan 15 23:57:07 rig kernel: AMD-Vi: Completion-Wait loop timed out
Jan 15 23:57:07 rig kernel: AMD-Vi: Completion-Wait loop timed out

Eximius avatar Jan 16 '21 12:01 Eximius

Ahaha! Virtualization locked up the machine again. After 2 force resets and booting into the live USB, I have:

bcachefs: duplicate journal entries on same device, fixing
bcachefs: superblock not marked as containing replicas journal: 1/1 [1], fixing

Eximius avatar Jan 16 '21 13:01 Eximius

Do you have an idea which steps are needed to reproduce it?

I have one NVMe disk as my root fs and rebooted 15 times with no issues. Maybe it happens only with raid?
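
A minimal reproduction sketch of that setup might look like this (untested; device names are placeholders, and the replica flags are as I understand bcachefs-tools):

bcachefs format --metadata_replicas=2 --data_replicas=1 /dev/nvme0n1 /dev/nvme1n1p2
mount -t bcachefs /dev/nvme0n1:/dev/nvme1n1p2 /mnt
dd if=/dev/urandom of=/mnt/churn bs=1M count=4096 &   # sustained writes
echo b > /proc/sysrq-trigger                          # hard reboot, no sync (simulates the force reset)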

davidak avatar Jan 16 '21 15:01 davidak

The journal replication issue likely shows up when you have a raid0 in bcachefs.

bcachefs setup:

External UUID:                  379774e3-3bc1-437f-81ca-970b92d7846b
Internal UUID:                  3ad677e8-9ab0-4a84-b405-9786dc60b160
Label:
Version:                        11
Created:                        Mon Nov 23 02:03:52 2020
Sequence number:                376
Block_size:                     512
Btree node size:                256.0K
Error action:                   remount-ro
Clean:                          0
Features:                       atomic_nlink,journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data
Metadata replicas:              2
Data replicas:                  1
Metadata checksum type:         crc32c (1)
Data checksum type:             crc32c (1)
Compression type:               none (0)
Foreground write target:        none
Background write target:        none
Promote target:                 none
String hash type:               siphash (2)
32 bit inodes:                  0
GC reserve percentage:          8%
Root reserve percentage:        0%
Devices:                        2 live, 2 total
Sections:                       journal,members,replicas_v0,disk_groups,clean,journal_seq_blacklist
Superblock size:                11616

Members (size 120):
  Device 0:
    UUID:                       a125bf98-67e1-497b-88c6-b85a8bc84f81
    Size:                       931.5G
    Bucket size:                512.0K
    First bucket:               0
    Buckets:                    1907739
    Last mount:                 Sat Jan 16 14:01:39 2021
    State:                      readwrite
    Group:                      nvme (0)
    Data allowed:               journal,btree,user
    Has data:                   (none)
    Replacement policy:         lru
    Discard:                    0
  Device 1:
    UUID:                       3fb436fa-e472-4177-912b-c201102f0b18
    Size:                       651.7G
    Bucket size:                512.0K
    First bucket:               0
    Buckets:                    1334656
    Last mount:                 Sat Jan 16 14:01:39 2021
    State:                      readwrite
    Group:                      nvme (0)
    Data allowed:               journal,btree,user
    Has data:                   (none)
    Replacement policy:         lru
    Discard:                    0

I don't know exactly what causes it, but the set of suspects is: AMD-Vi, VFIO, the PCI freeze, the CPU freeze, bcachefs, and the bcachefs version mismatch with the live USB.

I will keep tracking this; maybe I can produce more useful bug information as things drag on.

Eximius avatar Jan 16 '21 15:01 Eximius

I ran fsck on the raid, because I was still getting data checksum errors:

[root@archusb ~]# bcachefs fsck -fp /dev/nvme0n1 /dev/nvme1n1p2 
recovering from clean shutdown, journal seq 6401305
journal read done, 0 keys in 1 entries, seq 6401306
starting mark and sweep
bucket 0:174677 gen 2 data type none has wrong data_type: got 0, should be 4, fixing
bucket 0:174677 gen 2 data type user has wrong dirty_sectors: got 0, should be 40, fixing
bucket 0:175203 gen 2 data type user has wrong data_type: got 4, should be 0, fixing
bucket 0:175203 gen 2 data type none has wrong dirty_sectors: got 384, should be 0, fixing
bucket 0:175519 gen 2 data type user has wrong dirty_sectors: got 141, should be 8, fixing
bucket 0:1346072 gen 1 data type user has wrong dirty_sectors: got 77, should be 205, fixing
bucket 0:1700423 gen 1 data type none has wrong data_type: got 0, should be 4, fixing
bucket 0:1700423 gen 1 data type user has wrong dirty_sectors: got 0, should be 8, fixing
bucket 1:945708 gen 1 data type user has wrong dirty_sectors: got 6, should be 134, fixing
bucket 1:945709 gen 1 data type user has wrong dirty_sectors: got 32, should be 160, fixing
bucket 1:1167945 gen 1 data type none has wrong data_type: got 0, should be 4, fixing
bucket 1:1167945 gen 1 data type user has wrong dirty_sectors: got 0, should be 24, fixing
bucket 1:1194422 gen 1 data type none has wrong data_type: got 0, should be 4, fixing
bucket 1:1194422 gen 1 data type user has wrong dirty_sectors: got 0, should be 8, fixing
bucket 1:1247964 gen 1 data type user has wrong dirty_sectors: got 101, should be 205, fixing
bucket 1:1270159 gen 1 data type none has wrong data_type: got 0, should be 4, fixing
bucket 1:1270159 gen 1 data type user has wrong dirty_sectors: got 0, should be 8, fixing
bucket 1:1270325 gen 1 data type user has wrong dirty_sectors: got 32, should be 56, fixing
bucket 1:1270385 gen 1 data type user has wrong dirty_sectors: got 24, should be 8, fixing
bucket 1:1270618 gen 1 data type user has wrong dirty_sectors: got 211, should be 144, fixing
fs has wrong user: 1/1 [1]: got 613486459, should be 613486800, fixing
fs has wrong user: 1/1 [0]: got 877929170, should be 877928829, fixing
starting fsck
dirent points to missing inode:
u64s 9 type dirent 456110:4023759923222321886 snap 0 len 0 ver 0: default.addnhosts -> 6846 type 8, fixing
hash table key at wrong offset: btree 3, 161933195682246499, hashed to 2348500190113702514 chain starts at 161933195682246499
u64s 8 type xattr 6816:161933195682246499 snap 0 len 0 ver 0: user.coredump.pid:1563, fixing
hash_redo_key err -17
Error in recovery: error in fsck (-17)
error opening /dev/nvme0n1: File exists

As one might expect, checksum errors persist after fsck:

[   77.128108] bcachefs (nvme0n1 inum 1744830464 offset 968): data checksum error: expected 0:e4b4dbdb got 0:4a7d7ad0 (type 5)
[   77.128130] __bch2_read_extent: 6 callbacks suppressed
[   77.128131] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[   77.128228] bcachefs (nvme0n1 inum 1744830464 offset 968): data checksum error: expected 0:e4b4dbdb got 0:4a7d7ad0 (type 5)
[   77.128233] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[   77.128366] bcachefs (nvme0n1 inum 1744830464 offset 11376): data checksum error: expected 0:1482484f got 0:1466fa56 (type 5)
[   77.128372] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[   77.128480] bcachefs (nvme0n1 inum 1744830464 offset 11376): data checksum error: expected 0:1482484f got 0:1466fa56 (type 5)
[   77.128485] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[   77.128595] bcachefs (nvme0n1 inum 1744830464 offset 944): data checksum error: expected 0:bc5ab7c1 got 0:e613a192 (type 5)
[   77.128620] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[   77.128715] bcachefs (nvme0n1 inum 1744830464 offset 944): data checksum error: expected 0:bc5ab7c1 got 0:e613a192 (type 5)
[   77.128724] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[   77.128843] bcachefs (nvme0n1 inum 1744830464 offset 10704): data checksum error: expected 0:7821596 got 0:bc4d42f (type 5)
[   77.128852] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[   77.128960] bcachefs (nvme0n1 inum 1744830464 offset 10704): data checksum error: expected 0:7821596 got 0:bc4d42f (type 5)
[   77.128968] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[   77.141696] bcachefs (nvme0n1 inum 1744830464 offset 968): data checksum error: expected 0:e4b4dbdb got 0:4a7d7ad0 (type 5)
[   77.141723] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from
[   77.141829] bcachefs (nvme0n1 inum 1744830464 offset 968): data checksum error: expected 0:e4b4dbdb got 0:4a7d7ad0 (type 5)
[   77.141841] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b inum 1744830464): no device to read from

Is there a way to find out which file the inum points to?
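
If the printed inum is the ordinary VFS inode number, find(1) can map it back to a path (a sketch; slow on a full filesystem, and the assumption about the inum may not hold):

find / -xdev -inum 1744830464 2>/dev/null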

Eximius avatar Jan 17 '21 14:01 Eximius

Clean reboot after a Navi GPU lockup. (I would remove the PCI device for the GPU and try a soft reset; the whole system stays responsive.)

[   23.581527] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): error validating btree node at btree inodes level 0/1
                 u64s 12 type btree_ptr_v2 0:2132469 snap 0 len 0 ver 0: seq b8492c8a9e30821b sectors 512 written 0 min_key 0:2129689 ptr: 0:116561408 gen 3 ptr: 1:116529664 gen 4
                 node offset 0: bad magic
[   23.581535] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): retrying read
[   23.581777] bcachefs (379774e3-3bc1-437f-81ca-970b92d7846b): error validating btree node at btree inodes level 0/1
                 u64s 12 type btree_ptr_v2 0:2132469 snap 0 len 0 ver 0: seq b8492c8a9e30821b sectors 512 written 0 min_key 0:2129689 ptr: 0:116561408 gen 3 ptr: 1:116529664 gen 4
                 node offset 0: bad magic

I am greeted with files missing:

$ virsh
zsh: Input/output error: virsh

Since the raid has 2 metadata replicas (on top of CoW), I'd call this "very bad".

Eximius avatar Jan 17 '21 15:01 Eximius

It is a bit too painful having ordinary reboots kill the filesystem, so I will switch to something else. Also, showing at least the filenames of permanently removed files during fix_errors is absolutely essential.

Eximius avatar Jan 17 '21 17:01 Eximius