[master] reference to invalid bucket
I just got an emergency read-only after the following error:
[98315.023660] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): reference to invalid bucket
u64s 13 type alloc_v4 15:1049075:0 len 0 ver 0:
gen 1 oldest_gen 1 data_type need_discard
journal_seq_nonempty 16077875
journal_seq_empty 16078733
need_discard 1
need_inc_gen 0
dirty_sectors 0
stripe_sectors 0
cached_sectors 0
stripe 0
stripe_redundancy 0
io_time[READ] 2554665168832
io_time[WRITE] 2609418830136
fragmentation 0
bp_start 8
[98315.023668] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): inconsistency detected - emergency read only at journal seq 16078733
[98315.023670] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): bch2_trans_commit_write_locked(): fatal error fatal error in transaction commit: EIO
[98315.026324] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): /home/silvio/.mozilla/firefox/default/storage/default/https+++element.booq.org/ls/data.sqlite offset 8192: write error: btree update error: EIO
from internal move u64s 10 type extent 58135771:56:4294967280 len 24 ver 709463959: durability: 2 crc: c_size 32 size 80 offset 32 nonce 0 csum chacha20_poly1305_80 adc0:d2aed89d6f57d66a compress
zstd ptr: 15:1049075:1552 gen 0 invalid ptr: 4:1640233:848 gen 31 rebalance: replicas=2 checksum=crc32c background_compression=zstd background_target=sata promote_target=nvme
[98315.052040] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): unclean shutdown complete, journal seq 16078733
I'm on commit a32d248c66703f54e594d13571cd7ea376600304 from the master branch.
On reboot I got:
[ 19.223157] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): starting version 1.25: extent_flags opts=metadata_replicas=3,data_replicas=2,compression=zstd,foreground_target=nvme,background_target=sata,promote_target=nvme,nopromote_whole_extents
[ 19.223164] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): recovering from unclean shutdown
[ 76.740847] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): journal read done, replaying entries 16077927-16078732
[ 76.740854] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): dropped unflushed entries 16078733-16078733
[ 78.336970] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): accounting_read... done
[ 78.899370] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): alloc_read... done
[ 78.931654] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): stripes_read... done
[ 78.931658] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): snapshots_read... done
[ 79.016878] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): going read-write
[ 79.027833] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): journal_replay... done
[ 86.721599] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): resume_logged_ops... done
[ 86.843797] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): delete_dead_inodes... done
[ 87.107863] stage-1-init: [Sat Mar 8 12:49:44 UTC 2025] mounting /mnt-root/bcachefs/nix on /nix...
I guess the one unflushed journal entry was the faulty one, because now everything seems to be fine again. I can read the file mentioned in the error log without issues.
Has it happened again? There's a real alloc key referencing an invalid bucket so I expect it might - but if it does, fsck should correct it.
Curious what happened, but since it's need_discard no data should be affected.
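If it does, that check would look something like an offline
$ bcachefs fsck /dev/sdX
or a one-off mount with -o fsck,fix_errors (the device path here is only a placeholder, since the thread doesn't give one).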
It hasn't happened again yet. I'll report back when/if it does.
I saw this on 6.14 after an online resize, also with a need_discard key.
Do you have the log, and the old and new nbuckets for the device being resized?
It just happened again after a resize:
[1693291.408531] bcachefs (f7fa14ed-5a8e-4b14-b39a-8b5c21f8bc25): reference to invalid bucket
u64s 13 type alloc_v4 9:2099818:0 len 0 ver 0:
gen 1 oldest_gen 0 data_type need_discard
journal_seq_nonempty 119936735
journal_seq_empty 119936767
need_discard 1
need_inc_gen 0
dirty_sectors 0
stripe_sectors 0
cached_sectors 0
stripe 0
stripe_redundancy 0
io_time[READ] 4840184893712
io_time[WRITE] 4714504893368
fragmentation 0
bp_start 8
[1693291.414790] bcachefs (f7fa14ed-5a8e-4b14-b39a-8b5c21f8bc25): inconsistency detected - emergency read only at journal seq 119936767
[1693291.415177] bcachefs (f7fa14ed-5a8e-4b14-b39a-8b5c21f8bc25): bch2_trans_commit_write_locked(): fatal error fatal error in transaction commit: EIO
[1693291.415491] bcachefs (f7fa14ed-5a8e-4b14-b39a-8b5c21f8bc25): inum 0:1438453 offset 3026944: write error(internal move): btree update error: EIO
[1693294.899616] bcachefs (f7fa14ed-5a8e-4b14-b39a-8b5c21f8bc25): unclean shutdown complete, journal seq 119936767
The old and new device sizes come through via virtio:
[1693221.255412] virtio_blk virtio3: [vdb] new size: 7814029312 512-byte logical blocks (4.00 TB/3.64 TiB)
[1693221.255420] vdb: detected capacity change from 6442450944 to 7814029312
[1693231.492972] virtio_blk virtio4: [vdc] new size: 5860524032 512-byte logical blocks (3.00 TB/2.73 TiB)
[1693231.492981] vdc: detected capacity change from 4294967296 to 5860524032
And the new bucket counts come from bcachefs device resize:
$ sudo bcachefs device resize /dev/vdb
Doing online resize of /dev/vdb
resizing /dev/vdb to 3815444 buckets
$ sudo bcachefs device resize /dev/vdc
Doing online resize of /dev/vdc
resizing /dev/vdc to 2861584 buckets
I'm just missing the old bucket counts, but they can likely be estimated quite accurately as (newbuckets / newsize) * oldsize.
Oh, and /dev/vdc is device index 9 in the journal log. The bucket index from the error is very likely greater than the old nbuckets. The read-only occurred a good few seconds after the resize, so my hypothesis is that the transaction should have been valid, and the old bucket count is somehow not being invalidated somewhere.
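Working that estimate through with the numbers above (assuming the bucket size didn't change across the resize): for vdc, 2861584 buckets over 5860524032 sectors is exactly 2048 sectors (1 MiB) per bucket, so the old count comes out to 4294967296 / 2048 = 2097152 buckets. The invalid bucket index 2099818 is above that but below the new 2861584, which fits the hypothesis.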
I think this bug might still be active.
This was a resize bug, with something particular needed to make it pop, wasn't it? It's not showing up in the automated tests, so can you give me more info?
What makes this bug trigger is having 2 resizes done within the same second. One of them will not complete correctly.
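A minimal sketch of that trigger, reusing the commands and device names from above (both backing devices already grown, e.g. via virtio):
$ sudo bcachefs device resize /dev/vdb && sudo bcachefs device resize /dev/vdc
i.e. the two online resizes issued back to back so they land within the same second.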
I wrote a multi-device resize test and got a similar but not identical error to pop on 6.14; the test is passing on 6.16. Could either of you confirm that it's fixed?