[master] reference to invalid bucket
I just got an emergency read-only after the following error:
[98315.023660] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): reference to invalid bucket
u64s 13 type alloc_v4 15:1049075:0 len 0 ver 0:
gen 1 oldest_gen 1 data_type need_discard
journal_seq_nonempty 16077875
journal_seq_empty 16078733
need_discard 1
need_inc_gen 0
dirty_sectors 0
stripe_sectors 0
cached_sectors 0
stripe 0
stripe_redundancy 0
io_time[READ] 2554665168832
io_time[WRITE] 2609418830136
fragmentation 0
bp_start 8
[98315.023668] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): inconsistency detected - emergency read only at journal seq 16078733
[98315.023670] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): bch2_trans_commit_write_locked(): fatal error fatal error in transaction commit: EIO
[98315.026324] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): /home/silvio/.mozilla/firefox/default/storage/default/https+++element.booq.org/ls/data.sqlite offset 8192: write error: btree update error: EIO
from internal move u64s 10 type extent 58135771:56:4294967280 len 24 ver 709463959: durability: 2 crc: c_size 32 size 80 offset 32 nonce 0 csum chacha20_poly1305_80 adc0:d2aed89d6f57d66a compress
zstd ptr: 15:1049075:1552 gen 0 invalid ptr: 4:1640233:848 gen 31 rebalance: replicas=2 checksum=crc32c background_compression=zstd background_target=sata promote_target=nvme
[98315.052040] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): unclean shutdown complete, journal seq 16078733
I'm on commit a32d248c66703f54e594d13571cd7ea376600304 from the master branch.
On reboot I got:
[ 19.223157] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): starting version 1.25: extent_flags opts=metadata_replicas=3,data_replicas=2,compression=zstd,foreground_target=nvme,background_target=sata,promote_target=nvme,nopromote_whole_extents
[ 19.223164] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): recovering from unclean shutdown
[ 76.740847] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): journal read done, replaying entries 16077927-16078732
[ 76.740854] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): dropped unflushed entries 16078733-16078733
[ 78.336970] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): accounting_read... done
[ 78.899370] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): alloc_read... done
[ 78.931654] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): stripes_read... done
[ 78.931658] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): snapshots_read... done
[ 79.016878] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): going read-write
[ 79.027833] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): journal_replay... done
[ 86.721599] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): resume_logged_ops... done
[ 86.843797] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): delete_dead_inodes... done
[ 87.107863] stage-1-init: [Sat Mar 8 12:49:44 UTC 2025] mounting /mnt-root/bcachefs/nix on /nix...
I guess the one unflushed journal entry was the faulty one, because now everything seems to be fine again. I can read the file mentioned in the error log without issues.
Has it happened again? There's a real alloc key referencing an invalid bucket so I expect it might - but if it does, fsck should correct it.
Curious what happened, but since it's need_discard no data should be affected.
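If it does, that check would look something like an offline
$ bcachefs fsck /dev/sdX
or a one-off mount with -o fsck,fix_errors (the device path here is only a placeholder, since the thread doesn't give one).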
It hasn't happened again yet. I'll report back when/if it does.
I saw this on 6.14 after an online resize, also with a need_discard key.
Do you have the log, and the old and new nbuckets for the device being resized?
It just happened again after a resize:
[1693291.408531] bcachefs (f7fa14ed-5a8e-4b14-b39a-8b5c21f8bc25): reference to invalid bucket
u64s 13 type alloc_v4 9:2099818:0 len 0 ver 0:
gen 1 oldest_gen 0 data_type need_discard
journal_seq_nonempty 119936735
journal_seq_empty 119936767
need_discard 1
need_inc_gen 0
dirty_sectors 0
stripe_sectors 0
cached_sectors 0
stripe 0
stripe_redundancy 0
io_time[READ] 4840184893712
io_time[WRITE] 4714504893368
fragmentation 0
bp_start 8
[1693291.414790] bcachefs (f7fa14ed-5a8e-4b14-b39a-8b5c21f8bc25): inconsistency detected - emergency read only at journal seq 119936767
[1693291.415177] bcachefs (f7fa14ed-5a8e-4b14-b39a-8b5c21f8bc25): bch2_trans_commit_write_locked(): fatal error fatal error in transaction commit: EIO
[1693291.415491] bcachefs (f7fa14ed-5a8e-4b14-b39a-8b5c21f8bc25): inum 0:1438453 offset 3026944: write error(internal move): btree update error: EIO
[1693294.899616] bcachefs (f7fa14ed-5a8e-4b14-b39a-8b5c21f8bc25): unclean shutdown complete, journal seq 119936767
The old and new device sizes come through via virtio:
[1693221.255412] virtio_blk virtio3: [vdb] new size: 7814029312 512-byte logical blocks (4.00 TB/3.64 TiB)
[1693221.255420] vdb: detected capacity change from 6442450944 to 7814029312
[1693231.492972] virtio_blk virtio4: [vdc] new size: 5860524032 512-byte logical blocks (3.00 TB/2.73 TiB)
[1693231.492981] vdc: detected capacity change from 4294967296 to 5860524032
And the new bucket counts come from bcachefs device resize:
$ sudo bcachefs device resize /dev/vdb
Doing online resize of /dev/vdb
resizing /dev/vdb to 3815444 buckets
$ sudo bcachefs device resize /dev/vdc
Doing online resize of /dev/vdc
resizing /dev/vdc to 2861584 buckets
I'm just missing the old bucket counts, but they can likely be estimated quite accurately as (newbuckets / newsize) * oldsize.
Oh, and /dev/vdc is device index 9 in the journal log. The bucket index from the error is very likely greater than the old nbuckets. The read-only occurred a good few seconds after the resize, so my hypothesis is that the transaction should have been valid, and the old bucket count is somehow not being invalidated somewhere.
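Working that estimate through with the numbers above (assuming the bucket size didn't change across the resize): for vdc, 2861584 buckets over 5860524032 sectors is exactly 2048 sectors (1 MiB) per bucket, so the old count comes out to 4294967296 / 2048 = 2097152 buckets. The invalid bucket index 2099818 is above that but below the new 2861584, which fits the hypothesis.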
I think this bug might still be active.
This was a resize bug, with something particular needed to make it pop, wasn't it? It's not showing up in the automated tests, so can you give me more info?
What makes this bug trigger is having 2 resizes done within the same second. One of them will not complete correctly.
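A minimal sketch of that trigger, reusing the commands and device names from above (both backing devices already grown, e.g. via virtio):
$ sudo bcachefs device resize /dev/vdb && sudo bcachefs device resize /dev/vdc
i.e. the two online resizes issued back to back so they land within the same second.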
I wrote a multi-device resize test and got a similar but not identical error to pop on 6.14; the test is passing on 6.16. Could either of you confirm that it's fixed?