Getting stuck trying to create a new 1 GiB file on an FS that has 41 GiBs of free space (needs: allocator to check for ENOSPC aside from disk reservations)

Open Jayman2000 opened this issue 3 months ago • 11 comments

I have a bcachefs filesystem that’s at 90% capacity. According to bcachefs fs usage, I have 41 GiB of free space left. Writes to that filesystem succeed, but only if the data is really small (e.g., a new plain text file that contains a single sentence). If I try to create a new file that contains 1 GiB of data, the program creating the file gets stuck, seemingly forever. If I then try to shut down the system, it takes a while because systemd gets stuck waiting for sd-sync to finish. Eventually, systemd gives up waiting for sd-sync and forcefully shuts down the system.

Version information

  • Linux version: 6.16.9
  • bcachefs-tools version: 1.31.3
  • I’m using the DKMS module that comes with bcachefs-tools.

Steps to reproduce

  1. Make sure that you have a problematic bcachefs filesystem. I don’t know how to create a problematic bcachefs filesystem from scratch, but I do have a backup of a problematic bcachefs filesystem that I’ve been using for testing.

  2. Mount the problematic filesystem by running this command:

    run0 mount UUID=<UUID of problematic filesystem> <mountpoint>
    
  3. Wait for that command to finish.

  4. Change directory into the newly mounted filesystem by running this command:

    cd <path to mountpoint>
    
  5. Try to create a new 1 GiB file in the newly mounted filesystem by running this command:

    run0 dd if=/dev/zero of='Test file' bs=1048576 count=1024 status=progress
    

Results

The dd command seemingly never finishes. After a little bit, it gets stuck showing something like this:

570425344 bytes (570 MB, 544 MiB) copied, 23 s, 24.7 MB/s

Here’s a log of kernel messages that were produced after dd got stuck.

Jayman2000 avatar Oct 02 '25 17:10 Jayman2000

Your filesystem somehow got itself very low on actual non-reserved space, and copygc does not seem to make progress (?). Please post bcachefs show-super output, bcachefs fs usage -ha before writing a large file, bcachefs fs usage -ha after the "Allocator stuck?" messages show up, and /sys/fs/bcachefs/*/internal/moving_ctxts contents at the same time.

himikof avatar Oct 12 '25 15:10 himikof

OK. I reproduced the bug again. This time I used Linux version 6.17.1, bcachefs-tools version 1.31.7 and the DKMS module that comes with that version of bcachefs-tools. Here’s the output from bcachefs show-super <device>:

External UUID:                             ccd95d13-0ffb-4123-9f77-59bc18232b38
Internal UUID:                             5d101165-1b29-4949-9fed-45d8174314ab
Magic number:                              c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index:                              0
Label:                                     (none)
Version:                                   1.28: inode_has_case_insensitive
Incompatible features allowed:             1.20: directory_size
Incompatible features in use:              0.0: (unknown version)
Version upgrade complete:                  1.28: inode_has_case_insensitive
Oldest version on disk:                    1.20: directory_size
Created:                                   Fri Jul 11 12:35:30 2025
Sequence number:                           517
Time of last write:                        Fri Sep 19 07:06:52 2025
Superblock size:                           5.20 KiB/1.00 MiB
Clean:                                     0
Devices:                                   1
Sections:                                  members_v1,replicas_v0,clean,journal_seq_blacklist,journal_v2,counters,members_v2,errors,ext,downgrade,recovery_passes
Features:                                  journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:                           alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options:
  block_size:                              512 B
  btree_node_size:                         256 KiB
  errors:                                  continue [fix_safe] panic ro
  write_error_timeout:                     30
  metadata_replicas:                       1
  data_replicas:                           1
  metadata_replicas_required:              1
  data_replicas_required:                  1
  encoded_extent_max:                      64.0 KiB
  metadata_checksum:                       none [crc32c] crc64 xxhash
  data_checksum:                           none [crc32c] crc64 xxhash
  checksum_err_retry_nr:                   3
  compression:                             none
  background_compression:                  none
  str_hash:                                crc32c crc64 [siphash]
  metadata_target:                         none
  foreground_target:                       none
  background_target:                       none
  promote_target:                          none
  erasure_code:                            0
  casefold:                                0
  inodes_32bit:                            1
  shard_inode_numbers_bits:                3
  inodes_use_key_cache:                    1
  gc_reserve_percent:                      8
  gc_reserve_bytes:                        0 B
  root_reserve_percent:                    0
  wide_macs:                               0
  promote_whole_extents:                   1
  acl:                                     1
  usrquota:                                0
  grpquota:                                0
  prjquota:                                0
  degraded:                                [ask] yes very no
  journal_flush_delay:                     1000
  journal_flush_disabled:                  0
  journal_reclaim_delay:                   100
  journal_transaction_names:               1
  allocator_stuck_timeout:                 30
  version_upgrade:                         [compatible] incompatible none
  nocow:                                   0
  rebalance_on_ac_only:                    0

errors (size 8):
Device 0:                                  /dev/vdb2       (unknown model)
  Label:                                   (none)
  UUID:                                    2b36a905-92ec-4007-a006-b64096633531
  Size:                                    441 GiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             441 KiB
  First bucket:                            0
  Buckets:                                 1048576
  Last mount:                              Fri Sep 19 07:06:52 2025
  Last superblock write:                   517
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                journal,btree,user
  Btree allocated bitmap blocksize:        16.0 MiB
  Btree allocated bitmap:                  0000000011111111111111111111100111111001111111101100000000011011
  Durability:                              1
  Discard:                                 1
  Freespace initialized:                   1
  Resize on mount:                         0

Here’s the output of bcachefs fs usage -ha before I ran the dd command:

Filesystem: ccd95d13-0ffb-4123-9f77-59bc18232b38
Size:                        405 GiB
Used:                        397 GiB
Online reserved:             512 KiB

Data by durability desired and amount degraded:
          undegraded
1x:          397 GiB
reserved:    253 MiB

Data type      Required/total  Durability    Devices
reserved:      1/1                  [] 253 MiB
btree:         1/1             1             [vdb2]               9.70 GiB
user:          1/1             1             [vdb2]                387 GiB

Btree usage:
extents:            1.59 GiB
inodes:             3.08 GiB
dirents:            1.13 GiB
xattrs:              256 KiB
alloc:               153 MiB
reflink:             201 MiB
subvolumes:          256 KiB
snapshots:           256 KiB
lru:                2.25 MiB
freespace:           512 KiB
need_discard:       2.00 MiB
backpointers:       1.10 GiB
bucket_gens:        2.25 MiB
snapshot_trees:      256 KiB
deleted_inodes:      256 KiB
logged_ops:          256 KiB
accounting:         2.44 GiB

(no label) (device 0):          vdb2              rw    90%
                                data         buckets    fragmented
  free:                     13.3 MiB              31
  sb:                       2.00 MiB               5       151 KiB
  journal:                  3.44 GiB            8192
  btree:                    9.70 GiB           39736      6.99 GiB
  user:                      387 GiB          926403      2.38 GiB
  cached:                        0 B               0
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:             31.2 GiB           74209
  unstriped:                     0 B               0
  capacity:                  441 GiB         1048576
  bucket size:               441 KiB

Here’s the output of bcachefs fs usage -ha after one of the “Allocator stuck?” messages appeared:

Filesystem: ccd95d13-0ffb-4123-9f77-59bc18232b38
Size:                        405 GiB
Used:                        397 GiB
Online reserved:             644 MiB

Data by durability desired and amount degraded:
          undegraded
1x:          397 GiB
reserved:    253 MiB

Data type      Required/total  Durability    Devices
reserved:      1/1                  [] 253 MiB
btree:         1/1             1             [vdb2]               9.70 GiB
user:          1/1             1             [vdb2]                387 GiB

Btree usage:
extents:            1.59 GiB
inodes:             3.08 GiB
dirents:            1.13 GiB
xattrs:              256 KiB
alloc:               153 MiB
reflink:             201 MiB
subvolumes:          256 KiB
snapshots:           256 KiB
lru:                2.25 MiB
freespace:           512 KiB
need_discard:       2.00 MiB
backpointers:       1.10 GiB
bucket_gens:        2.25 MiB
snapshot_trees:      256 KiB
deleted_inodes:      256 KiB
logged_ops:          256 KiB
accounting:         2.44 GiB

(no label) (device 0):          vdb2              rw    90%
                                data         buckets    fragmented
  free:                     12.9 MiB              30
  sb:                       2.00 MiB               5       151 KiB
  journal:                  3.44 GiB            8192
  btree:                    9.70 GiB           39733      6.99 GiB
  user:                      387 GiB          926325      2.34 GiB
  cached:                        0 B               0
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:             31.2 GiB           74291
  unstriped:                     0 B               0
  capacity:                  441 GiB         1048576
  bucket size:               441 KiB
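
As a cross-check of the two usage reports above, the bucket counts can be converted back into sizes. This is only a sketch: the counts and the 441 KiB bucket size are copied from the output, while the helper name `buckets_to_bytes` is made up for illustration. The results should agree with the reported "data" column to within rounding, and they show that almost all unallocated space sits in need_discard buckets rather than in free buckets.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

BUCKET_SIZE = 441 * KIB  # "bucket size" line from the usage output


def buckets_to_bytes(buckets: int) -> int:
    """Convert a bucket count into bytes (hypothetical helper)."""
    return buckets * BUCKET_SIZE


# Bucket counts copied from the "before" and "after" reports.
reports = {
    "before": {"free": 31, "need_discard": 74209},
    "after": {"free": 30, "need_discard": 74291},
}

for label, counts in reports.items():
    free_mib = buckets_to_bytes(counts["free"]) / MIB
    discard_gib = buckets_to_bytes(counts["need_discard"]) / GIB
    print(f"{label}: free {free_mib:.1f} MiB, need_discard {discard_gib:.1f} GiB")
```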

Here’s the output of cat /sys/fs/bcachefs/*/internal/moving_ctxts after one of the “Allocator stuck?” messages appeared:

rebalance_work: data type==user pos=extents:POS_MIN
  keys moved:                  0
  keys raced:                  0
  bytes seen:                  0 B
  bytes moved:                 0 B
  bytes raced:                 0 B
  reads: ios 0/32 sectors 0/2048
  writes: ios 0/32 sectors 0/2048
copygc: data type==user pos=extents:74810271:0:626031552
  keys moved:                  97113
  keys raced:                  253
  bytes seen:                  2.36 TiB
  bytes moved:                 1.35 GiB
  bytes raced:                 3.31 MiB
  reads: ios 0/32 sectors 0/2048
  writes: ios 0/32 sectors 0/2048

Jayman2000 avatar Oct 13 '25 17:10 Jayman2000

Two things of note. First, you are not actually using the DKMS module (1.31): your FS has version 1.28, which matches the in-tree 6.16 kernel version.

Second, your FS has a non-power-of-2 bucket size, which is suboptimal. It would be great if you could provide information on the way it was initially formatted, most importantly the bcachefs-tools version used for formatting. I believe that all issues leading to non-round bucket sizes being chosen on format were long fixed, but maybe you've found another case.
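
For reference, the bucket size can be checked with a few lines of arithmetic. This is a rough sketch that treats the rounded "441 GiB" device size from show-super as exact; the variable and function names are illustrative and do not come from bcachefs-tools.

```python
KIB = 1024
GIB = 1024 ** 3


def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


# Values copied from the show-super output (device size is rounded there).
bucket_size = 441 * KIB
btree_node_size = 256 * KIB
device_size = 441 * GIB
nbuckets = 1048576

print("bucket size is a power of two:", is_power_of_two(bucket_size))
print("bucket count is a power of two:", is_power_of_two(nbuckets))
print("bucket size modulo btree node size (KiB):", (bucket_size % btree_node_size) // KIB)
print("device size / bucket count == bucket size:", device_size // nbuckets == bucket_size)
```

With these inputs the bucket count is exactly 2^20 while the bucket size is not a power of two, and each bucket leaves 185 KiB over after one 256 KiB btree node.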

So, what's going on here is that you have a large amount of space (~7 GB) in fragmented btree usage. Usually copygc would be able to pack metadata (btree) better and free up this space, so we do not account for it as "used". But due to the bad bucket size, the "bucket tails" are actually unusable, and copygc cannot do anything about them.

So the actual bug here is that the filesystem fails to return ENOSPC due to misaccounting of free space with unaligned bucket sizes. It is still an issue that should be fixed, but maybe not a high-priority one.

On the other hand, if you know how to reproduce bcachefs format choosing such a bucket size, that would be a very high-priority issue.

himikof avatar Oct 13 '25 17:10 himikof

> Two things of note. First, you are not actually using the DKMS module (1.31): your FS has version 1.28, which matches the in-tree 6.16 kernel version.

That’s surprising to hear. I thought that there were some situations where the FS version would not match the latest FS version supported by the bcachefs kernel module you were using. I guess that there aren’t any situations where that can happen which is surprising to me. Is there anything that I can do in order to force it to use the DKMS bcachefs module instead of the in-tree bcachefs module?

> Second, your FS has a non-power-of-2 bucket size, which is suboptimal. It would be great if you could provide information on the way it was initially formatted, most importantly the bcachefs-tools version used for formatting. I believe that all issues leading to non-round bucket sizes being chosen on format were long fixed, but maybe you've found another case.

The filesystem was created when I used this Nix flake to do an unattended installation of NixOS on my laptop. I don’t know for sure which revision of that flake I used, but I’m guessing that I used e034966a907a9f97076a36520acf39d2c42980d9. That commit was made at “Fri Jul 11 11:42:54 2025 -0400” which is right before the time that the filesystem was created (“Fri Jul 11 12:35:30 2025”). I don’t have any logs from back when I did that unattended installation, but I was able to do a new unattended installation using revision e034966a907a9f97076a36520acf39d2c42980d9 of that flake. The new unattended installation used Linux version 6.14.11 and bcachefs-tools version 1.25.1.

Here’s a log of what the unattended installer did for disk partitioning and filesystem creation:

umount: /mnt/disko-install-root: not mounted
++ realpath /dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ disk=/dev/nvme0n1
+ lsblk -a -f
NAME        FSTYPE   FSVER LABEL UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
loop0
loop1
loop2
loop3
loop4
loop5
loop6
loop7
sda
├─sda1      vfat     FAT32       AD49-EAA1                             995.8M     3% /boot
└─sda2      bcachefs 1.20        d16fab8a-0000-41ce-bd92-73a7ce153580   16.2G    35% /nix/store
                                                                                     /
nvme0n1
├─nvme0n1p1 vfat     FAT32       2901-C94E
├─nvme0n1p2 bcachefs 1.28        07ba0b33-eb07-422c-ae09-5ec16bbd938c
└─nvme0n1p3 swap     1           51423e66-8a0f-4ab8-b84b-84394e13010b
+ lsblk --output-all --json
+ bash -x
++ dirname /nix/store/fpwn44vygjj6bfn8s1jj9p8yh6jhfxni-disk-deactivate/disk-deactivate
+ jq -r -f /nix/store/fpwn44vygjj6bfn8s1jj9p8yh6jhfxni-disk-deactivate/zfs-swap-deactivate.jq
+ lsblk --output-all --json
+ bash -x
++ dirname /nix/store/fpwn44vygjj6bfn8s1jj9p8yh6jhfxni-disk-deactivate/disk-deactivate
+ jq -r --arg disk_to_clear /dev/nvme0n1 -f /nix/store/fpwn44vygjj6bfn8s1jj9p8yh6jhfxni-disk-deactivate/disk-deactivate.jq
+ set -fu
+ wipefs --all -f /dev/nvme0n1p1
/dev/nvme0n1p1: 8 bytes were erased at offset 0x00000052 (vfat): 46 41 54 33 32 20 20 20
/dev/nvme0n1p1: 1 byte was erased at offset 0x00000000 (vfat): eb
/dev/nvme0n1p1: 2 bytes were erased at offset 0x000001fe (vfat): 55 aa
+ wipefs --all -f /dev/nvme0n1p2
/dev/nvme0n1p2: 16 bytes were erased at offset 0x00001018 (bcachefs): c6 85 73 f6 66 ce 90 a9 d9 6a 60 cf 80 3d f7 ef
/dev/nvme0n1p2: 16 bytes were erased at offset 0x6e30a00018 (bcachefs): c6 85 73 f6 66 ce 90 a9 d9 6a 60 cf 80 3d f7 ef
+ swapoff /dev/nvme0n1p3
swapoff: /dev/nvme0n1p3: swapoff failed: Invalid argument
+ wipefs --all -f /dev/nvme0n1p3
/dev/nvme0n1p3: 10 bytes were erased at offset 0x00000ff6 (swap): 53 57 41 50 53 50 41 43 45 32
++ type zdb
++ zdb -l /dev/nvme0n1
++ sed -nr 's/ +name: '\''(.*)'\''/\1/p'
+ zpool=
+ [[ -n '' ]]
+ unset zpool
++ lsblk /dev/nvme0n1 -l -p -o type,name
++ awk 'match($1,"raid.*") {print $2}'
+ md_dev=
+ [[ -n '' ]]
+ wipefs --all -f /dev/nvme0n1
/dev/nvme0n1: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
/dev/nvme0n1: 8 bytes were erased at offset 0x7470c05e00 (gpt): 45 46 49 20 50 41 52 54
/dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
+ dd if=/dev/zero of=/dev/nvme0n1 bs=440 count=1
1+0 records in
1+0 records out
440 bytes copied, 0.000212851 s, 2.1 MB/s
+ lsblk -a -f
NAME    FSTYPE   FSVER LABEL UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
loop0
loop1
loop2
loop3
loop4
loop5
loop6
loop7
sda
├─sda1  vfat     FAT32       AD49-EAA1                             995.8M     3% /boot
└─sda2  bcachefs 1.20        d16fab8a-0000-41ce-bd92-73a7ce153580   16.2G    35% /nix/store
                                                                                 /
nvme0n1
++ mktemp -d
+ disko_devices_dir=/tmp/tmp.JcpTlbs8vt
+ trap 'rm -rf "$disko_devices_dir"' EXIT
+ mkdir -p /tmp/tmp.JcpTlbs8vt
+ destroy=1
+ device=/dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ imageName=main
+ imageSize=2G
+ name=main
+ type=disk
+ device=/dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ efiGptPartitionFirst=1
+ type=gpt
+ blkid /dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ sgdisk --clear /dev/disk/by-path/pci-0000:02:00.0-nvme-1
 nvme0n1:
Creating new GPT entries in memory.
The operation has completed successfully.
 nvme0n1:
+ sgdisk --align-end --new=1:0:+1G --partition-guid=1:R --change-name=1:disk-main-efiSystemPartiton --typecode=1:C12A7328-F81F-11D2-BA4B-00A0C93EC93B /dev/disk/by-path/pci-0000:02:00.0-nvme-1
The operation has completed successfully.
 nvme0n1: p1
+ partprobe /dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ udevadm trigger --subsystem-match=block
+ udevadm settle --timeout 120
+ sgdisk --align-end --new=2:0:-24G --partition-guid=2:R --change-name=2:disk-main-nixosRoot --typecode=2:4F68BCE3-E8CD-4DB1-96E7-FBCAF984B709 /dev/disk/by-path/pci-0000:02:00.0-nvme-1
 nvme0n1: p1 p2
The operation has completed successfully.
 nvme0n1: p1 p2
+ partprobe /dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ udevadm trigger --subsystem-match=block
+ udevadm settle --timeout 120
+ sgdisk --align-end --new=3:0:-0 --partition-guid=3:R --change-name=3:disk-main-nixosSwap --typecode=3:0657fd6d-a4ab-43c4-84e5-0933c84b4f4f /dev/disk/by-path/pci-0000:02:00.0-nvme-1
 nvme0n1: p1 p2 p3
The operation has completed successfully.
 nvme0n1: p1 p2 p3
+ partprobe /dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ udevadm trigger --subsystem-match=block
+ udevadm settle --timeout 120
+ device=/dev/disk/by-partlabel/disk-main-efiSystemPartiton
+ extraArgs=()
+ declare -a extraArgs
+ format=vfat
+ mountOptions=('umask=0077')
+ declare -a mountOptions
+ mountpoint=/boot
+ type=filesystem
+ blkid /dev/disk/by-partlabel/disk-main-efiSystemPartiton
+ grep -q TYPE=
+ mkfs.vfat /dev/disk/by-partlabel/disk-main-efiSystemPartiton
mkfs.fat 4.2 (2021-01-31)
+ device=/dev/disk/by-partlabel/disk-main-nixosRoot
+ extraArgs=()
+ declare -a extraArgs
+ format=bcachefs
+ mountOptions=('defaults')
+ declare -a mountOptions
+ mountpoint=/
+ type=filesystem
+ blkid /dev/disk/by-partlabel/disk-main-nixosRoot
+ grep -q TYPE=
+ mkfs.bcachefs /dev/disk/by-partlabel/disk-main-nixosRoot
External UUID:                             5ebf454f-1b1c-4c2c-a6c9-feea70714593
Internal UUID:                             8f1421a9-dc1a-4301-8edd-9957ed7ceac3
Magic number:                              c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index:                              0
Label:                                     (none)
Version:                                   1.20: directory_size
Incompatible features allowed:             1.20: directory_size
Incompatible features in use:              0.0: (unknown version)
Version upgrade complete:                  0.0: (unknown version)
Oldest version on disk:                    1.20: directory_size
Created:                                   Tue Oct 14 16:34:57 2025
Sequence number:                           0
Time of last write:                        Thu Jan  1 00:00:00 1970
Superblock size:                           976 B/1.00 MiB
Clean:                                     0
Devices:                                   1
Sections:                                  members_v1,members_v2
Features:
Compat features:
Options:
  block_size:                              512 B
  btree_node_size:                         256 KiB
  errors:                                  continue [fix_safe] panic ro
  write_error_timeout:                     30
  metadata_replicas:                       1
  data_replicas:                           1
  metadata_replicas_required:              1
  data_replicas_required:                  1
  encoded_extent_max:                      64.0 KiB
  metadata_checksum:                       none [crc32c] crc64 xxhash
  data_checksum:                           none [crc32c] crc64 xxhash
  checksum_err_retry_nr:                   3
  compression:                             none
  background_compression:                  none
  str_hash:                                crc32c crc64 [siphash]
  metadata_target:                         none
  foreground_target:                       none
  background_target:                       none
  promote_target:                          none
  erasure_code:                            0
  inodes_32bit:                            1
  shard_inode_numbers_bits:                0
  inodes_use_key_cache:                    1
  gc_reserve_percent:                      8
  gc_reserve_bytes:                        0 B
  root_reserve_percent:                    0
  wide_macs:                               0
  promote_whole_extents:                   1
  acl:                                     1
  usrquota:                                0
  grpquota:                                0
  prjquota:                                0
  journal_flush_delay:                     1000
  journal_flush_disabled:                  0
  journal_reclaim_delay:                   100
  journal_transaction_names:               1
  allocator_stuck_timeout:                 30
  version_upgrade:                         [compatible] incompatible none
  nocow:                                   0
members_v2 (size 160):
Device:                                    0
  Label:                                   (none)
  UUID:                                    35e06f43-c9ed-4be8-bf86-7f33ec28003f
  Size:                                    441 GiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             441 KiB
  First bucket:                            0
  Buckets:                                 1048576
  Last mount:                              (never)
  Last superblock write:                   0
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                (none)
  Btree allocated bitmap blocksize:        1.00 B
bcachefs (nvme0n1p2): starting version 1.20: directory_size
bcachefs (nvme0n1p2): initializing new filesystem
  Btree allocated bitmap:                  0000000000000000000000000000000000000000000000000000000000000000
  Durability:                              1
  Discard:                                 1
  Freespace initialized:                   0
+ device=/dev/disk/by-partlabel/disk-main-nixosSwap
+ discardPolicy=
+ extraArgs=()
+ declare -a extraArgs
+ mountOptions=('defaults')
+ declare -a mountOptions
+ priority=
+ randomEncryption=
+ resumeDevice=
+ type=swap
+ blkid /dev/disk/by-partlabel/disk-main-nixosSwap -o export
+ grep -q '^TYPE='
+ mkswap /dev/disk/by-partlabel/disk-main-nixosSwap
Setting up swapspace version 1, size = 24 GiB (25769799680 bytes)
no label, UUID=d5c89a48-3edc-4b97-950b-40c483e03206
+ set -efux
+ destroy=1
+ device=/dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ imageName=main
+ imageSize=2G
+ name=main
+ type=disk
bcachefs (nvme0n1p2): going read-write
+ device=/dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ efiGptPartitionFirst=1
+ type=gpt
+ destroy=1
+ device=/dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ imageName=main
+ imageSize=2G
+ name=main
+ type=disk
+ device=/dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ efiGptPartitionFirst=1
+ type=gpt
+ device=/dev/disk/by-partlabel/disk-main-nixosRoot
+ extraArgs=()
+ declare -a extraArgs
+ format=bcachefs
+ mountOptions=('defaults')
+ declare -a mountOptions
+ mountpoint=/
+ type=filesystem
+ findmnt /dev/disk/by-partlabel/disk-main-nixosRoot /mnt/disko-install-root/
+ mount /dev/disk/by-partlabel/disk-main-nixosRoot /mnt/disko-install-root/ -t bcachefs -o defaults -o X-mount.mkdir
bcachefs (nvme0n1p2): initializing freespace
+ destroy=1
+ device=/dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ imageName=main
+ imageSize=2G
+ name=main
+ type=disk
+ device=/dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ efiGptPartitionFirst=1
+ type=gpt
+ device=/dev/disk/by-partlabel/disk-main-efiSystemPartiton
+ extraArgs=()
+ declare -a extraArgs
+ format=vfat
+ mountOptions=('umask=0077')
+ declare -a mountOptions
+ mountpoint=/boot
+ type=filesystem
+ findmnt /dev/disk/by-partlabel/disk-main-efiSystemPartiton /mnt/disko-install-root/boot
+ mount /dev/disk/by-partlabel/disk-main-efiSystemPartiton /mnt/disko-install-root/boot -t vfat -o umask=0077 -o X-mount.mkdir
+ destroy=1
+ device=/dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ imageName=main
+ imageSize=2G
+ name=main
+ type=disk
+ device=/dev/disk/by-path/pci-0000:02:00.0-nvme-1
+ efiGptPartitionFirst=1
+ type=gpt
+ device=/dev/disk/by-partlabel/disk-main-nixosSwap
+ discardPolicy=
+ extraArgs=()
+ declare -a extraArgs
+ mountOptions=('defaults')
+ declare -a mountOptions
+ priority=
+ randomEncryption=
+ resumeDevice=
+ type=swap
+ test 1 '!=' 1
+ rm -rf /tmp/tmp.JcpTlbs8vt

Jayman2000 avatar Oct 14 '25 17:10 Jayman2000

> Is there anything that I can do in order to force it to use the DKMS bcachefs module instead of the in-tree bcachefs module?

I'm not a NixOS expert, but I've heard that with recent packaging changes it picks the module from tools automatically. You can ask for help in the IRC channel if needed; the NixOS bcachefs maintainers are generally there.

> The new unattended installation used Linux version 6.14.11 and bcachefs-tools version 1.25.1.

Hooray for deterministic configuration, this clarifies the origin of this issue: the format picking bad bucket sizes was fixed in bcachefs-tools 1.25.2 (released in April), so unfortunately you missed that fix while installing.

In case you want to change the bucket size on your FS to a proper one, you have two options. You can reformat the filesystem with recent tools and copy the data over manually. Alternatively, you can add another temporary device, evacuate and remove the old one, wipe the old device, add it back (it will pick a proper bucket size this time), and finally evacuate and remove the temporary device.

himikof avatar Oct 14 '25 17:10 himikof

That is a lot of btree fragmentation

Would you be able to get me a metadata dump?

koverstreet avatar Oct 24 '25 05:10 koverstreet

> Would you be able to get me a metadata dump?

How do I do that?

Jayman2000 avatar Oct 24 '25 21:10 Jayman2000

@koverstreet Why do you think btree fragmentation is too high?

The FS is using 441 KiB buckets, so with 256 KiB btree node size and 39733 buckets used we get (441-256)*39733/1024/1024 ~= 7 GiB of btree fragmentation from unusable bucket tails, which approximately matches the number from fs usage.
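
The same estimate as a small sketch, assuming (as the comment above does) that each btree bucket holds only whole btree nodes, so the remainder of the bucket is lost. The function name `tail_waste_bytes` is hypothetical; the inputs are taken from the fs usage and show-super output.

```python
KIB = 1024
GIB = 1024 ** 3


def tail_waste_bytes(bucket_size: int, node_size: int, buckets: int) -> int:
    """Bytes left over in `buckets` buckets once whole btree nodes are packed in."""
    unusable_tail = bucket_size % node_size  # 441 KiB % 256 KiB = 185 KiB
    return unusable_tail * buckets


# 441 KiB buckets, 256 KiB btree nodes, 39733 btree buckets (from fs usage).
waste = tail_waste_bytes(441 * KIB, 256 * KIB, 39733)
print(f"{waste / GIB:.2f} GiB of unusable bucket tails")
```

This comes out to about 7.01 GiB, close to the 6.99 GiB reported in the "fragmented" column for btree.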

himikof avatar Oct 26 '25 01:10 himikof

Good catch - we should be aligning the bucket size to btree node size and aiming for powers of two; I'm curious if this fs was created when bcachefs-tools was buggy and picking misaligned sizes
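
To illustrate what "aligned" means here, the following sketch rounds a candidate bucket size to the neighbouring powers of two and checks the leftover tail against the btree node size. It is not the logic used by bcachefs-tools; `aligned_candidates` and the rounding helpers are hypothetical, and the sketch assumes the btree node size is itself a power of two.

```python
KIB = 1024


def round_down_pow2(n: int) -> int:
    """Largest power of two <= n (n must be positive)."""
    return 1 << (n.bit_length() - 1)


def round_up_pow2(n: int) -> int:
    """Smallest power of two >= n (n must be positive)."""
    return 1 << (n - 1).bit_length()


def aligned_candidates(bucket_size: int, node_size: int) -> tuple[int, int]:
    """Nearest power-of-two bucket sizes that are at least one btree node."""
    lower = max(node_size, round_down_pow2(bucket_size))
    upper = max(node_size, round_up_pow2(bucket_size))
    return lower, upper


node_size = 256 * KIB
lower, upper = aligned_candidates(441 * KIB, node_size)
for size in (lower, upper):
    print(f"{size // KIB} KiB bucket, tail {(size % node_size) // KIB} KiB")
```

For the 441 KiB bucket size in this issue, the two candidates are 256 KiB and 512 KiB, and both leave no tail after packing 256 KiB nodes.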

koverstreet avatar Oct 26 '25 02:10 koverstreet

> I'm curious if this fs was created when bcachefs-tools was buggy and picking misaligned sizes

Yes, as I've said above:

> Hooray for deterministic configuration, this clarifies the origin of this issue: the format picking bad bucket sizes was fixed in bcachefs-tools 1.25.2 (released in April), so unfortunately you missed that fix while installing.

The FS was installed with bcachefs-tools version 1.25.1.

The actual bug here is that the filesystem fails to return ENOSPC (and gets stuck in the allocator) due to misaccounting of free space with unaligned bucket sizes. Usually copygc would be able to pack things better and free up this space, so we do not account for it as "used". But due to the bad bucket size, the "bucket tails" are actually unusable, and copygc cannot do anything about them.

himikof avatar Oct 26 '25 03:10 himikof

Ok, sorry for not reading enough - this is an old known issue that normally only affects filesystems with pathologically mismatched device sizes (two-device filesystem, mismatched device sizes, replicas=2).

@Jayman2000 - you might want to recreate your filesystem if you can, I will bump up the priority on this one but it'll still be a bit before I get to it.

koverstreet avatar Oct 26 '25 14:10 koverstreet