btrfs-progs

Feature request: Add more balancing filters

Open Forza-tng opened this issue 5 months ago • 1 comment

I think it would be useful to have additional balance filters for more fine-grained control over which chunks should be balanced.

My use-case is an unbalanced conversion from RAID1 to RAID10:

Data,RAID10: Size:55031.33GiB, Used:54990.07GiB (99.93%)
   /dev/sdg1    10404.27GiB
   /dev/sdc1    10403.27GiB
   /dev/sdk1    10404.27GiB
   /dev/sdp1    10404.27GiB
   /dev/sdx1    10403.27GiB
   /dev/sdad1   10404.27GiB
   /dev/sdab1   10404.27GiB
   /dev/sdz1    10403.27GiB
   /dev/sdw1    10403.27GiB
   /dev/sdy1    10404.27GiB
   /dev/sda1    6024.00GiB

Case 1

Using btrfs inspect-internal dump-tree --device we can see, for each chunk item, which stripes are on which devices. For example, the chunk item at 178421878161408 (item 8 below) does not have a stripe on devid 27, which is /dev/sda1.

        item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 178416509452288) itemoff 14683 itemsize 368
                length 5368709120 owner 2 stripe_len 65536 type DATA|RAID10
                io_align 65536 io_width 65536 sector_size 4096
                num_stripes 10 sub_stripes 2
                        stripe 0 devid 15 offset 10659036135424
                        dev_uuid ac4afb76-8475-4af4-9bf4-5b8ce0e912dd
                        stripe 1 devid 16 offset 10659036135424
                        dev_uuid c52a732e-73e9-460d-b075-61584b44a617
                        stripe 2 devid 18 offset 9819370029056
                        dev_uuid b586966c-9b7a-42db-843c-1a6d0f420a18
                        stripe 3 devid 17 offset 9829033705472
                        dev_uuid 221850f8-2585-44fc-a6ac-7dc7e51b6f3e
                        stripe 4 devid 14 offset 9835711037440
                        dev_uuid d501f99f-7f76-4dcd-a6b2-efa64f4a8bde
                        stripe 5 devid 13 offset 9818329841664
                        dev_uuid c15a9ec1-4c99-4bcd-b272-37d9b5c58de1
                        stripe 6 devid 11 offset 9829033705472
                        dev_uuid 56b79aac-0892-4f34-a610-ddc7706bc6d1
                        stripe 7 devid 20 offset 9833777463296
                        dev_uuid 64d099f1-46f6-42fa-b89d-1544c92ab7b9
                        stripe 8 devid 19 offset 9828408754176
                        dev_uuid b258e15c-988a-418f-8eb8-c3978d5434b5
                        stripe 9 devid 27 offset 10711649484800
                        dev_uuid d1fe6aa3-b1bf-4c4d-aff7-c88f3d3b5970
        item 8 key (FIRST_CHUNK_TREE CHUNK_ITEM 178421878161408) itemoff 14315 itemsize 368
                length 5368709120 owner 2 stripe_len 65536 type DATA|RAID10
                io_align 65536 io_width 65536 sector_size 4096
                num_stripes 10 sub_stripes 2
                        stripe 0 devid 14 offset 9837858521088
                        dev_uuid d501f99f-7f76-4dcd-a6b2-efa64f4a8bde
                        stripe 1 devid 15 offset 10661183619072
                        dev_uuid ac4afb76-8475-4af4-9bf4-5b8ce0e912dd
                        stripe 2 devid 16 offset 10653667426304
                        dev_uuid c52a732e-73e9-460d-b075-61584b44a617
                        stripe 3 devid 18 offset 9830107447296
                        dev_uuid b586966c-9b7a-42db-843c-1a6d0f420a18
                        stripe 4 devid 12 offset 9833362227200
                        dev_uuid baa68f83-b3c4-4d39-a3ab-9ee6d2c62b41
                        stripe 5 devid 17 offset 9831181189120
                        dev_uuid 221850f8-2585-44fc-a6ac-7dc7e51b6f3e
                        stripe 6 devid 20 offset 9825187528704
                        dev_uuid 64d099f1-46f6-42fa-b89d-1544c92ab7b9
                        stripe 7 devid 13 offset 9829067259904
                        dev_uuid c15a9ec1-4c99-4bcd-b272-37d9b5c58de1
                        stripe 8 devid 11 offset 9831181189120
                        dev_uuid 56b79aac-0892-4f34-a610-ddc7706bc6d1
                        stripe 9 devid 19 offset 9830556237824
                        dev_uuid b258e15c-988a-418f-8eb8-c3978d5434b5

Case 2

I also discovered some chunk items with a different size than I would have expected. Here the length is 1090519040 bytes = 1040 MiB instead of 5 GiB, so each stripe is only 208 MiB (1040/5, since num_stripes 10 with sub_stripes 2 gives 5 stripe pairs). This will become a problem as the filesystem fills up, creating fragmentation at the chunk level that can be difficult or slow to get rid of.

        item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 186116328849408) itemoff 14443 itemsize 368
                length 1090519040 owner 2 stripe_len 65536 type DATA|RAID10
                io_align 65536 io_width 65536 sector_size 4096
                num_stripes 10 sub_stripes 2
                        stripe 0 devid 12 offset 2965927624704
                        dev_uuid baa68f83-b3c4-4d39-a3ab-9ee6d2c62b41
                        stripe 1 devid 19 offset 266507124736
                        dev_uuid b258e15c-988a-418f-8eb8-c3978d5434b5
                        stripe 2 devid 17 offset 2842195656704
                        dev_uuid 221850f8-2585-44fc-a6ac-7dc7e51b6f3e
                        stripe 3 devid 20 offset 267580866560
                        dev_uuid 64d099f1-46f6-42fa-b89d-1544c92ab7b9
                        stripe 4 devid 18 offset 2911133237248
                        dev_uuid b586966c-9b7a-42db-843c-1a6d0f420a18
                        stripe 5 devid 16 offset 2913280720896
                        dev_uuid c52a732e-73e9-460d-b075-61584b44a617
                        stripe 6 devid 11 offset 2958159773696
                        dev_uuid 56b79aac-0892-4f34-a610-ddc7706bc6d1
                        stripe 7 devid 15 offset 2914354462720
                        dev_uuid ac4afb76-8475-4af4-9bf4-5b8ce0e912dd
                        stripe 8 devid 13 offset 2956045844480
                        dev_uuid c15a9ec1-4c99-4bcd-b272-37d9b5c58de1
                        stripe 9 devid 14 offset 2969350176768
                        dev_uuid d501f99f-7f76-4dcd-a6b2-efa64f4a8bde

Balance Filters

  • For case 1, I think an exclusion filter would be very useful. We could simply exclude devid 27 with the devid filter:
btrfs balance start -ddevid=!27
  • For case 2, a filter for chunk size would be useful, possibly combined with an exclusion mask:
btrfs balance start -dchunk_len=1090519040
btrfs balance start -dchunk_len=!5G

Forza-tng avatar Jun 03 '25 14:06 Forza-tng

For case 1, an exclusion filter is exactly right, but it would be better placed in python-btrfs. The kernel interface is a little clunky for specifications like "include devices 10, 12, 14" or "exclude devices 3 and 5". Full generality requires bitmaps, variable-length data structures, or string parsing in the kernel, and I don't think anyone wants those. Device selection lists are utterly trivial to implement in python one block group at a time, and the end result is identical to, or better than, running balance in the kernel.
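
For illustration, a rough sketch of that per-block-group approach might look like this (assuming python-btrfs exposes FileSystem, chunks(), BalanceArgs and balance_v2 as in its bundled examples; the mount point and devid are just the values from this issue):

import btrfs

MOUNTPOINT = '/fs'   # assumed mount point
WANTED_DEVID = 27    # the under-filled device from case 1

fs = btrfs.FileSystem(MOUNTPOINT)
for chunk in fs.chunks():
    # Only look at data chunks.
    if not chunk.type & btrfs.ctree.BLOCK_GROUP_DATA:
        continue
    # Skip chunks that already have a stripe on the wanted device.
    if any(stripe.devid == WANTED_DEVID for stripe in chunk.stripes):
        continue
    # Relocate just this block group; the allocator picks devices again
    # when it is rewritten, so the emptiest device gets preferred.
    args = btrfs.ioctl.BalanceArgs(vstart=chunk.vaddr, vend=chunk.vaddr + 1)
    btrfs.ioctl.balance_v2(fs.fd, data_args=args)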

For case 2, simply removing badly-sized block groups isn't sufficient--the holes they leave behind will also be badly-sized, so a later chunk allocation will simply fill them in again. With striped profiles, this leads to btrfs slicing block groups smaller and smaller until there are hundreds of thousands of them, with a length of 1M each--then the allocator starts to fail.

For case 2, the solution has 4 parts:

Part 1: set device sizes

The first step is resizing every device to an integer number of GiB plus 1 MiB:

resizeDev () {
    dev="$1"        # block device node, e.g. /dev/sdg1
    shift
    btrfsDev="$1"   # btrfs devid of that device
    shift

    devSize=$(blockdev --getsize64 "$dev")
    # Round the device size down to a whole number of GiB...
    devRoundedGiB=$((devSize / 1024 / 1024 / 1024))
    # ...then add 1 MiB for the reserved area at the start of the device.
    devMiB=$((devRoundedGiB * 1024 + 1))M
    btrfs fi resize "$btrfsDev:$devMiB" /fs
}

resizeDev /dev/sdg1 11
resizeDev /dev/sdc1 12
...etc...

btrfs reserves the first 1 MiB of each device for superblocks and OS usage. This space is counted in the device size and in dev extent offsets, but btrfs will never allocate there, so we have to add 1 MiB to every offset, including the device size. With the size set to N GiB + 1 MiB, the usable area after the reserved first MiB is exactly N GiB and can be tiled completely by 1 GiB dev extents starting at k GiB + 1 MiB.

Part 2: set system chunk size

echo $((1024*1024*1024)) | tee /sys/fs/btrfs/*/allocation/system/chunk_size

Note: you need a kernel patch to allow this:

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 14f53f757555..8bdcb0c90ae3 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -753,8 +753,8 @@ static ssize_t btrfs_chunk_size_show(struct kobject *kobj,
  * Store new chunk size in space info. Can be called on a read-only filesystem.
  *
  * If the new chunk size value is larger than 10% of free space it is reduced
- * to match that limit. Alignment must be to 256M and the system chunk size
- * cannot be set.
+ * to match that limit. Alignment must be to 256M for data and metadata chunks,
+ * 32M for system chunks.
  */
 static ssize_t btrfs_chunk_size_store(struct kobject *kobj,
                                      struct kobj_attribute *a,
@@ -764,6 +764,7 @@ static ssize_t btrfs_chunk_size_store(struct kobject *kobj,
        struct btrfs_fs_info *fs_info = to_fs_info(get_btrfs_kobj(kobj));
        char *retptr;
        u64 val;
+       u64 min_size;
 
        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;
@@ -774,9 +775,12 @@ static ssize_t btrfs_chunk_size_store(struct kobject *kobj,
        if (btrfs_is_zoned(fs_info))
                return -EINVAL;
 
-       /* System block type must not be changed. */
-       if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
-               return -EPERM;
+       /* System block type must not be smaller than minimum. */
+       if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
+               min_size = SZ_32M;
+       } else {
+               min_size = SZ_256M;
+       }
 
        val = memparse(buf, &retptr);
        /* There could be trailing '\n', also catch any typos after the value */
@@ -789,11 +793,11 @@ static ssize_t btrfs_chunk_size_store(struct kobject *kobj,
        /* Limit stripe size to 10% of available space. */
        val = min(mult_perc(fs_info->fs_devices->total_rw_bytes, 10), val);
 
-       /* Must be multiple of 256M. */
-       val &= ~((u64)SZ_256M - 1);
+       /* Must be multiple of min size. */
+       val &= ~((u64)min_size - 1);
 
-       /* Must be at least 256M. */
-       if (val < SZ_256M)
+       /* Must be at least 256M for data, 32M for system. */
+       if (val < min_size)
                return -EINVAL;
 
        btrfs_update_space_info_chunk_size(space_info, val);

This patch removes the restriction on system chunk sizes, so they can be 1 GiB like every other chunk.

The patch isn't necessary if you are already using dedicated SSD devices for metadata. Since the system chunk is metadata, it will not affect the alignment of any striped data chunk, as data chunks will be on different devices.

After applying parts 1 and 2, it becomes possible to fully allocate all devices without creating misaligned or badly sized block groups. Parts 3 and 4 eliminate existing misaligned and badly sized block groups.

Part 3: balance misaligned block groups

for offset in $(seq 0 $devRoundedGiB); do
    for devid in $(...list of devices...); do
        doffset=$(( (offset * 1024 + 1) * 1024 * 1024))
        btrfs balance start -ddevid=$devid,doffset=$((doffset + 1))..$((doffset + 1024 * 1024 * 1024 - 1)) /fs
    done
done

Note that the doffset range intentionally excludes any block group whose dev extent starts at exactly (N * 1024 + 1) MiB, so block groups that are already aligned on all devices are not touched.

Note that the filesystem needs a few percent of unallocated space due to limitations of the allocator used by balance; otherwise balance will just push data around the devices without fixing anything. When there is sufficient unallocated space, btrfs will move data toward the end of the devices. After a while, the largest contiguous free space will be at the beginning of each device, so btrfs will allocate new chunks there--hopefully fully aligned. This process does require some supervision.

Part 4: balance block groups with low stripe count or low dev_extent length

This requires the length filter proposed above; the existing stripes filter can also be used. Both of these can be implemented in python-btrfs instead and combined with the loop in part 3, as in the sketch below. Indeed this is all far easier to implement in python than with btrfs-progs, because python doesn't require mapping back and forth between device filenames and btrfs devids all the time.
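
A rough sketch of that combined selection (same python-btrfs interface assumptions as above; the thresholds are illustrative values for this particular filesystem, not general ones):

import btrfs

MOUNTPOINT = '/fs'           # assumed mount point
FULL_LENGTH = 5 * 1024**3    # data chunks are expected to be 5 GiB here
FULL_STRIPES = 10            # expected RAID10 stripe count on this filesystem
SZ_1M = 1024**2
SZ_1G = 1024**3

def needs_balance(chunk):
    # Too short, too few stripes, or any dev extent not starting at k GiB + 1 MiB.
    if chunk.length < FULL_LENGTH or chunk.num_stripes < FULL_STRIPES:
        return True
    return any((stripe.offset - SZ_1M) % SZ_1G != 0 for stripe in chunk.stripes)

fs = btrfs.FileSystem(MOUNTPOINT)
for chunk in fs.chunks():
    if not chunk.type & btrfs.ctree.BLOCK_GROUP_DATA:
        continue
    if needs_balance(chunk):
        args = btrfs.ioctl.BalanceArgs(vstart=chunk.vaddr, vend=chunk.vaddr + 1)
        btrfs.ioctl.balance_v2(fs.fd, data_args=args)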

Zygo avatar Jun 03 '25 15:06 Zygo