btrfs-progs
Feature request: Add more balancing filters
I think it would be useful to have additional balance filters, for more fine-grained control over which chunks are balanced.
My use-case is an unbalanced conversion from RAID1 to RAID10:
Data,RAID10: Size:55031.33GiB, Used:54990.07GiB (99.93%)
/dev/sdg1 10404.27GiB
/dev/sdc1 10403.27GiB
/dev/sdk1 10404.27GiB
/dev/sdp1 10404.27GiB
/dev/sdx1 10403.27GiB
/dev/sdad1 10404.27GiB
/dev/sdab1 10404.27GiB
/dev/sdz1 10403.27GiB
/dev/sdw1 10403.27GiB
/dev/sdy1 10404.27GiB
/dev/sda1 6024.00GiB
Case 1
Using btrfs inspect-internal dump-tree on one of the devices, we can see for each chunk item which stripe is on which device. For example, the chunk item at 178421878161408 does not have a stripe on devid 27, which is /dev/sda1.
item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 178416509452288) itemoff 14683 itemsize 368
length 5368709120 owner 2 stripe_len 65536 type DATA|RAID10
io_align 65536 io_width 65536 sector_size 4096
num_stripes 10 sub_stripes 2
stripe 0 devid 15 offset 10659036135424
dev_uuid ac4afb76-8475-4af4-9bf4-5b8ce0e912dd
stripe 1 devid 16 offset 10659036135424
dev_uuid c52a732e-73e9-460d-b075-61584b44a617
stripe 2 devid 18 offset 9819370029056
dev_uuid b586966c-9b7a-42db-843c-1a6d0f420a18
stripe 3 devid 17 offset 9829033705472
dev_uuid 221850f8-2585-44fc-a6ac-7dc7e51b6f3e
stripe 4 devid 14 offset 9835711037440
dev_uuid d501f99f-7f76-4dcd-a6b2-efa64f4a8bde
stripe 5 devid 13 offset 9818329841664
dev_uuid c15a9ec1-4c99-4bcd-b272-37d9b5c58de1
stripe 6 devid 11 offset 9829033705472
dev_uuid 56b79aac-0892-4f34-a610-ddc7706bc6d1
stripe 7 devid 20 offset 9833777463296
dev_uuid 64d099f1-46f6-42fa-b89d-1544c92ab7b9
stripe 8 devid 19 offset 9828408754176
dev_uuid b258e15c-988a-418f-8eb8-c3978d5434b5
stripe 9 devid 27 offset 10711649484800
dev_uuid d1fe6aa3-b1bf-4c4d-aff7-c88f3d3b5970
item 8 key (FIRST_CHUNK_TREE CHUNK_ITEM 178421878161408) itemoff 14315 itemsize 368
length 5368709120 owner 2 stripe_len 65536 type DATA|RAID10
io_align 65536 io_width 65536 sector_size 4096
num_stripes 10 sub_stripes 2
stripe 0 devid 14 offset 9837858521088
dev_uuid d501f99f-7f76-4dcd-a6b2-efa64f4a8bde
stripe 1 devid 15 offset 10661183619072
dev_uuid ac4afb76-8475-4af4-9bf4-5b8ce0e912dd
stripe 2 devid 16 offset 10653667426304
dev_uuid c52a732e-73e9-460d-b075-61584b44a617
stripe 3 devid 18 offset 9830107447296
dev_uuid b586966c-9b7a-42db-843c-1a6d0f420a18
stripe 4 devid 12 offset 9833362227200
dev_uuid baa68f83-b3c4-4d39-a3ab-9ee6d2c62b41
stripe 5 devid 17 offset 9831181189120
dev_uuid 221850f8-2585-44fc-a6ac-7dc7e51b6f3e
stripe 6 devid 20 offset 9825187528704
dev_uuid 64d099f1-46f6-42fa-b89d-1544c92ab7b9
stripe 7 devid 13 offset 9829067259904
dev_uuid c15a9ec1-4c99-4bcd-b272-37d9b5c58de1
stripe 8 devid 11 offset 9831181189120
dev_uuid 56b79aac-0892-4f34-a610-ddc7706bc6d1
stripe 9 devid 19 offset 9830556237824
dev_uuid b258e15c-988a-418f-8eb8-c3978d5434b5
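Until such a filter exists, chunks that lack a stripe on a given device can be found by parsing dump-tree output. A rough sketch (the function name is made up, and the awk script assumes the output format shown above):

```shell
# Print the offsets of CHUNK_ITEMs that have no stripe on the given devid,
# reading `btrfs inspect-internal dump-tree` output on stdin.
find_chunks_missing_devid() {
    awk -v want="$1" '
        /CHUNK_ITEM/ {
            # finish the previous chunk before starting a new one
            if (chunk != "" && !found) print chunk
            match($0, /CHUNK_ITEM [0-9]+/)
            chunk = substr($0, RSTART + 11, RLENGTH - 11)
            found = 0
        }
        /stripe [0-9]+ devid/ { if ($4 == want) found = 1 }
        END { if (chunk != "" && !found) print chunk }
    '
}
```

Running something like btrfs inspect-internal dump-tree -t chunk /dev/sdg1 | find_chunks_missing_devid 27 would then list every chunk with no stripe on devid 27.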
Case 2
I also discovered some chunk items with a different size than I would have expected. Here we see 1040 MiB instead of 5 GiB, which means each device's stripe is 208 MiB (1040/5). This becomes a problem as the filesystem fills up, creating fragmentation at the chunk level that can be difficult and slow to resolve.
item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 186116328849408) itemoff 14443 itemsize 368
length 1090519040 owner 2 stripe_len 65536 type DATA|RAID10
io_align 65536 io_width 65536 sector_size 4096
num_stripes 10 sub_stripes 2
stripe 0 devid 12 offset 2965927624704
dev_uuid baa68f83-b3c4-4d39-a3ab-9ee6d2c62b41
stripe 1 devid 19 offset 266507124736
dev_uuid b258e15c-988a-418f-8eb8-c3978d5434b5
stripe 2 devid 17 offset 2842195656704
dev_uuid 221850f8-2585-44fc-a6ac-7dc7e51b6f3e
stripe 3 devid 20 offset 267580866560
dev_uuid 64d099f1-46f6-42fa-b89d-1544c92ab7b9
stripe 4 devid 18 offset 2911133237248
dev_uuid b586966c-9b7a-42db-843c-1a6d0f420a18
stripe 5 devid 16 offset 2913280720896
dev_uuid c52a732e-73e9-460d-b075-61584b44a617
stripe 6 devid 11 offset 2958159773696
dev_uuid 56b79aac-0892-4f34-a610-ddc7706bc6d1
stripe 7 devid 15 offset 2914354462720
dev_uuid ac4afb76-8475-4af4-9bf4-5b8ce0e912dd
stripe 8 devid 13 offset 2956045844480
dev_uuid c15a9ec1-4c99-4bcd-b272-37d9b5c58de1
stripe 9 devid 14 offset 2969350176768
dev_uuid d501f99f-7f76-4dcd-a6b2-efa64f4a8bde
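To double-check the arithmetic: the chunk above is 1090519040 bytes, and RAID10 with num_stripes 10 and sub_stripes 2 stripes the data over 5 mirrored pairs, so each device holds one fifth of the chunk:

```shell
length=1090519040               # chunk length from the CHUNK_ITEM above
pairs=$((10 / 2))               # num_stripes / sub_stripes = 5 stripe pairs
per_device=$((length / pairs))  # bytes each member device stores
echo "$((length / 1048576)) MiB chunk, $((per_device / 1048576)) MiB per device"
```

which prints "1040 MiB chunk, 208 MiB per device", matching the 1040/5 = 208 figure above.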
Balance Filters
- For case 1, an exclusion filter would be very useful. We could simply negate devid 27 in the devid filter:
btrfs balance start -ddevid=!27
- For case 2, a filter on chunk size would be useful, possibly combined with an exclusion mask:
btrfs balance start -dchunk_len=1090519040
btrfs balance start -dchunk_len=!5G
For case 1, an exclusion filter is exactly right, but it would be better placed in python-btrfs. The kernel interface is a little clunky for specifications like "include devices 10, 12, 14" or "exclude devices 3 and 5". Full generality requires bitmaps, variable-length data structures, or string parsing in the kernel, and I don't think anyone wants those. Device selection lists are trivial to implement in Python one block group at a time, and the end result is identical to, or better than, running balance in the kernel.
For case 2, simply removing badly-sized block groups isn't sufficient--the holes they leave behind will also be badly-sized, so a later chunk allocation will simply fill them in again. With striped profiles, this leads to btrfs slicing block groups smaller and smaller until there are hundreds of thousands of them, each 1 MiB long--then the allocator starts to fail.
For case 2, the solution has four parts:
Part 1: set device sizes
The first step is resizing every device to an integer multiple of 1 GiB, plus 1 MiB:
resizeDev () {
    dev="$1"       # block device path
    devid="$2"     # btrfs devid of that device
    # raw device size in bytes
    devSize=$(blockdev --getsize64 "$dev")
    # round down to whole GiB, then add 1 MiB for the reserved area at the start
    devRoundedGiB=$((devSize / 1024 / 1024 / 1024))
    devMiB=$((devRoundedGiB * 1024 + 1))M
    btrfs fi resize "$devid:$devMiB" /fs
}
resizeDev /dev/sdg1 11
resizeDev /dev/sdc1 12
...etc...
btrfs reserves the first 1 MiB of each device for superblocks and OS usage. This space is counted in the device size and in extent offsets, but btrfs will never allocate there, so we have to add 1 MiB to every offset, including the device size.
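As a sanity check on the arithmetic in resizeDev, here it is applied to a made-up raw device size (the byte count below is invented for illustration):

```shell
devSize=$((6024 * 1024 * 1024 * 1024 + 123456789))  # hypothetical ~6 TiB device
devRoundedGiB=$((devSize / 1024 / 1024 / 1024))     # round down to whole GiB
devMiB=$((devRoundedGiB * 1024 + 1))                # whole GiB in MiB, plus 1 MiB reserved
echo "btrfs fi resize would get: ${devMiB}M"
```

The extra 123456789 bytes are discarded by the rounding, so the target is 6024 GiB + 1 MiB = 6168577 MiB.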
Part 2: set system chunk size
echo $((1024*1024*1024)) | tee /sys/fs/btrfs/*/allocation/system/chunk_size
Note: you need a kernel patch to allow this:
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 14f53f757555..8bdcb0c90ae3 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -753,8 +753,8 @@ static ssize_t btrfs_chunk_size_show(struct kobject *kobj,
* Store new chunk size in space info. Can be called on a read-only filesystem.
*
* If the new chunk size value is larger than 10% of free space it is reduced
- * to match that limit. Alignment must be to 256M and the system chunk size
- * cannot be set.
+ * to match that limit. Alignment must be to 256M for data and metadata chunks,
+ * 32M for system chunks.
*/
static ssize_t btrfs_chunk_size_store(struct kobject *kobj,
struct kobj_attribute *a,
@@ -764,6 +764,7 @@ static ssize_t btrfs_chunk_size_store(struct kobject *kobj,
struct btrfs_fs_info *fs_info = to_fs_info(get_btrfs_kobj(kobj));
char *retptr;
u64 val;
+ u64 min_size;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -774,9 +775,12 @@ static ssize_t btrfs_chunk_size_store(struct kobject *kobj,
if (btrfs_is_zoned(fs_info))
return -EINVAL;
- /* System block type must not be changed. */
- if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
- return -EPERM;
+ /* System block type must not be smaller than minimum. */
+ if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
+ min_size = SZ_32M;
+ } else {
+ min_size = SZ_256M;
+ }
val = memparse(buf, &retptr);
/* There could be trailing '\n', also catch any typos after the value */
@@ -789,11 +793,11 @@ static ssize_t btrfs_chunk_size_store(struct kobject *kobj,
/* Limit stripe size to 10% of available space. */
val = min(mult_perc(fs_info->fs_devices->total_rw_bytes, 10), val);
- /* Must be multiple of 256M. */
- val &= ~((u64)SZ_256M - 1);
+ /* Must be multiple of min size. */
+ val &= ~((u64)min_size - 1);
- /* Must be at least 256M. */
- if (val < SZ_256M)
+ /* Must be at least 256M for data, 32M for system. */
+ if (val < min_size)
return -EINVAL;
btrfs_update_space_info_chunk_size(space_info, val);
This patch removes the restriction on system chunk sizes, so they can be 1 GiB like every other chunk.
The patch isn't necessary if you are already using dedicated SSD devices for metadata. Since system chunks are metadata, they will not affect the alignment of any striped data chunk, because the data chunks live on different devices.
After applying parts 1 and 2, it becomes possible to fully allocate all devices without creating misaligned or badly sized block groups. Parts 3 and 4 eliminate existing misaligned and badly sized block groups.
Part 3: balance misaligned block groups
for offset in $(seq 0 $devRoundedGiB); do
    for devid in $(...list of devices...); do
        doffset=$(( (offset * 1024 + 1) * 1024 * 1024 ))
        btrfs balance start -ddevid=$devid,doffset=$((doffset + 1))..$((doffset + 1024 * 1024 * 1024 - 1)) /fs
    done
done
Note that the doffset range intentionally excludes any block group whose starting offset is exactly (N * 1024 + 1) MiB, so block groups that are already aligned on all devices are not touched.
Note that the filesystem needs a few percent of free space due to limitations of the allocator used by balance; otherwise, balance will just push data around the devices without fixing anything. When there is sufficient unallocated space, btrfs moves data toward the end of the devices. After a while, the amount of contiguous free space at the beginning of each device will be larger than anywhere else, so btrfs will allocate new chunks at the beginning--hopefully fully aligned. This process does require some supervision.
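To make the ranges concrete, here is what the loop computes for the first two GiB slots (pure arithmetic, no filesystem needed). Each range begins one byte past the aligned start, so consecutive ranges tile the device with a one-byte hole at every aligned offset:

```shell
GiB=$((1024 * 1024 * 1024))
for offset in 0 1; do
    doffset=$(( (offset * 1024 + 1) * 1024 * 1024 ))  # aligned start: N GiB + 1 MiB
    start=$((doffset + 1))                            # skip the aligned byte itself
    end=$((doffset + GiB - 1))
    echo "offset $offset: aligned start $doffset, balance range $start..$end"
done
```

For offset 0 this gives the range 1048577..1074790399, and the next aligned start, 1074790400, falls just past it, so an already-aligned block group starting there is never matched.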
Part 4: balance block groups with low stripe count or low dev_extent length
This requires the chunk length filter proposed above. The existing stripes filter can also be used. Both can be implemented in python-btrfs instead, and combined with the loop in part 3. Indeed, this is all far easier to implement in Python than with btrfs-progs, because Python doesn't require mapping back and forth between device filenames and btrfs devids all the time.