
Ability for Dynamic Storage Tiering - NVMe (superfast) + SSD (mid-tier) + HDD (slow) - manipulate 'btrfs balance' profiles

Open TheLinuxGuy opened this issue 1 year ago • 10 comments

Could btrfs implement a feature to support multiple devices of different speeds/types with a profiling algorithm for data balancing? In other words: dynamic storage tiering.

Assume a user with a multi-device btrfs filesystem comprising:

  • 1TB NVME (tier 1)
  • 4TB SSD (tier 2)
  • 20TB HDD (tier 3)

To keep things simple, assume no redundancy in each tier. The user's goal is maximum performance, with storage in the filesystem kept as optimized as possible within some customizable settings (e.g. how much NVMe space should be left "free" for writeback caching of new I/O).

As I understand it, btrfs balance already does some filesystem optimization by spreading disk space utilization evenly across the devices. This feature request asks for more options to change how btrfs balance works and how new I/O writes are handled, so that 'tier 1' is always the priority.

Least-used data blocks would be "downgraded", i.e. moved down to a lower tier, if the user hasn't accessed them recently and growing filesystem usage demands some purging or rebalancing.
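The demotion policy described above (move the coldest blocks down a tier when they have been idle too long or the tier is over its capacity target) could be sketched roughly like this. This is only an illustration of the requested behavior; the tier names, thresholds, and `last_access` bookkeeping are all assumptions, not existing btrfs interfaces:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Extent:
    size: int
    last_access: float  # timestamp of last read/write (assumed tracking)

@dataclass
class Tier:
    name: str
    capacity: int
    extents: list = field(default_factory=list)

    def used(self):
        return sum(e.size for e in self.extents)

def demote_cold_extents(tiers, max_idle_seconds, now=None):
    """Move least-recently-accessed extents down one tier when the
    upper tier is over capacity or an extent has been idle too long."""
    now = now if now is not None else time.time()
    for upper, lower in zip(tiers, tiers[1:]):
        upper.extents.sort(key=lambda e: e.last_access)  # coldest first
        while upper.extents:
            coldest = upper.extents[0]
            over_capacity = upper.used() > upper.capacity
            too_idle = now - coldest.last_access > max_idle_seconds
            if not (over_capacity or too_idle):
                break
            if lower.used() + coldest.size > lower.capacity:
                break  # no room in the lower tier; stop demoting
            lower.extents.append(upper.extents.pop(0))

nvme = Tier("nvme", capacity=100)
ssd = Tier("ssd", capacity=400)
hdd = Tier("hdd", capacity=2000)
nvme.extents = [Extent(60, last_access=0), Extent(60, last_access=1000)]
# nvme is over capacity (120 > 100): the coldest extent is demoted to ssd.
demote_cold_extents([nvme, ssd, hdd], max_idle_seconds=3600, now=2000)
```

A real implementation would need access-time tracking at extent or chunk granularity, which btrfs does not currently expose.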

TheLinuxGuy avatar Apr 02 '23 02:04 TheLinuxGuy

Also, from my research it seems that Netgear may have already forked btrfs to achieve this: they implemented their own algorithm for storage tiering in their now-defunct ReadyNAS OS.

See page 10 of https://www.downloads.netgear.com/files/GDC/READYNAS-100/ReadyNAS_FlexRAID_Optimization_Guide.pdf and https://unix.stackexchange.com/questions/623460/tiered-storage-with-btrfs-how-is-it-done?answertab=modifieddesc#tab-top

TheLinuxGuy avatar Apr 02 '23 02:04 TheLinuxGuy

I was not aware of that, thanks for the links. It seems that ReadyNAS is not maintained, and I can't find any git repositories, assuming it's built on top of Linux. Their page also does not mention 'btrfs' anywhere. Storage tiers are a feature people ask for, so it's no surprise that somebody implemented it outside of Linux, but merging it back would be desirable. I haven't seen the code, so it's hard to tell how it was implemented and whether it would be acceptable; vendors often don't have to deal with backward compatibility or long-term support, so it's "cheaper" to do their own private extensions instead.

kdave avatar Apr 03 '23 17:04 kdave

There is a patch set for metadata-on-ssd somewhere. This, I think, would be a good middle ground if it were accepted into the mainline kernel. https://patchwork.kernel.org/project/linux-btrfs/patch/[email protected]/

Forza-tng avatar Apr 03 '23 18:04 Forza-tng

https://www.downloads.netgear.com/files/GPL/ReadyNASOS_V6.10.8_WW_src.zip

The paths I looked at are:

btrfs-tools-4.16/debian/patches/0010-Add-btrfs-balance-sweep-subcommand-for-dat-tiering.patch
linux-4.4.218-x86_64/fs/btrfs

I haven't looked at the full diff, since the kernel is pretty old and much has changed, but basically it adds another sort function in __btrfs_alloc_chunk2 (now btrfs_create_chunk) that orders the devices by a class attribute, plus an ioctl for a "sweep" filter for balance.
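In Python terms, the device-ordering part reportedly amounts to something like the following. The `tier_class` encoding and the tie-break on free space are assumptions made for illustration, not the actual patch code (which does this in C inside the chunk allocator):

```python
from dataclasses import dataclass

@dataclass
class Device:
    devid: int
    tier_class: int   # assumed encoding: 0 = NVMe, 1 = SSD, 2 = HDD
    free_bytes: int

def sort_devices_for_alloc(devices):
    """Prefer faster tiers first; within a tier, prefer more free space.
    Roughly mirrors the extra sort the patch adds to chunk allocation."""
    return sorted(devices, key=lambda d: (d.tier_class, -d.free_bytes))

devs = [
    Device(1, tier_class=2, free_bytes=10 << 40),   # HDD, 10 TiB free
    Device(2, tier_class=0, free_bytes=200 << 30),  # NVMe, 200 GiB free
    Device(3, tier_class=1, free_bytes=1 << 40),    # SSD, 1 TiB free
]
ordered = sort_devices_for_alloc(devs)
# New chunks would be allocated from ordered[0] first (the NVMe device).
```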

Duncaen avatar Apr 04 '23 00:04 Duncaen

This would be a fantastic addition to Btrfs. I'd like to emphasize the importance of being able to specify sub-volume affinity. Imagine having sub-volumes for /, /var/log, and /home. Here's the concept:

  • Data from / has the highest priority and is initially stored on tier 1, but it can be moved to tier 3 when it's not actively used.
  • Data from /var/log is initially written to tier 3. If there's no free space available on that tier, the data is written to another tier.
  • Data from /home is initially written with priority on tier 1. If there's no space available, it can be moved to tier 2. Eventually, when it's not actively used, it can be shifted to tier 3.

In this system, data from / is given the highest priority for storage space on tier 1, with a lower priority for /var/log and /home on the same tier. Similarly, data from /var/log is given the highest priority for storage space on tier 3, with a lower priority for / and /home on the same tier.

I imagine two parameters to implement this:

  • driver_write_priority: Allows users to define the order of data writes on the disks and set the priority of each sub-volume on the disks.
  • drive_unused_data: A parameter to handle data that is not actively used.

This level of control over data placement within sub-volumes would be a game-changer. It allows for finely tuned optimization of storage resources based on specific usage scenarios. It would further solidify Btrfs as a powerful and flexible file system for data management.
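A toy model of the proposed behavior, with a per-subvolume tier ordering and fall-through when a tier is full. The parameter name `driver_write_priority` comes from the comment above; the tier names, sizes, and function are hypothetical illustrations:

```python
# Hypothetical per-subvolume write priority, as proposed above:
# for each subvolume, an ordered list of tiers to try on a new write.
driver_write_priority = {
    "/":        ["tier1", "tier2", "tier3"],
    "/var/log": ["tier3", "tier2", "tier1"],
    "/home":    ["tier1", "tier2", "tier3"],
}

# Assumed free space per tier, in arbitrary units.
free_space = {"tier1": 100, "tier2": 400, "tier3": 2000}

def place_write(subvol, size):
    """Return the tier a new write for `subvol` lands on, falling
    through to the next preferred tier when one is full."""
    for tier in driver_write_priority[subvol]:
        if free_space[tier] >= size:
            free_space[tier] -= size
            return tier
    raise OSError("ENOSPC: no tier has room")

print(place_write("/", 50))      # tier1
print(place_write("/home", 80))  # tier1 has only 50 left, so: tier2
```

Demotion of inactive data (the proposed `drive_unused_data` parameter) would then periodically re-run placement against each subvolume's ordering, which is the part a balance filter could implement.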

studyfranco avatar Nov 08 '23 13:11 studyfranco

@TheLinuxGuy , @studyfranco It might be worth for you to have a look at the Btrfs preferred metadata patches. https://github.com/kakra/linux/pull/26

They do not explicitly deal in tiers, but they do introduce metadata-only, metadata-preferred, data-only and data-preferred priorities.

Forza-tng avatar Nov 08 '23 18:11 Forza-tng

Rebased to 6.6 LTS: https://github.com/kakra/linux/pull/31

kakra avatar Nov 26 '23 06:11 kakra

This is a very good start. But my use case (and my proposal) is more complex. I have a hybrid system, and Btrfs with this feature would be the best file system for home use: no space lost, no compromises, and the most adaptable when we want to play games.

studyfranco avatar Nov 29 '23 13:11 studyfranco

Currently I'm solving it this way:

I have two NVMe drives; each has a 64GB metadata-preferred partition for btrfs. The remaining space on each is in an md-raid1, which holds the bcache cache partition. All HDDs (4x 4TB) are bcache backing partitions in writeback mode, attached to the md-raid1 cache, and formatted as data-preferred btrfs members.

This way, metadata is on native NVMe, because bcache doesn't handle CoW metadata very efficiently, and I still get the benefits of having hot data on NVMe. I'm using these patches to exclude some IO traffic from being cached (e.g. backup or maintenance jobs with idle IO priority): https://github.com/kakra/linux/pull/32

I achieve a cache hit rate of 96% and a bypass-hit rate of 95% (IO requests that should have bypassed caching but were already in cache) for an 800 GB cache and 4.2 TB of used btrfs storage.

Actually, combining bcache with preferred metadata worked magic: cache hit rates went up and response times went down a lot. Transfer rates peak around 2 GB/s, which is slower than native NVMe but still very good. Average transfer rates are around 300-500 MB/s, with data coming partially from cache and partially from HDD. Migrating this setup from a single SSD to dual NVMe improved perceived responsiveness a lot. Still, due to CoW and btrfs data-raid1, bcache cannot work optimally and wastes some space and performance. A better integration of the two would be useful, where bcache would know about btrfs raid1 and store the data just once, or CoW would inform bcache about unused blocks.

kakra avatar Nov 29 '23 13:11 kakra