
check btrfs: some alternative check modes

Open calestyo opened this issue 2 years ago • 3 comments

Hey.

Right now, check_btrfs allows checking for allocated/unallocated space. (Btw: the option names and their descriptions may be a bit ambiguous with respect to allocated vs. unallocated.)

First: I'm not an expert, but is it (still) necessarily a problem if btrfs runs out of unallocated space? I mean it has the global reserve... and if there is plenty of free space left in the data and metadata block groups, having no unallocated space may not mean much, except perhaps that a balance will no longer be possible (which may not be an issue). And if there's still enough free space in the already allocated data and metadata block groups, ... most other things should continue to run just fine?

At the university here I run a large data centre for the LHC at CERN, and recently stumbled into the following situation: https://lore.kernel.org/linux-btrfs/CAHzMYBR9dFVTw5kJ9_DfkcuvdrO4x+koicfiWgVNndh8qU2aEw@mail.gmail.com/T/#t which I would like to be detectable by check_btrfs.

What happened there, basically, is that unallocated space was completely used up; a fair amount of space (~800GB) was still free in the data block groups, but nothing (usable) was free in the metadata block groups. So while there would have been space for the file data itself, nothing was left for new file metadata, and one got ENOSPC.

Of course, just checking for low unallocated space would also detect it, but would also ring the bell too often (namely when there's still enough left in the metadata block groups).

So I'm not sure how one could detect this better... maybe a check that fires when unallocated space is below some threshold (or even 0) and free metadata space is also below some other threshold, while free data space is still above some threshold (otherwise it would also report a fs that's simply full).
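For illustration, the combined condition could look roughly like this (a minimal sketch only; the function name, thresholds, and Nagios-style return codes are made up here and are not part of check_btrfs):

```python
#!/usr/bin/env python3
# Minimal sketch of the combined check described above. Not part of
# check_btrfs; the function name, thresholds, and return codes are made up.
# All sizes are in bytes.

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios-style exit codes


def check_metadata_squeeze(unallocated, meta_free, data_free,
                           unallocated_min=1 << 30,   # e.g. 1 GiB
                           meta_free_min=512 << 20,   # e.g. 512 MiB
                           data_free_min=10 << 30):   # e.g. 10 GiB
    """Alert when metadata is about to run out while data space still looks fine."""
    if unallocated <= unallocated_min and meta_free <= meta_free_min:
        if data_free >= data_free_min:
            # Plenty of room left for file data, but no room for new metadata:
            # the "~800GB free but ENOSPC" situation described above.
            return CRITICAL, "no space left for metadata growth"
        # Everything is low: the filesystem is simply (nearly) full.
        return WARNING, "filesystem is nearly full"
    return OK, "enough unallocated or free metadata space left"
```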

Any ideas?

Cheers, Chris.

calestyo avatar Dec 15 '21 00:12 calestyo

is it (still) necessarily a problem if btrfs runs out of unallocated space?

No, but it's not necessarily not a problem either. If your metadata stays the same size or gets smaller, there's no problem. If the metadata gets larger (which can happen even if the data is the same size, due to decreased average extent size, more directories or hardlinks, symlinks, xattrs, reflinks, snapshots, etc) and there's no free space left, the filesystem comes to a hard stop on ENOSPC.

Ideally, when the last block group is allocated on the filesystem, there will be sufficient metadata allocations to fill up the data block groups and nothing more needs to be done; however, this cannot be guaranteed, because metadata can be made arbitrarily large after all space on the disks has been allocated (e.g. by while :; do mkdir $((counter++)); done).

I mean it has the global reserve

The global reserve provides enough metadata to get through one btrfs transaction. This reserve allows you to add more disks if you've really run out of space, or delete some files or a snapshot, or maybe balance a data block group. It doesn't enable arbitrary metadata growth (e.g. touching every file in a snapshot)--you need unallocated space (or previously allocated metadata space) for that.

Of course, just checking for low unallocated space would also detect it, but would also ring the bell too often

When low (but above zero) unallocated space is detected, start a data balance. When there is abundant unallocated space, cancel the data balance. Automate this with scheduling appropriate for your data and usage patterns to implement a minimal data balance regime. Most users need 3 + number_of_devices GB of free space for metadata at all times (on a filesystem over 50GB). Balance is (intentionally?) slower than data writing, so you'll need a higher free space target to allow for the amount of new data that will be added to the filesystem while the balance is running.
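As a rough sketch of such a regime, run periodically from cron or a systemd timer (the thresholds below are only illustrative, and the parsing of btrfs filesystem usage -b output may need adjusting for the btrfs-progs version in use):

```python
#!/usr/bin/env python3
# Rough sketch of a minimal data balance regime, as described above.
# Thresholds are illustrative; pick them for your own data and usage patterns.
import re
import subprocess
import sys

MNT = sys.argv[1] if len(sys.argv) > 1 else "/srv/data"
LOW = 20 << 30    # start balancing below 20 GiB unallocated
HIGH = 100 << 30  # stop balancing above 100 GiB unallocated


def unallocated_bytes(mnt):
    # Parses 'Device unallocated:' from 'btrfs filesystem usage -b'; the exact
    # output format may differ between btrfs-progs versions.
    out = subprocess.check_output(
        ["btrfs", "filesystem", "usage", "-b", mnt], text=True)
    return int(re.search(r"Device unallocated:\s+(\d+)", out).group(1))


unalloc = unallocated_bytes(MNT)
if unalloc < LOW:
    # Compact partially filled data block groups to hand space back to the
    # unallocated pool; -dusage=50 only touches block groups at most 50% full.
    subprocess.run(["btrfs", "balance", "start", "--bg", "-dusage=50", MNT],
                   check=False)
elif unalloc > HIGH:
    # Plenty of unallocated space again; stop a balance that may still be running.
    subprocess.run(["btrfs", "balance", "cancel", MNT], check=False)
```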

As the filesystem gets closer to full, the data balance will run into the knapsack problem and will not release any more unallocated space; however, if you've been running balances while the filesystem fills up, you'll have a bit of extra metadata space allocated, and the filesystem will run out of data space before metadata space.

Playing tug-of-war with data balances and unallocated space is messy and inefficient in iops. The other way to do it is to mount with -o metadata_ratio=33 and simply overallocate the metadata space at all times (which is inefficient in space, but if you pick the right ratio you'll never need to balance). This will allocate 3% of the space for metadata, which is several times more than usually needed--but metadata will not run out except under extreme conditions (the kind you have to use special mkfs options for on ext4).

Some suggestions that involve modifying the filesystem or the on-disk data format:

  • The allocator has some obvious opportunities for speedups as the filesystem gets full. Right now the last 1GB of the filesystem takes as long to write as the first 100 TB. There's a patch already queued up that helps with this.
  • Block groups are fixed size (multiples of 1GB each), extents are variable but immutable size (4K to 128MB each). This creates a knapsack problem when filling block groups, which means a few percent of the filesystem space can never be recovered in a data balance. The space can be used for data writes, which is why it's important to run minimal data balancing proactively, before running too low on space.
  • The global reserve currently serves two functions: it acts as a minimum amount of free space, and also as a maximum transaction (and kernel memory) size. These concepts could be separated, so that the maximum transaction size stays at 512 MB, but the minimum amount of free metadata space could go up by an order of magnitude to ensure metadata has room for some growth.
  • The latter could be made a mount option so that it can be tuned for specific workloads. If you have a fleet of machines all doing the same thing, you can figure out how much space you need, if that number is different from everyone else's needs, and allocate it in advance.
  • Mixed block groups combine metadata and data into a single allocator; however, there are good performance-related reasons for keeping them separate. Every metadata allocation is exactly the same (small) size, and every data allocation is a different size, so while mixing them helps with the ENOSPC problem, it hurts performance and worsens free space fragmentation.
  • Simply embed the "3 + num_devices GB" heuristic into the metadata chunk allocator, and call it a day.

Zygo avatar Dec 15 '21 04:12 Zygo

Hey Zygo.

No, but it's not necessarily not a problem either. If your metadata stays the same size or gets smaller, there's no problem. If the metadata gets larger (which can happen even if the data is the same size, due to decreased average extent size, more directories or hardlinks, symlinks, xattrs, reflinks, snapshots, etc) and there's no free space left, the filesystem comes to a hard stop on ENOSPC.

I see.

With respect to btrfs itself, that probably means there's no golden solution... because the above could always easily happen, e.g. one still has plenty of free space in the data block groups... and then suddenly starts creating new dirs, xattrs, etc.

But the point here was mainly with respect to check_btrfs... i.e. is checking for low unallocated space alone and with no further conditions really still a good indicator for anything?

The rest of your comment is probably related to btrfs itself... and we should perhaps move that discussion back to the thread on linux-btrfs... (where I haven't had time to answer yet). If you don't mind, I'd simply quote your stuff from here over at linux-btrfs when I find time to reply there.

calestyo avatar Dec 16 '21 04:12 calestyo

Hi!

Checking if you're running out of unallocated raw disk space is indeed useful to get early signals about the problem that you're facing now.

Regarding the question on the mailing list, "Is there some way to see a distribution of the space usage of block groups?": yes, there's btrfs-search-metadata in this project, which provides convenient ways to show much of this kind of information. If you like a picture better, there's btrfs-heatmap.
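For example, something along these lines prints a per-block-group usage distribution directly with the python-btrfs API (a rough sketch; it assumes the FileSystem.chunks() / FileSystem.block_group() calls used by the project's example scripts, so check the library documentation for the exact names and attributes):

```python
#!/usr/bin/env python3
# Rough sketch: print per-block-group usage with python-btrfs.
# Assumes the FileSystem.chunks() / FileSystem.block_group() API as used by
# the project's example scripts; check the documentation for exact names.
import sys
import btrfs

fs = btrfs.FileSystem(sys.argv[1] if len(sys.argv) > 1 else "/")
for chunk in fs.chunks():
    bg = fs.block_group(chunk.vaddr, chunk.length)
    pct = 100.0 * bg.used / bg.length
    print("vaddr {} length {} used {} ({:.0f}%) flags {}".format(
        bg.vaddr, bg.length, bg.used, pct,
        btrfs.utils.block_group_flags_str(bg.flags)))
```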

Hans

knorrie avatar Dec 17 '21 09:12 knorrie