
Possible to limit minimum dedupe extent size?

Open Donwulff opened this issue 4 years ago • 8 comments

I know the documentation basically suggests setting the hash table size based on the intended minimum dedupe extent size, but based on the log, most extents still seem to be significantly smaller. Another issue report discussed sizing the hash table so that no buckets fill up, which goes against the sizing advice in the README and results in dedupes at the minimum extent size.

Having an option to set a minimum dedupe extent size (unless the match covers the whole file), say 32MiB, would seem like it should allow bees to play nicely with btrfs defrag. (Except I think compression produces 128KiB extents, so I'm not certain how all the options interact.) What even happens if I run bees on a partition mounted with the autodefrag option? #121 and #117 are related, but this is a specific feature request/question.

Donwulff · Oct 07 '19 18:10

The "minimum dedupe extent size" discussed in the documentation about hash table sizing is the behavior that is the result of setting the hash table to a particular size. In reality the hash table stores data from a sliding window over the filesystem, and the minimum extent size required for a dedupe match increases continuously (starting from 4K) as the distance between two matching data blocks increases ("distance" is the number of non-matching blocks that are scanned between a pair of matching blocks).

If you do the math on the hash table and filesystem size and get an extent size of 32K, it means that bees will dedupe 4K..128MB matches, but averaged over an entire filesystem, most of the extents that are deduped will be 32K or larger (assuming randomly distributed extent sizes, uniformly distributed content, and various other statistical preconditions). The portion of matching extents that are successfully deduped decreases rapidly as the extent size goes below 32K, but a 4K match is always possible if both matching blocks are written close enough together.
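For concreteness, here's a rough sketch of that math (a minimal illustration, assuming 16-byte hash table entries and a 4K block size; the entry size is an assumption based on my reading of the bees sizing docs, not something stated in this thread):

```python
# Rough estimate of the average deduped extent size for a given hash table
# size.  Assumes 16-byte hash table entries and 4K blocks; both constants
# are assumptions taken from the bees sizing documentation.

ENTRY_BYTES = 16       # assumed size of one hash table entry
BLOCK_BYTES = 4096     # bees/btrfs clone-alignment block size

def average_dedupe_extent_size(unique_data_bytes, hash_table_bytes):
    """Approximate average extent size (bytes) that gets deduped when the
    hash table can only remember a fraction of the unique data blocks."""
    blocks = unique_data_bytes / BLOCK_BYTES     # 4K blocks of unique data
    entries = hash_table_bytes / ENTRY_BYTES     # blocks the table can remember
    coverage = min(1.0, entries / blocks)        # fraction of blocks remembered
    return BLOCK_BYTES / coverage                # sparser coverage -> larger matches needed

# Example: 1 TiB of unique data with a 512 MiB hash table -> ~32K average,
# even though individual matches can still be anywhere from 4K to 128M.
print(average_dedupe_extent_size(1 << 40, 512 << 20))   # 32768.0
```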

Currently the bees code assumes a single block size throughout (specifically the filesystem clone_alignment size). 4K is the only block size in bees. The bees code walks over data blocks with a single pair of 4K IO buffers, it matches adjacent blocks 4K at a time, and it splits (rewrites) extents on 4K boundaries. The 4K blocks are batched together in contiguous groups for the final dedupe and copy calls to the kernel, but that's currently the only point in the code where any unit of data larger than 4K is considered.
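As a toy illustration of that batching step (this is not the bees source, just a sketch of the idea): adjacent matching 4K blocks get coalesced into one contiguous range, so a single dedupe or copy request can cover a whole run of blocks.

```python
# Toy sketch (not bees code): coalesce adjacent matching 4K blocks into
# contiguous ranges, so one dedupe/copy request covers a whole run of blocks.

BLOCK = 4096

def batch_matches(matched_offsets):
    """matched_offsets: sorted byte offsets of 4K blocks that matched.
    Yields (start, length) ranges covering each run of adjacent blocks."""
    start = prev = None
    for off in matched_offsets:
        if start is None:
            start = prev = off
        elif off == prev + BLOCK:
            prev = off                              # extend the current run
        else:
            yield (start, prev - start + BLOCK)     # flush the finished run
            start = prev = off
    if start is not None:
        yield (start, prev - start + BLOCK)

# Blocks at 0, 4K, 8K and 64K become two ranges: (0, 12K) and (64K, 4K).
print(list(batch_matches([0, 4096, 8192, 65536])))
```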

It sucks. I know. I'm working on a replacement which operates on complete extent pairs, instead of just one or two 4K blocks at a time. This can integrate dedupe, defrag, and btrfs garbage collection into a single pass over the filesystem, as well as fixing a lot of annoying design problems within bees (like the lack of any ability to deal with blocks of non-uniform size or sizes over 4K). It's a rewrite of the bees code.

In my testing, autodefrag by itself adds far too much IO latency to be usable. I've never tried autodefrag and bees at the same time, but I don't see how combining them could be a good thing. bees and autodefrag do opposite things, and they might treat each other's output as new data, so they can create a feedback loop, with bees splitting extents apart exclusively based on content and autodefrag joining them back together again exclusively based on size.

btrfs fi defrag has many problems because it is the wrong approach to managing fragmentation on btrfs. The idea seems to have been copied from filesystems that have mutable extents, and on such filesystems bigger extents are always better. btrfs has immutable extents, so there's an optimal extent size--there are problems when extents get too large as well as too small. The btrfs fi defrag tool can't play well with extent-mutating block-level dedupe. It's OK for file-oriented dedupe, like bedup or jdupes, because these programs won't do partial extent matches, but you still waste a lot of IO: either you defrag a file unnecessarily because it will be removed by a later dedupe, or you do dedupe first and then btrfs fi defrag can't fix the fragmentation without undoing the dedupe extent sharing.

I don't think it's possible to fix btrfs fi defrag or the kernel autodefrag feature without changing the interface so much that the result is no longer recognizable as the same feature. I don't think there's a way to make bees play nicely with them--if there was, I'd be using them now, instead of building their replacements.

Zygo · Oct 07 '19 21:10

Thanks for the detailed response. I roughly grasped how the hash table works, and why it doesn't do what I want (i.e. limit the minimum match), which is why I thought it might not be too hard to add a configurable limit that skips deduping smaller stretches (if not the whole file) after finding them. Actually running fs defrag on some files, I found that disk space usage dropped (compression not enabled), making me wonder if the metadata overhead is more than 4KiB. Observing fs defrag, it appears to just rewrite all files whether they're fragmented or not, breaking all duplicate extents even when they're larger than 32MiB (the default "don't defragment" length). No idea what's going on with compressed extents. So certainly a combined approach would be best; it just sounds like a lot of work.

For some context, I work a lot with bioinformatics, where genomic sequences tend to have long stretches of N's to indicate unknown/uninteresting locations, or things like centromeric repeat arrays which can repeat for long stretches. It can be kind of a big deal when streaming a terabyte of genetic data for analysis, or searching for some motif, if there are constant seeks for 4KiB segments; at the same time there are a LOT of opportunities for reducing duplication, with large files having the same beginning, etc.

Donwulff · Oct 07 '19 22:10

The idea seems to have been copied from filesystems that have mutable extents, and on such filesystems bigger extents are always better. btrfs has immutable extents, so there's an optimal extent size--there are problems when extents get too large as well as too small.

Well, I think btrfs also has mutable extents in the nodatacow case, i.e. files marked as nocow. Let's exclude the fact that we can take snapshots of such files which makes extents of such a file more or less immutable. Maybe btrfs fi defrag should learn filters to only work on nocow extents as an option?

kakra · Oct 07 '19 23:10

btrfs also has mutable extents in the nodatacow case, i.e. files marked as nocow.

...if they are not shared.

Maybe btrfs fi defrag should learn filters to only work on nocow extents as an option?

Sure, that would be a use case for the existing defrag. You'd run it after you did a snapshot to backup a nodatacow file (which would fragment the file if it was written during the backup) and you wanted the file to be contiguous again.

Since nodatacow is a per-inode attribute, you won't get mutable extents randomly in the filesystem, only in nodatacow files, so a first-level filter for btrfs fi defrag would be to just skip any file that doesn't have the nodatacow flag set.
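A minimal sketch of that first-level filter (hypothetical helper names; it shells out to lsattr and btrfs-progs for simplicity, whereas a real implementation would presumably use the FS_IOC_GETFLAGS ioctl directly):

```python
# Sketch of the "skip anything that isn't nodatacow" filter described above.
# Hypothetical helpers; uses lsattr's 'C' (No_COW) attribute for simplicity.

import subprocess

def is_nodatacow(path):
    """Return True if lsattr reports the No_COW ('C') attribute on path."""
    out = subprocess.run(["lsattr", "-d", path],
                         capture_output=True, text=True, check=True).stdout
    flags = out.split()[0]          # e.g. "---------------C------"
    return "C" in flags

def defrag_if_nocow(path):
    """Only defragment files that carry the nodatacow attribute."""
    if is_nodatacow(path):
        subprocess.run(["btrfs", "filesystem", "defragment", path], check=False)
```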

Another use case would be to recompress a file using a different compression algorithm. This isn't really 'defrag' any more, though. More like an atomic in-place copy function where you can pick the copy's encoding.

Actually running fs defrag on some files I found out the disk space usage dropped

That might be due to unreachable blocks in the extents. If you have a 32M extent, and you overwrite 31M of the extent, the disk usage is 63M. The first 32M extent is immutable, so it remains present in its entirety until the last reference to it is removed. Some of the blocks in the 32M extent may not be reachable through any file on the filesystem. These unreachable blocks occupy space until the last reference is removed. If you overwrite 30M of the 31M, then the disk usage is now 93M (32M + 31M + 30M). This can continue until a file is occupying 32768 times its nominal disk space in the worst case (though more than 50% unreachable is rare outside of contrived test cases).
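To make the arithmetic above concrete, here's a tiny bookkeeping model (units are MiB, purely illustrative, no btrfs calls): each overwrite creates a new immutable extent, and bytes of older extents hidden by later writes become unreachable but stay allocated.

```python
# Toy model of the example above (units: MiB).  Later writes hide bytes of
# older immutable extents; hidden bytes are unreachable but stay allocated
# until the whole old extent loses its last reference.

def usage(extents):
    """extents: (offset, length) writes to the same file, oldest first.
    Returns (allocated, reachable) in MiB."""
    allocated = sum(length for _, length in extents)
    reachable = 0
    for i, (off, length) in enumerate(extents):
        live = set(range(off, off + length))
        for off2, len2 in extents[i + 1:]:
            live -= set(range(off2, off2 + len2))   # later writes hide these blocks
        reachable += len(live)
    return allocated, reachable

# 32M extent, then 31M overwritten, then 30M of that 31M overwritten again:
print(usage([(0, 32)]))                    # (32, 32)
print(usage([(0, 32), (0, 31)]))           # (63, 32)
print(usage([(0, 32), (0, 31), (0, 30)]))  # (93, 32) -- 61 MiB unreachable
```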

btrfs fi defrag will remove such references, but only in the process of blindly rewriting the whole file. defrag could be enhanced to check the references to every extent in a file to ensure all the blocks in each extent are still reachable, and only rewrite extents where some blocks aren't.

bees will try to replace entire extents that are only partially deduped; however, it currently doesn't check to see if an extent it is using as dedupe src contains unreachable blocks. This leads to extents with unreachable blocks having lots of references, making them even harder to get rid of.

a combined approach would be best, just sounds a lot of work.

It's a lot of work, but it's necessary to get bees past the limitations of its current architecture. I'm pretty sure there are at least two more orders of magnitude of performance gain to squeeze out of bees, but the current code can't go there without getting really ugly.

Also, I really need a tool that can defrag at scale without conflicting with dedupe, and nobody else seems to be working on one... ;)

genomic sequences tend to have long stretches of N's to indicate unknown/uninterested locations or things like centromeric repeat arrays which can repeat for long stretches.

Current bees has several problems with this kind of data. bees stops looking for matches once it has matched 4K, and always chooses the first 4K block it encounters even if that block is in a bad extent (too small or unreachable space). Microsoft mailboxes (NUL encoded as base-64 is 'A') and ext4 filesystem images (lots of 0xff in block bitmaps) are other bad cases that bees clumsily works around. Usually the workaround is to just skip over such extents.

There are currently btrfs problems with multi-million-reference extents, so it's not such a bad thing if bees gives up after a few thousand. A million times smaller is better than 1000 times smaller, but only 0.0999% better, and much much faster.

Ideally, bees will choose which extent to keep and which to discard, and also use longest matches instead of first matches. Thus bees would keep one extent with 32768 blocks of 'N' and just reflink parts of it as required, so a thousand reflinks eliminates 128GB of duplicate data with 128MB extents, instead of 4MB of data with 4K extents. The rewrite enables that by comparing and removing entire extents instead of individual blocks.
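A quick back-of-the-envelope check of those numbers (simple arithmetic, using the same round figures as above):

```python
# Back-of-the-envelope check of the savings above.

reflinks = 1000
extent_mb = 128      # one whole 128MB extent of 'N' (32768 x 4K blocks)
block_kb = 4         # a single 4K block, as current bees dedupes it

print(reflinks * extent_mb / 1000)   # 128.0 GB removed with whole-extent reflinks
print(reflinks * block_kb / 1000)    # 4.0 MB removed with single 4K dedupes
```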

Zygo · Oct 08 '19 03:10

In my testing, autodefrag by itself adds far too much IO latency to be usable. I've never tried autodefrag and bees at the same time, but I don't see how combining them could be a good thing. bees and autodefrag do opposite things, and they might treat each other's output as new data, so they can create a feedback loop, with bees splitting extents apart exclusively based on content and autodefrag joining them back together again exclusively based on size.

This means that I should disable autodefrag on SSDs when using bees? It seems that this will greatly degrade I/O performance...

daiaji · Oct 30 '19 15:10

It seems that this will greatly degrade I/O performance...

Do you have data that supports that claim? In my testing, I've found autodefrag falls somewhere between useless and actively harmful to performance, whether running bees or not.

I haven't done a proper study on it though, like, I don't have pretty graphs showing the autodefrag performance nosedive. I enabled autodefrag on some test servers and they were unusable 12 hours later, I disabled autodefrag and they were fast again.

Zygo · Oct 30 '19 16:10

I think autodefrag cuts both ways. Yes, it slows down IO performance while it's enabled. But after disabling it, performance returns to better levels than before enabling it. At least this is what I observed on two desktop systems. Even for SSDs it may make sense to enable it temporarily once in a while, because it recombines extents and thus reduces the IOPS amplification that CoW may have caused.

Since autodefrag tends to fight with bees, it's advisable to use it only temporarily: while it is active, do some IO that usually fragments files a lot (sqlite files or other single-file databases with write patterns that don't work well on btrfs). After this, disable autodefrag again and reboot. The system feels snappier again.

kakra · Oct 30 '19 17:10

We should be able to quantify this. One of the requirements for bees defrag is that it have a net positive effect, i.e. it has to be at least better than doing nothing. To determine if bees meets this requirement, we need a way to measure a control subject, measure a subject running bees defrag, leave them both alone for a while and see which ends up in a better state. So my challenge is: what are the performance numbers that matter? What does a benchmark look like?

How much IO does autodefrag need, how long does it have to run for a benefit to appear, how much benefit is achieved before diminishing returns hit zero, how do the numbers for autodefrag compare to other btrfs extent allocation/relocation strategies?

If the answers for autodefrag are "you have to run it for 1 hour a day to save 6 minutes of IO time during the rest of the day" then you save a net 6.3 hours of IO time (and SSD wear and FTL fragmentation and all the stuff that goes with it) every week by not using autodefrag. "do nothing" is faster because it's not as slow as autodefrag.

There are alternatives. You could do 3 minutes of IO per day to balance a random data block group to ensure there are large contiguous free space areas, and reduce commit latency (which is inversely proportional to metadata density and therefore affected by fragmentation) over a period of several days. This keeps the btrfs allocator fast and also helps prevent metadata ENOSPC failures; on the other hand, it doesn't improve sequential read or delete performance. You could complement this with 5 minutes per day of btrfs fi defrag on the most recently modified files, in that order so that the defrag can take advantage of the contiguous free space created by balance. This gives better sequential read and delete performance and frees some space due to the garbage collection side-effect of defrag.
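A hedged sketch of what that daily routine could look like (the btrfs commands and flags are the standard btrfs-progs ones, but the mount point, the one-day cutoff, and the scheduling are assumptions for illustration, not a recommendation from this thread):

```python
# Sketch of the balance-then-defrag routine described above.  Paths and
# thresholds are assumed; run from cron or a systemd timer.

import os
import subprocess
import time

MOUNT = "/mnt/data"   # assumed mount point

# Step 1: relocate one data block group to recreate large contiguous free space.
subprocess.run(["btrfs", "balance", "start", "-dlimit=1", MOUNT], check=True)

# Step 2: defragment recently modified files, so they can land in the
# contiguous free space the balance just made available.
cutoff = time.time() - 86400   # "most recently modified" = last 24 hours (assumed)
for root, _dirs, files in os.walk(MOUNT):
    for name in files:
        path = os.path.join(root, name)
        try:
            if os.path.getmtime(path) >= cutoff:
                subprocess.run(["btrfs", "filesystem", "defragment", path],
                               check=False)
        except OSError:
            pass   # file vanished or is unreadable; skip it
```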

In my testing, autodefrag doesn't trigger at all (so the net effect is zero), or goes all the way to 24 hours per day of IO time (or whatever it could steal from applications)--at which point performance gains are negative and eventually plateau (i.e. it reaches a point where autodefrag runs out of capacity to make things any worse). I didn't run tests where autodefrag was disabled some of the time; maybe that produces a performance gain curve that is sometimes above zero?

Zygo · Oct 31 '19 16:10