
Deduplicating existing BTRFS storage

Open mikhailnov opened this issue 5 years ago • 26 comments

I've read docs and still don't understand:

  • let files A and B have the same hashes (they are completely the same)
  • A and B are located on the same subvolume
  • can bees deduplicate A and B and so increase free space by the size of that file?

Could you please clarify that?

mikhailnov avatar Jun 26 '19 12:06 mikhailnov

The fact that A and B are on the same subvolume is irrelevant, bees works at the btrfs filesystem level (so all subvolumes, including snapshots, are checked).

Usually if you have A and B, and no snapshots*, yes, it should increase the free space by the size of that file.

Note that if you run bees on a preexisting filesystem, this can take a very long time (on my multi-TB filesystem, with 100+ snapshots, I think it took 2 weeks). But once this is done, bees works incrementally, checking what changed between each transaction, and then it's much less I/O intensive.

If you have two big files A and B with only a slight difference, bees can still dedupe most of the blocks, because bees works at the block/extent level.

Don't hesitate to build a btrfs filesystem in ram and test bees on it to see if it works how you expect it to.

* if you have snapshots, you might still have a reference to the blocks of the B file you wanted to free, in one of the snapshots, so the blocks will be freed only once bees has finished to scan the entire FS and actually deduped every reference to the B file.
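If you want to try that, here is a minimal sketch of such a throwaway test setup. The paths, sizes and the hash table layout follow the "command line (expert mode)" section of the bees docs as I remember it; treat the file names and sizes as illustrative and double-check against your bees version:

    # create a small btrfs image (put it on tmpfs if you want it purely in RAM)
    truncate -s 2G /tmp/bees-test.img
    mkfs.btrfs /tmp/bees-test.img
    sudo mkdir -p /mnt/beestest
    sudo mount -o loop /tmp/bees-test.img /mnt/beestest

    # some duplicate data: two full, unshared copies of the same random file
    sudo dd if=/dev/urandom of=/mnt/beestest/A bs=1M count=256
    sudo cp --reflink=never /mnt/beestest/A /mnt/beestest/B

    # bees wants a preallocated hash table in $BEESHOME
    sudo mkdir /mnt/beestest/.beeshome
    sudo truncate -s 16M /mnt/beestest/.beeshome/beeshash.dat
    sudo chmod 700 /mnt/beestest/.beeshome/beeshash.dat
    sudo BEESHOME=/mnt/beestest/.beeshome bees /mnt/beestest

Afterwards you can compare the free space reported by btrfs filesystem usage before and after letting bees run for a while.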

speed47 avatar Jun 26 '19 18:06 speed47

Thank you for the detailed explanation. How much load does this operation generate? I have a 2-core CPU; will the filestore remain usable via NFS/Samba while being deduplicated?

mikhailnov avatar Jun 26 '19 19:06 mikhailnov

It'll mainly be I/O bound, not so much CPU bound, especially for HDDs (i.e. non-SSD).

You can configure bees to be nicer; see --loadavg-target. Of course it'll take longer to complete, but note that you can interrupt it at any time.
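Something like this, for example (a sketch only; the flags come from bees --help, the mount point and load target are illustrative, and it assumes the hash table is already set up, e.g. by the beesd wrapper):

    # throttle bees whenever the system load average exceeds 2,
    # and use a single worker thread to keep I/O pressure low
    sudo bees --loadavg-target 2 --thread-count 1 /mnt/filestore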


speed47 avatar Jun 26 '19 21:06 speed47

I meant that high I/O load may make the system nearly unusable. I forgot about "nice", thank you!


mikhailnov avatar Jun 26 '19 21:06 mikhailnov

I've read docs and still don't understand:

* let files A and B have the same hashes (they are completely the same)
* A and B are located on the same subvolume
* can bees deduplicate A and B and so increase free space by the size of that file?

The simple answer is "usually yes, but not always." The "Best Effort" in the name bees reflects that not everything gets deduplicated (or can be deduplicated). The goal for bees is to free space from 95% or more of all duplicate blocks when using the recommended hash table size for the data.

In simple, common cases, all of the space used by B is freed, possibly after some temporary space is allocated. In more complex, rarer cases, the amount of free space may increase by more or less than the size of the file B. In some special cases, total free space could even decrease.

Before we continue, let's say file A is always the file that bees reads first. This is sometimes important, and if we establish that now, we don't have to keep repeating "whichever comes first"...

Here is a comprehensive list of reasons why the amount of space freed might not be the same as the size of file B:

  • btrfs deduplicates at the extent level, not the file level. An extent is a contiguous set of blocks forming part of a file. It's possible for files A and B to be stored in shared blocks, unshared blocks, or a combination of the two. This means that the increase in free space on btrfs after dedupe may be anything from 0 to the full size of A or B. In bees, the size of A is not relevant because bees will always remove extents from file B by replacing them with references to file A, i.e. bees keeps the first match and removes all others.

  • bees processes each btrfs extent individually, so in some cases there are restrictions that apply to only parts of a file, and the rest of the file can be deduplicated and its space freed. In some cases, given that file A and B are identical, the same outcome occurs for every extent in both files; however, in other cases, there may be a different outcome for each pair of extents in the files.

  • It is possible for btrfs extents to contain the same data, but not occupy the same amount of space on disk. e.g. if file B is compressed while file A is not. Currently, bees will keep the first extent it encounters during scan (i.e. the uncompressed extent from file A), and free all others (i.e. the compressed extent in file B). This may cause the amount of space freed to be less than the file size, since the compressed file occupies less space than its size. (This behavior may change in future bees versions to remove the larger version of the data when available).

  • btrfs counts references to extents, not blocks or files, when freeing space. It is possible to have an extent where some of the extent's blocks are referenced by files, but some blocks are no longer referenced by any file. This happens when some blocks in an extent are overwritten by new blocks that did not replace every block in the old extent. e.g. write a 1MB file in a single extent, then later, overwrite 16K in the middle of the file. The result is that the 1MB extent has 16K in the middle that are not accessible because they were replaced by a different 16K extent. The entire 1MB extent will persist until all references to all of its blocks are no longer present anywhere in the filesystem. In theory, the amount of space occupied by a file may be up to 32768 times the size of file B for uncompressed files (128MB maximum extent data size / 4K minimum reference size = 32768) and up to 32 times the size of file B for compressed files (128KB maximum compressed extent size). In practice, a ratio from 1.5x to 4x is most common when files have unreachable blocks in their extents. This happens in files that experience a mix of sequential and random writes, or random writes and explicit defragmentation. This can result in much more free space than expected to be freed when a file is removed from the filesystem (including dedupe).

  • It is possible for two files to contain the same data, but not have the same extent layout, e.g. file A is 2x512K extents while file B is 5x200K extents plus a 24K extent at the end. In the case where the entire files are identical, bees will simply replace B's extents with A's extents and free all space in B in a single pass; however, if there are differences between a few individual blocks, then bees will make copies of extents of file B in a temporary file C, then replace B with references to A and C that are aligned on the correct extent boundaries to enable btrfs to free the extent. This requires some temporary space and multiple passes over the filesystem to complete. All the temporary space is freed by the end of the second pass (possibly earlier), and there will be a net space saving equivalent to the number of blocks of file B that were replaced with references to file A.

  • If file B is a reflink copy of file A, they already occupy the same space, and no further space saving is possible. bees will not try.

  • If file A and file B are less than 2KB (or the value of the max_inline mount option), then btrfs will store the entire file in a metadata item instead of a data block. Metadata items cannot be reflinked, so bees will ignore them. No space is freed in this case, and both files remain physically distinct.

  • If file B has multiple reflinks to files other than A, then no space from B will be freed until the last reflink to B is deduped. bees eventually processes all data in the filesystem, so it will eventually try to free all the space in B--unless the btrfs send workaround is used, in which case it is up to the user to delete the last referencing read-only subvol to free the space.

  • If the subvol containing file A or B has snapshot subvolumes elsewhere on the filesystem, no space will be freed until bees processes file B in the last referencing snapshot. Each snapshot will require additional non-shared metadata space to perform the dedupe operation, so the amount of free metadata space will decrease while the amount of free data space increases. This can result in a gotcha on btrfs filesystems that are close to being out of metadata space, as they may run out of space during dedupe.

  • btrfs has limitations on the number of references to a single extent that can be mapped from the ioctl API used by bees. If there are more than about 700,000 references to a shared extent (2730 references in kernel 4.14 and earlier), bees will not add or remove any further new references to these extents. If extents in file A or file B have too many references already, these extents will not be modified and the amount of total space freed depends on how much of the files consist of such extents.

  • btrfs has rare performance bugs when handling extents with unknown characteristics ("toxic extents"). About 0.001% of extents require millions of times more CPU work in the kernel to process compared to the average extent. The reason for this is not known (I've investigated it a few times over the years without success). bees measures kernel CPU performance when deduping extents to detect when this bug occurs. If too much kernel CPU is used (currently 100ms), bees will blacklist the extent and its contents to avoid contributing to further performance degradation. If file A contains any such extent, duplicate copies of those extents will not be freed in file B, nor any other file in the filesystem. The amount of space freed will depend on how many extents in file B are blacklisted. This is normally a rare event, but large files with many duplicate blocks (e.g. VM images) can trigger the bug more frequently.

  • If bees encounters too many copies of the same block contents in the same extent (e.g. long runs of 0xff which are common in raw disk images), the current algorithm runs too long as it attempts to evaluate all possible extent alignments (it's O(N^2), but N = 32768, and each iteration of the loop isn't cheap). This takes far too long to run to be useful. In current bees code, this situation is detected and an exception thrown to abandon processing the extent. If file A and file B each contain many contiguous copies of a single block of data, they may not be fully deduplicated. Note that if this workaround is removed, the extent still cannot be fully deduplicated because it will typically become toxic or exceed the reference count limits. (Future versions of bees may correct this.)

  • If file A and B contain only zero bytes, then all the space of both files will be freed. bees replaces all blocks that contain only zero bytes with holes, i.e. the extent is not deduplicated with anything, it is deleted entirely instead (the kernel fills in the missing bytes with memset when the file is read later). All preallocated extents (from fallocate) are deleted in similar fashion. This sometimes requires temporary space if the zero blocks are located in extents that contain non-zero data, i.e. the non-zero data has to be copied to a new extent before btrfs can remove the original extent.

  • bees uses 64-bit hashes to reduce hash table memory requirements. A cryptographically robust hash is not required because bees and the btrfs kernel code both perform a bit-for-bit compare on all data during the final dedupe step, so the hash only needs to be long enough for an acceptably low collision rate. It is possible that non-duplicate blocks will have identical hashes. bees can handle up to 256 distinct data blocks with any one hash value. After that limit is exceeded, only the most recently scanned 256 blocks with that hash value can be deduplicated. If file A and B consist of more than 256 different blocks with the same hash value, some blocks may not be deduplicated and some space may not be freed. (There are some additional file layout constraints that must be met before this can happen, but it's a corner case and this document is already too long).

  • btrfs does not allow extents to be shared between file A and file B if file A is nodatasum and file B is datasum (or vice versa). btrfs requires both files to be datacow or both to be nodatacow before allowing dedupe. bees only deduplicates datacow files (i.e. the default on btrfs) and ignores all nodatacow files. If a nodatacow file is deduped, the shared extents automatically switch back to datacow, which defeats the purpose of nodatacow.

  • bees will never allocate additional hash table memory if the hash table runs out of space. bees dedupe success rates degrade proportionally if the file system grows too large for the hash table. Old hashes and data block addresses are evicted from the hash table in LRU order with random perturbation. Duplicate hashes are kept longer in the table than unique data hashes. In cases where there is other data between files A and B in the order bees reads them, some or all of the hashes of file A may be evicted from the hash table while reading the data between A and B, and bees will not be able to match file A blocks to file B later on. Some or all of the extents of file B will not be released to free space as a result.

Combinations of the above may occur in btrfs. A file may be compressed, reducing its size by 50%, but may also have a 2x expansion due to unreachable blocks, doubling its size, so the amount of space used is the same as the file size even though the file is compressed. Shared extents may have references from snapshots and dedupe, so each snapshot multiplies the number of references by the number of reflinks inside the subvol.
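If you want to see how much of this applies to your own files A and B, one rough way is to compare shared vs. exclusive usage before and after bees has gone over them (a sketch; compsize is a separate tool that may need to be installed, and the paths are illustrative):

    # per-path accounting: Total vs. Exclusive vs. Set shared
    sudo btrfs filesystem du -s /path/to/A /path/to/B

    # on-disk usage vs. referenced bytes (also shows compression), per path
    sudo compsize /path/to/A /path/to/B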

Zygo avatar Jun 27 '19 02:06 Zygo

One more:

  • btrfs balance, device delete, and shrinking resize operations (collectively, "relocation operations") relocate extents without updating their generation number (aka transid). bees detects new extents in the filesystem using the extent's transid, so any extent with an old transid will not be scanned again even if it appears in a new location. If bees scans file A, then a balance relocates file A, bees will use an outdated location for file A when matching blocks in file B are encountered, and dedupe will fail. This case is treated the same as if the data had been overwritten or deleted, and the corresponding hash table entries will be removed to make room for hash table entries that bees can still find.

Zygo avatar Jun 29 '19 01:06 Zygo

If you have two big files A and B with only a slight difference, bees can still dedupe most of the blocks, because bees works at the block/extent level.

Does this work for files inside a qcow2 virtual disk image? And dedupe shouldn't work under the nocow mount option, should it?

daiaji avatar Nov 01 '19 14:11 daiaji

@daiaji If the file was created with the btrfs nocow attribute (see lsattr), it won't be touched by bees. That has nothing to do with qemu using a CoW format internally for its image file. Actually, you probably don't want to use qcow2 on btrfs because it's essentially CoW on CoW. That means you either create the file with chattr +C (nocow), or you don't use qcow2 at all and use a raw disk format instead (which you can keep sparse).
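For example (a sketch; the directory path and image name are made up, and chattr +C must be set before the file is created, it cannot be applied retroactively to a non-empty file):

    # directory whose newly created files default to nocow (and therefore nodatasum)
    mkdir -p /var/lib/vmimages
    chattr +C /var/lib/vmimages

    # sparse raw image: unwritten areas take no space on disk
    qemu-img create -f raw /var/lib/vmimages/win10.img 100G
    # or equivalently:
    # truncate -s 100G /var/lib/vmimages/win10.img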

kakra avatar Nov 01 '19 14:11 kakra

If the qcow file contains 4K aligned data blocks and the file is not marked nodatasum, bees can dedupe the blocks. If you do e.g. a dist-upgrade in two different VM images, identical files will be written to filesystems in both VM images, and bees will typically dedupe those. bees is not likely to dedupe 100% of them due to various limitations and bug workarounds, but the dedupe hit rate won't be 0% either. The workarounds exist to avoid severe performance bugs in btrfs that can be triggered when every possible duplicate block on the filesystem is removed.

The datasum attribute is the one that determines whether a file can be deduped in bees, but the datacow attribute is the one that users can see and control. lsattr reports nodatacow as C but does not report nodatasum at all. Internally, btrfs decides whether dedupe is allowed by looking only at nodatasum. When nodatacow is set, nodatasum is always also set.

nodatacow is a per-file attribute (if it is set in a directory, files created in the directory inherit the attribute). The mount option nodatacow sets the default attribute value for newly created files while the filesystem is mounted. If you had files in the filesystem before changing the mount option, the existing files keep their old datacow settings (on or off). Same for nodatasum, except the per-file nodatasum attribute is not accessible to users.
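You can check what you actually got with lsattr (continuing the hypothetical paths from the sketch above; remember that lsattr shows C for nodatacow, while nodatasum itself is never visible):

    # 'C' on the directory: files created inside it will inherit nocow
    lsattr -d /var/lib/vmimages

    # 'C' on the file itself: nodatasum is set too, so bees will skip it
    lsattr /var/lib/vmimages/win10.img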

qcow2 is required for qemu features like VM snapshots, so in some cases you have to run qcow2 even on btrfs. If you don't need those features in qemu, then raw image on btrfs with datacow usually performs better than qcow2, and the other btrfs features like csums, compression, dedupe, snapshots work normally with a raw VM image file.

Note that when you make a btrfs snapshot or a reflink copy of a nodatacow file, each individual extent of the file becomes datacow as long as shared references exist. This removes many of the performance gains of nodatacow, and the effect is persistent (i.e. you have to do a full defrag on the file to get the original performance back).

Zygo avatar Nov 01 '19 15:11 Zygo

I understand. So it seems that even if I use the NTFS filesystem inside the virtual disk image, dedupe can still hit? Since I am using UEFI I can't use qemu snapshots, so should I use a raw virtual disk image file?

sudo btrfs filesystem usage /home
Overall:
    Device size:                   3.49TiB
    Device allocated:              1.22TiB
    Device unallocated:            2.27TiB
    Device missing:                  0.00B
    Used:                        972.49GiB
    Free (estimated):              2.54TiB      (min: 2.54TiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:1.21TiB, Used:969.53GiB
   /dev/nvme0n1p1          1.21TiB

Metadata,single: Size:5.01GiB, Used:2.95GiB
   /dev/nvme0n1p1          5.01GiB

System,single: Size:4.00MiB, Used:160.00KiB
   /dev/nvme0n1p1          4.00MiB

Unallocated:
   /dev/nvme0n1p1          2.27TiB

qemu-img convert -f qcow2 -O raw ~/.vm/Win10.qcow2 ~/.vm/Win10.img

rm -rf ~/.vm/Win10.qcow2
   
sudo btrfs filesystem usage /home
Overall:
    Device size:                   3.49TiB
    Device allocated:              1.07TiB
    Device unallocated:            2.42TiB
    Device missing:                  0.00B
    Used:                        971.70GiB
    Free (estimated):              2.54TiB      (min: 2.54TiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 80.00KiB)

Data,single: Size:1.06TiB, Used:968.77GiB
   /dev/nvme0n1p1          1.06TiB

Metadata,single: Size:5.01GiB, Used:2.94GiB
   /dev/nvme0n1p1          5.01GiB

System,single: Size:4.00MiB, Used:160.00KiB
   /dev/nvme0n1p1          4.00MiB

Unallocated:
   /dev/nvme0n1p1          2.42TiB

It seems that the qcow2 and raw image dedupe hit rates are similar.

Data,single: Size:1.06TiB, Used:968.77GiB /dev/nvme0n1p1 1.06TiB

However, it does seem that the raw image's dedupe hit rate is slightly higher.🥰

But at least raw virtual disk performance will be better. And from the storage usage point of view, even with a raw virtual disk image the usage stays flexible, and I can even take a QEMU UEFI snapshot! Bees is awesome!

daiaji avatar Nov 01 '19 16:11 daiaji

@Zygo Off-topic: does btrfs copy-on-write reduce the performance of sequential file reads and writes, or does it degrade random-access performance?

daiaji avatar Nov 01 '19 17:11 daiaji

Copy-on-write allows all writes to be continuous--since every write relocates data, all writes can be relocated to contiguous areas, even if the writes themselves are randomly ordered.

If a file is written randomly, then later sequential reads will be slower. The sequential logical order of the reads will not match the random physical order of data on the disk.

If a file is written continuously, then later sequential reads will be faster. This is how the btrfs 'defrag' feature works--it simply copies fragmented data into a contiguous area in order, so that future sequential reads are in logical and physical order at the same time.
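For reference, that copy can be requested explicitly (a sketch; note that defragmenting snapshotted or deduplicated files rewrites their extents and can unshare them, so it can cost space):

    # rewrite files under /mnt/data into larger contiguous extents (target 32M)
    sudo btrfs filesystem defragment -r -v -t 32M /mnt/data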

If a file is read continuously, then performance will be proportional to the size of each non-consecutive fragment. There will be one seek per fragment, plus another seek to read a new metadata block on every ~100th fragment. On SSDs the seeks are replaced with IO transaction overheads, which are almost as expensive as physical head movements on SATA SSD devices.

If a file is read randomly (e.g. hash table lookups), then performance will be close to the worst-case rate all the time.

Data extent fragmentation makes random read performance a little worse, but metadata pages usually fit in RAM cache, so once the cache is hot, only the data block reads contribute significantly to IO load. If you have a really large file and the metadata pages don't fit in RAM cache, then you'll take a metadata page read hit for every data block, and on a fast SSD that can be an 80% performance loss (one 16K metadata page random read for each 4K data page random read). Slow disks only have a 50% performance loss (the seek time dominates, so the 16K random read costs about the same as the 4K one).

Double the RAM cache costs and/or performance losses from fragmentation if csums are used (each read needs another O(log(n)) metadata page lookup for the csum).

Zygo avatar Nov 01 '19 18:11 Zygo

On SSDs the seeks are replaced with IO transaction overheads, which are almost as expensive as physical head movements on SATA SSD devices.

Wait, what? This sounds much worse than I always thought. I knew that seeks on SSDs aren't free (i.e., 4k random reads in benchmarks always show how slow it is) but I thought this was limited by the SSD IOPS limit only, and not that there's a huge seek penalty.

kakra avatar Nov 01 '19 19:11 kakra

@Zygo Another digression: can dedupe reduce SSD write amplification? Is a deduped block never actually written to the SSD? As we all know, btrfs being CoW already causes a certain degree of SSD write amplification, so I don't want dedupe to increase it further.

daiaji avatar Nov 01 '19 19:11 daiaji

On SSDs the seeks are replaced with IO transaction overheads, which are almost as expensive as physical head movements on SATA SSD devices.

Wait, what? This sounds much worse than I always thought. I knew that seeks on SSDs aren't free (i.e., 4k random reads in benchmarks always show how slow it is) but I thought this was limited by the SSD IOPS limit only, and not that there's a huge seek penalty.

Sounds like extra read load.

daiaji avatar Nov 01 '19 19:11 daiaji

Another digression: can dedupe reduce SSD write amplification? Is a deduped block never actually written to the SSD?

That's not how bees works: bees always acts after the fact. Data A is written, data B is written, some seconds later, bees picks up, reads A and B, compares, and writes data C (which is the blocks that A and B have in common). So the math is more like 3x the data. But if you're lucky, bees picks up the data fast enough that the kernel spares actually writing A and B to disk. I don't think that is likely to happen, though. @Zygo may know more...

kakra avatar Nov 01 '19 19:11 kakra

The SATA interface protocol imposes a large latency penalty on short transactions compared to other interfaces like NVME, so very small fragments affect SATA SSDs more than NVME ones. The effect is usually reflected in the table of throughput vs IOPS for the drive.

Currently all btrfs dedupe (including bees) is done after writes to the disk are completed, i.e. the data is first written to disk, then replaced with a reflink and removed by a separate commit later. bees will leave the data alone for several btrfs transactions to avoid unnecessarily deduping files that are frequently overwritten, but it cannot prevent the duplicate writes from occurring in the first place.

The only way to have dedupe decrease writes on btrfs is to place dedupe somewhere on the kernel's data write path, i.e. do the dedupe before and instead of writing the data to disk. There are patches floating around to do this but last time I checked they were not ready for merge yet.

The situation for FTL write amplification inside the SSD is much less clear. There is always some write amplification on SSDs unless the filesystem always writes in exactly the minimum number of erase-block-sized writes. If discard is enabled and the minimum amount of disk space is used, then the write load can be spread out over more erase blocks, so dedupe can help prolong drive lifetime even with a net increase in writes from the host.
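The usual discard knobs, for completeness (a sketch; availability of discard=async depends on your kernel version, and the mount point is illustrative):

    # one-off trim: tell the SSD which blocks the filesystem no longer references
    sudo fstrim -v /mnt/filestore

    # periodic trim via the util-linux timer (often preferable to continuous discard)
    sudo systemctl enable --now fstrim.timer

    # or mount with asynchronous discard on recent kernels, e.g. in /etc/fstab:
    # UUID=<fs-uuid>  /mnt/filestore  btrfs  defaults,discard=async  0 0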

In practice, cheap SSDs have bad TBW ratings in part because of lower-quality hardware and in part because their firmware doesn't implement good FTL algorithms. So you start with short-lived hardware and wear it out more quickly with bad firmware.

Zygo avatar Nov 01 '19 19:11 Zygo

@Zygo It sounds like discarding duplicate blocks directly could even reduce fragmentation. After the first dedupe there would only be non-duplicate data blocks on disk, no more duplicate blocks would ever be written, and only reflinks would keep increasing; defragmenting reflinks seems simpler than defragmenting the actual data blocks! Could the reflinks and the non-duplicate data be handed over to btrfs for native defragmentation? Could duplicate blocks be dropped before any I/O happens? I'm afraid any actual I/O would make this less efficient, even if it only writes to memory, because "the CPU is always faster than I/O". It sounds crazy, but the performance gap could be significant in I/O-intensive scenarios. Could it even handle files smaller than 4K? At least that would reduce the writing of small files, though it's doubtful whether writing a reflink is cheaper than writing the actual small file.

daiaji avatar Nov 01 '19 21:11 daiaji

@daiaji It's still not how bees works: Duplicate data is still written, bees will read it later and find it could be shared. There's no magic algorithm which would say "hey we have this nifty reflink here and it matches data we want to write, let's just increase the reflink" - that's not how it works. And the "just" is not as simple as it sounds.

What you want is in-band dedup which detects duplicate blocks before writing them out to disk. I think ZFS can do that (and needs huge amounts of RAM for this), and as @Zygo already pointed out, there are similar patches for btrfs but their stability is unknown, and these patches also trade RAM/additional IO for dedup detection. It's probably slow if you're not expecting a high dedup rate on writes. Bees is not designed to work that way so it would probably never be an option.

kakra avatar Nov 01 '19 22:11 kakra

@kakra I know about in-band dedup! I just didn't know how it works, but now I see how it works: at least some memory is needed as a staging area.

daiaji avatar Nov 01 '19 22:11 daiaji

@daiaji BTW: If highly redundant data is your primary use-case, you may have better results with content-addressed storage solutions: Git does this, borgbackup does this, ipfs does this, but their primary concern is different: These are more like archive solutions.

casync (and maybe desync) also does this, and can leverage btrfs capabilities when used with btrfs as the backend storage: it is a bit like rsync, but instead of fetching every data block from the remote, only chunks not already present in a local store are transferred and cached there (with seed support), and the data is then written out as reflink copies from that cache.

I.e., I'm storing around 100 full backups of my btrfs in a borg archive: 100 x 2.7 TB compressed down to around 3.4 TB. That's a history of 100 different days of backups, each of which can be accessed and restored individually and completely. Creating a full backup of my btrfs takes around 20-30 minutes this way.
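For the record, the borg side of that looks roughly like this (a sketch; repository location, compression level and archive name pattern are just examples):

    # content-defined chunking gives dedup across all archives in the repository
    borg init --encryption=repokey /backup/borg-repo
    borg create --stats --compression zstd,3 \
        /backup/borg-repo::'{hostname}-{now}' /mnt/data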

kakra avatar Nov 01 '19 22:11 kakra

Data A is written, data B is written, some seconds later, bees picks up, reads A and B, compares, and writes data C (which is the blocks that A and B have in common).

That's the opposite of what happens. If bees can simply delete B without copying anything, it does; otherwise, bees creates a copy of the unique data in B, then uses a combination of the partial copy of B and the original A to remove B.

In btrfs, extents are immutable. To illustrate this, I'll use phrases, words, and letters to represent files, extents, and 4K blocks. Suppose we have two files with two extents each, "NATIONAL EMERGENCY" and "EMERGENT NATION". The btrfs rules are:

  • Only entire words (extents) can be created or destroyed. e.g. we cannot shorten "NATIONAL" to "NAT" or "ION", we can only create a new word (extent), copy the blocks we want to keep, then delete the entire word (extent) "NATIONAL".
  • We can create references to parts of the word (extent), e.g. everywhere we find "NATION" or "AT", we can make a reference to some but not all the blocks in "NATIONAL". We show this as capital letters for the referenced blocks and lowercase for the unreferenced blocks, i.e. "NATIONal" or "nATional". The physical extent "NATIONAL" is contiguous, but references to blocks within that extent can be inserted into any point in the filesystem.

So bees does something like this:

  • In "NATIONAL" the letter N appears twice, but we don't dedupe within an extent because compression is usually better than dedupe in this case (zstd and zlib compression can compress identical blocks in an extent, lzo does not). Same for the duplicate "A".
  • In "NATION", bees matches the "N" that is common to both "NATION" and "NATIONAL", then reads the surrounding blocks and notices that "ATION" also matches. bees can replace "NATION" with a reference to part of "NATIONal". The physical extent "NATIONAL" still exists, but the "NATIONal" logical extent in the second file is now a reference to the first 6 blocks of "NATIONAL". The original "NATION" extent has no further references so it is deleted.
  • In "EMERGENCY" the letter (block) "N" appears again, but the adjacent letters (blocks) don't match, so bees creates a temporary file containing "EMERGE CY", then replaces "EMERGENCY" with "EMERGE National CY". The original "EMERGENCY" now has no references so btrfs deletes it.
  • In "EMERGENT" bees finds matches for the freshly-created "EMERGE" and "N" in the "NATIONAL" extent. bees also finds the "T" in "NATIONAL"; however, since we already have a reference to one of the "N" blocks in "NATIONAL", bees does not use the "T". We don't reference the same extent twice in a single dedupe operation to avoid btrfs performance bugs, so we create a temporary "T" extent instead. bees then replaces "EMERGENT" with "EMERGE National T", and btrfs deletes "EMERGENT".
  • The disk now contains these physical extents: "NATIONAL EMERGE CY T". The original two files consist of references to these in the same logical order: "NATIONAL EMERGE National CY" and "EMERGE National T NATIONal". There are some seeks now because these logical extent orders do not match the physical extent order. Unfortunately btrfs (and Linux VFS) do not cache shared extents with shared cache, so a sequential read of both files reads the same physical blocks from "NATIONAL" 4 times.

In reality the current bees algorithm reads blocks in order and does not have cost-based decision-making models. It also can't combine extents and is sensitive to data order. "NATIONAL ION NATION" becomes "NATIONAL natIONal NATIONal" (the optimal result, one physical extent with 3 logical references), but "ION NATION NATIONAL" becomes "ION NAT ION NAT ION AL" (NATION and NATIONAL are discarded, new small fragments "NAT" and "AL" are created, 3 physical extents with 6 references).

Zygo avatar Nov 02 '19 02:11 Zygo

@Zygo So at least discarding duplicate data is a good way to reduce disk fragmentation, right? Since a reflink itself is smaller than 4K, bees wouldn't need to process them, and the non-duplicate data blocks could be handled by btrfs itself. In fact, the whole process would only need to write the data to disk once, with no further writes afterwards and no need to delete duplicate blocks already on disk, so the continuity of the data wouldn't be destroyed. Avoiding redundant write operations in itself would minimize write amplification.

daiaji avatar Nov 02 '19 03:11 daiaji

If bees encounters too many copies of the same block contents in the same extent (e.g. long runs of 0xff which are common in raw disk images), the current algorithm runs too long as it attempts to evaluate all possible extent alignments (it's O(N^2), but N = 32768, and each iteration of the loop isn't cheap). This takes far too long to run to be useful. In current bees code, this situation is detected and an exception thrown to abandon processing the extent. If file A and file B each contain many contiguous copies of a single block of data, they may not be fully deduplicated. Note that if this workaround is removed, the extent still cannot be fully deduplicated because it will typically become toxic or exceed the reference count limits. (Future versions of bees may correct this.)

@Zygo Sorry to bump an old thread, but I was wondering if your above statement is still the case in the current release of bees.

I ask, because it seems like a good use case for dedupe would be in raw disk images, as day over day, the delta of what has changed is probably pretty small. For example, I take nightly raw images (dd) of various raspberry pis I have and keep a rolling 30 days. Most of them are 16 GB in total, but only about 5-10 GB are used. So that leaves a lot of it as empty space. These typically compress very well (because all of the empty space compresses down). My hope would be that I could take advantage of dedupe, so that really only the net change between two days of raw images would be what was not deduped.

But it sounds like for these types of files, bees will just stop processing them, and move on. Your comment mentions that a future version of bees may correct this. I was wondering if that had been done, or if it is still in the planning phase.

ubenmackin avatar Oct 08 '20 03:10 ubenmackin

If the empty space contains zero, that's a special case. bees dedupes those with holes, which removes the block from the filesystem. There is no need for a duplicate block in that case--the blocks can be deleted entirely.

For compressed extents it's more complicated. Removing a 0 block in the middle of a compressed extent reduces compression efficiency and takes up more space on disk. Removing 0 blocks from either end of a compressed extent is OK, and a compressed extent that consists entirely of zeros will be entirely deleted.
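You can watch that happen from userspace, since the apparent file size stays the same while the allocated blocks shrink (a sketch with made-up paths; fallocate --dig-holes is an unrelated util-linux feature that punches holes into zero-filled ranges without any dedupe):

    # apparent size vs. blocks actually allocated on disk
    du -h --apparent-size /mnt/images/pi.img
    du -h /mnt/images/pi.img

    # independent of bees: make zero-filled ranges sparse in place
    # fallocate --dig-holes /mnt/images/pi.img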

Now we get to non-zero cases. If the empty space contains a frequently repeating non-zero pattern that is only one block long, bees would normally try to dedupe every instance of that block to a single block. In big VM images there can be billions of these, and btrfs will break trying to handle that many backrefs. So bees does a few things to avoid that:

  • kernel thread CPU usage is measured after every operation. If it gets too high (indicating btrfs is encountering a performance bug), the extent is marked "toxic" and abandoned.
  • reference counts are measured. If there are too many to be reported by the LOGICAL_INO ioctl (just under 700,000) then the extent is also abandoned.
  • dedupe between two non-contiguous block pairs in the same extent is not permitted to avoid generating millions of single-block references. e.g. a single extent ABCABCABC cannot be replaced with 3 references to a single instance of ABC; however, if there are two duplicate extents with half of a repeating pattern each, they can deduplicate with a third extent.

The plan for future versions is to switch from a first-match algorithm to a best-match algorithm with cost estimation. It will prefer to group short repeating patterns into larger extents (creating them itself if required) and avoid generating too many too small references. A repeating pattern would be stored once in the maximum possible extent length in order to benefit from compression, then referenced in smaller fragments as required.

There is also a limitation in the current version where a repeating pattern like "ABABABAB" causes too many loops in the matching code and triggers an exception to abort loops with trillions of possible iterations. That's just a bug that will be fixed when the matching algorithm is replaced.

Zygo avatar Oct 08 '20 03:10 Zygo

That said, these protection measures only account for a few percent of VM images. A lot of dedupe still happens.

Zygo avatar Oct 08 '20 03:10 Zygo