
Add "dedupe success rate" in documentation

Open marcmerlin opened this issue 4 years ago • 3 comments

I read https://github.com/Zygo/bees/blob/master/docs/how-it-works.md as well as the other pages, but couldn't find anything that simply showed me how much space was saved by deduping data.

I'm assuming it's one of the 100+ numbers in .beeshome/beesstats.txt, but honestly it wasn't obvious which one. Seeing bytes saved and, if possible, a percentage like "40% saved" would be stellar.

marcmerlin avatar May 17 '20 01:05 marcmerlin

Currently you have to run compsize before and after to see how much space was saved.

bees does not track the amount of space saved itself. It only tracks the number of bytes it has sent to individual kernel dedupe ioctl calls (this is dedupe_bytes in beesstats). None of the dedupe calls saves any space until the last reference to each extent is processed, and bees does not know when that happens, so it can't report the information. bees also does not count the number of extent references in the filesystem to be processed, so it doesn't have the denominator for the percentage either.
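As described above, dedupe_bytes only counts bytes submitted to the kernel dedupe ioctl, not bytes actually freed. A minimal sketch of pulling that counter out of a stats dump, assuming a simple key=value line layout (the real beesstats.txt format may differ, and the sample numbers are made up):

```python
# Hypothetical sketch: extract the dedupe_bytes counter from a
# beesstats-style dump. The key=value layout and the numbers here
# are assumptions; the real beesstats.txt format may differ.
sample_stats = """\
block_read=123456789
dedupe_bytes=987654321
hash_insert=55555
"""

def read_counter(text, key):
    """Return the integer value for `key`, or None if absent."""
    for line in text.splitlines():
        if line.startswith(key + "="):
            return int(line.split("=", 1)[1])
    return None

# Remember: this is bytes *sent to* the dedupe ioctl, not bytes saved.
print(read_counter(sample_stats, "dedupe_bytes"))
```

Even with the counter in hand, it is an upper bound on activity, not a measure of space reclaimed, for the reasons given above.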

Kernels up to 5.6 can provide the information bees would need, but the lookups are slow (seconds to minutes for frequently appearing data, and you can only safely run one thread at a time) and block all writers to the filesystem while they run.

So far, 5.7 is a lot better (like, hundreds of times better, bugs from 2012 are finally getting fixed in 5.7-rc and the performance improvement for bees is impressive...between crashes). Maybe once that kernel lands we can start moving bees forward again.

Zygo avatar May 19 '20 06:05 Zygo

@Zygo thanks for the answer. I'm indeed on 5.6, and it's great to hear that 5.7 is going to make things better. In the past I've used something like 'hardlink.py -c -f -x options.txt -x Makefile -x album_icon.jpg -x .directory -x Entries -x Repository -x init.pyc /dir' and it gives a very nice report of how much it was able to save via hardlinks.

I understand though that it's not as easy here, as you just explained.

As for compsize, I hadn't used it before. Is it better than looking at numbers from btrfs fi show? It'd be great to add a bit to your documentation to explain how to interpret it compared to fi show.

sauron:~# compsize /mnt/btrfs_boot/
Processed 33434320 files, 1527522 regular extents (27821219 refs), 14135568 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       80%       94G         116G         2.0T       
none       100%       69G          69G         1.3T       
lzo         51%       24G          47G         715G   

Either way, any tips you can put in the documentation on how to see how well bees worked would give people the satisfaction of knowing how much space they saved thanks to you ;)

marcmerlin avatar May 19 '20 15:05 marcmerlin

The df 'used' column reports the total space usage across the filesystem. btrfs fi show gives that information broken down by drive. For dedupe purposes, simply comparing the df 'used' before and after is sufficient. The df 'avail' column also counts metadata, and btrfs can allocate a GB chunk of metadata at any time, so it messes up dedupe scoring.
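That before/after comparison is just a subtraction on the df 'used' column. A sketch with made-up byte counts standing in for two df runs (in a real check, the two numbers would come from running df before and after a dedupe pass):

```python
# Hypothetical df 'used' figures (1 KiB blocks) before and after a
# bees run. Only the difference matters; 'avail' is unreliable for
# this because btrfs may allocate a metadata chunk at any time.
used_before_kib = 121_634_816   # assumed: df output before dedupe (116 GiB)
used_after_kib = 98_566_144     # assumed: df output after dedupe (94 GiB)

saved_kib = used_before_kib - used_after_kib
saved_pct = 100 * saved_kib / used_before_kib
print(f"saved {saved_kib} KiB ({saved_pct:.1f}%)")
```

The percentage here is relative to the pre-dedupe usage, which sidesteps the missing-denominator problem described earlier at the cost of only measuring one before/after interval.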

compsize gives disk usage (the number of bytes occupied on the disk for the data extents), uncompressed (how much data those bytes hold after decompression if any), and referenced (the total bytes after decompress and dedupe, or roughly how big an uncompressed tar file containing the data would be). compsize doesn't report on metadata space usage, which...changes (csums are deleted, but there are more reflink items, either can be larger than the other). compsize can also be confused by various things (like hardlinks, and file extents that are shared with files that compsize did not inspect).
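Using the compsize output quoted above as an example, the combined compression-plus-dedupe saving can be read off the TOTAL row (94G on disk versus 2.0T referenced). A small worked calculation, assuming compsize's G and T suffixes mean GiB and TiB:

```python
# Figures from the compsize TOTAL row quoted above;
# assuming the G/T suffixes mean GiB/TiB.
disk_usage_gib = 94.0          # bytes actually occupied on disk
referenced_gib = 2.0 * 1024    # 2.0T of referenced (post-reflink) data

# Fraction of the referenced data that needed no disk space,
# from compression and reflinks/dedupe combined.
savings = 1 - disk_usage_gib / referenced_gib
print(f"{savings:.1%} saved versus fully expanded data")
```

Note this number mixes compression and dedupe together; comparing the Uncompressed column against Referenced would isolate the reflink/dedupe share, subject to the caveats above about hardlinks and uninspected files.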

any tips you can put in the documentation on how to see how well bees worked

Also...bees doesn't work very well. On some test corpora it gains 70% while losing 30% at the same time. That's still a net 40% gain, but it should be 70%. On some pathological workloads, the losses exceed the gains, and space is lost. It's a known issue: the lack of data about how much space is saved is partly the same missing information bees would need to save space without occasionally incurring avoidable losses. I know how to fix it, but it means starting over with big chunks of bees code. To be fixed in some future version...

Zygo avatar Sep 15 '20 18:09 Zygo