Add "dedupe success rate" in documentation
I read https://github.com/Zygo/bees/blob/master/docs/how-it-works.md as well as the other pages, but couldn't find anything that simply showed me how much space was saved by deduping data.
I'm assuming it's one of the 100+ numbers in .beeshome/beesstats.txt, but honestly it wasn't obvious which one. Seeing bytes saved and, if possible, a percentage like "40% saved" would be stellar.
Currently you have to run `compsize` before and after to see how much space was saved.
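As a sketch of that before/after workflow (the mount point and byte totals below are invented for illustration, and compsize's human-readable sizes have to be converted to bytes by hand):

```shell
# Hypothetical workflow: note compsize's "Disk Usage" total before and
# after a bees pass, then compute the savings. These numbers are made up.
#   compsize /mnt/data   -> Disk Usage 116G  (before bees)
#   compsize /mnt/data   -> Disk Usage  94G  (after bees)
before=$((116 * 1024**3))
after=$((94 * 1024**3))
awk -v b="$before" -v a="$after" \
  'BEGIN { printf "saved %d bytes (%.1f%%)\n", b - a, (b - a) * 100 / b }'
# -> saved 23622320128 bytes (19.0%)
```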
bees does not track the amount of space saved itself. It only tracks the number of bytes it has sent to individual kernel dedupe ioctl calls (this is `dedupe_bytes` in `beesstats`). None of the dedupe calls saves any space until the last reference to each extent is processed, and bees does not know when that happens, so it can't report the information. bees also does not count the number of extent references in the filesystem to be processed, so it doesn't have the denominator for the percentage either.
Kernels up to 5.6 can provide the information bees would need, but the lookups are slow (seconds to minutes for frequently appearing data, and you can only safely run one thread at a time) and block all writers to the filesystem while they run.
So far, 5.7 is a lot better (like, hundreds of times better, bugs from 2012 are finally getting fixed in 5.7-rc and the performance improvement for bees is impressive...between crashes). Maybe once that kernel lands we can start moving bees forward again.
@Zygo thanks for the answer. I'm indeed on 5.6 and great to hear that 5.7 is going to make things better. In the past I've used something like `hardlink.py -c -f -x options.txt -x Makefile -x album_icon.jpg -x .directory -x Entries -x Repository -x init.pyc /dir` and it gives a very nice report of how much it was able to save via hardlinks.
I understand though that it's not as easy here, as you just explained.
As for `compsize`, I hadn't used it before. Is it better than looking at the numbers from `btrfs fi show`? It'd be great to add a bit to your documentation explaining how to interpret it compared to `fi show`.
```
sauron:~# compsize /mnt/btrfs_boot/
Processed 33434320 files, 1527522 regular extents (27821219 refs), 14135568 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       80%          94G          116G         2.0T
none       100%          69G           69G         1.3T
lzo         51%          24G           47G         715G
```
Either way, any tips you can put in the documentation on how to see how well bees worked would give people the satisfaction of knowing how much space they saved thanks to you ;)
The df 'used' column reports the total space usage across the filesystem; `btrfs fi show` gives that information broken down by drive. For dedupe purposes, simply comparing the df 'used' before and after is sufficient. The df 'avail' column also counts metadata, and btrfs can allocate a GB chunk of metadata at any time, so it messes up dedupe scoring.
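A minimal sketch of that df comparison, assuming GNU coreutils `df` (the `used_bytes` helper name is made up, and `/` stands in for your btrfs mount point):

```shell
# Print the 'used' bytes for the filesystem containing the given path.
# -B1 reports plain bytes so before/after values can be subtracted directly.
used_bytes() {
  df -B1 --output=used "$1" | tail -1 | tr -d ' '
}

before=$(used_bytes /)   # measure before running bees
# ... let bees finish a pass, then:
after=$(used_bytes /)
echo "freed $((before - after)) bytes"
```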
`compsize` gives disk usage (the number of bytes occupied on the disk for the data extents), uncompressed (how much data those bytes hold after decompression, if any), and referenced (the total bytes after decompression and dedupe, or roughly how big an uncompressed tar file containing the data would be). `compsize` doesn't report on metadata space usage, which...changes (csums are deleted, but there are more reflink items; either can be larger than the other). `compsize` can also be confused by various things (like hardlinks, and file extents that are shared with files that `compsize` did not inspect).
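One rough, unofficial way to turn those columns into a single number is Referenced / Uncompressed, i.e. how many logical bytes each unique stored byte serves after dedupe. Using the TOTAL row from the sample output above (values converted to bytes by hand):

```shell
# Back-of-the-envelope reading of compsize's TOTAL row; not an official
# compsize metric, just one way to interpret the columns.
uncompressed=$((116 * 1024**3))   # "Uncompressed" column: 116G
referenced=$((2 * 1024**4))       # "Referenced" column: 2.0T
awk -v u="$uncompressed" -v r="$referenced" \
  'BEGIN { printf "roughly %.1f logical bytes per stored byte\n", r / u }'
# -> roughly 17.7 logical bytes per stored byte
```

Hardlinks and reflinks from uninspected files skew this number, as noted above, so treat it as an estimate.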
> any tips you can put in the documentation on how to see how well bees worked
Also...bees doesn't work very well. It tends to gain 70% at the same time it loses 30% on some test corpora. There's still a net 40% gain, but it should be 70%. On some pathological workloads the losses exceed the gains, and space is lost. It's a known issue: the lack of data about how much space is saved is partly due to not having the information needed to save space without occasionally incurring avoidable losses. I know how to fix it, but it does mean starting over with big chunks of bees code. To be fixed in some future version...