
RFC: Proposal to add alternative hashing function

Open PeeJay opened this issue 6 years ago • 11 comments

I noticed that my bees process was using 100% of 4 cores on my CPU, which, when running 24/7, must use quite a bit of electricity and generate heat. A quick perf top showed that crc64 was the main culprit. I've already optimised the crc64 function as much as I can, so I started wondering if there was a SIMD-optimised version or an alternative hash function I could use. I came across xxHash and it seems ideal, especially for 64-bit CPUs (which I suspect most machines running bees will be). It is ~10x faster and has fewer collisions. A quick mod to https://github.com/PeeJay/smhasher gives the following results:

x32 :
crc64    - 0.276 bytes/cycle -   790.67 MiB/sec @ 3 ghz
xxhash64 - 0.660 bytes/cycle - 1,888.16 MiB/sec @ 3 ghz

x64 :
crc64    - 0.563 bytes/cycle -  1,611.70 MiB/sec @ 3 ghz
xxhash64 - 4.120 bytes/cycle - 11,787.61 MiB/sec @ 3 ghz

I hacked it into bees at https://github.com/PeeJay/bees/tree/hash and have been using it for the last few days. To say it now screams along would still be an understatement.
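
For reference, the change boils down to something like this (a minimal sketch using the xxhash library's XXH64 one-shot API, not the exact code in my branch):

```cpp
#include <xxhash.h>   // libxxhash one-shot API: XXH64()
#include <cstddef>
#include <cstdint>

// Hypothetical drop-in for the crc64 block hash: hash one block with XXH64.
// The seed (0 here) is arbitrary; it just has to stay the same for every
// entry in a given hash table.
static uint64_t block_hash(const void *data, size_t len)
{
    return XXH64(data, len, 0);
}
```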

PeeJay avatar Feb 26 '19 11:02 PeeJay

I noticed a similar improvement when experimenting with CityHash. For some reason xxhash scored lower than cityhash for me, but either is much faster than crc64.

I have a larger change in progress to support https://github.com/Zygo/bees/issues/31#issue-269257801 but the change is really big (it has to support multiple hashes and block sizes in order to use btrfs crc32c as a block hash, so it's tangled up with a rewrite of the scanning code) and stalled due to lack of time to work on it.

Can you turn this into a PR, and make the hash function runtime selectable (i.e. --hash=crc64 or --hash=xxhash)?
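
Something along these lines would do for the selection; names here are hypothetical, not actual bees code:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <xxhash.h>

// Hypothetical --hash dispatch. crc64() stands in for the existing bees
// crc64 implementation; xxhash64() wraps XXH64 from libxxhash.
using hash_fn = uint64_t (*)(const void *, size_t);

uint64_t crc64(const void *data, size_t len);       // assumed to exist in bees

static uint64_t xxhash64(const void *data, size_t len)
{
    return XXH64(data, len, 0);
}

static hash_fn select_hash(const std::string &name)
{
    if (name == "crc64")  return crc64;
    if (name == "xxhash") return xxhash64;
    throw std::runtime_error("unknown --hash algorithm: " + name);
}
```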

Also note that there is an optimization in BeesBlockData::is_data_zero which assumes the hash of all-zero data is 0. That doesn't work for other hashes, which produce different values depending on the length of the all-zero data (i.e. you can't just precompute the all-zero hash value, you need a lookup table with length as an index). The "optimization" can be removed since its benefit is questionable--memcmp with 0 is faster anyway.
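
i.e. something like this instead of the hash comparison (a hypothetical sketch, not the actual BeesBlockData code):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch: test the block contents directly instead of comparing
// the block's hash against a precomputed "hash of all zeros" (which is
// length-dependent for hashes other than crc64).
static bool is_data_zero(const char *data, size_t len)
{
    static const char zeros[4096] = {0};        // assumes 4K bees blocks
    while (len > 0) {
        const size_t n = len < sizeof(zeros) ? len : sizeof(zeros);
        if (memcmp(data, zeros, n) != 0)
            return false;
        data += n;
        len  -= n;
    }
    return true;
}
```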

Zygo avatar Feb 27 '19 16:02 Zygo

I missed checking CityHash64; it's faster for me too, at around 18 GB/s on 64-bit and 2.5 GB/s on 32-bit. I'll use that instead. I also think I should write the hash algo into the hash table so that once the table is created it can't accidentally be used with the wrong algo.
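
Roughly what I have in mind for the on-disk tag (a sketch only, field names are made up):

```cpp
#include <cstdint>

// Sketch of a hash table file header; bees would refuse to load a table
// whose hash_algo doesn't match the algorithm selected at startup.
enum class HashAlgo : uint32_t {
    CRC64    = 0,
    XXHASH64 = 1,
    CITY64   = 2,
};

struct HashTableHeader {
    uint64_t magic;       // identifies the file as a bees hash table
    uint32_t version;     // header format version
    HashAlgo hash_algo;   // algorithm used for every entry in the table
};
```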

PeeJay avatar Feb 28 '19 01:02 PeeJay

I also think I should write the hash algo into the hash table so that once the table is created it can't accidentally be used with the wrong algo.

To do it right, you hash with the current algo, and if that fails to find a match, try the lookup again with previously used algo(s); if that still doesn't find anything, insert the hash from the current algo. That allows the old data to continue to be used without rescanning the entire filesystem. At some point the old data isn't in the hash table any more, and we can stop looking up with the old algorithms. It gets complicated to do it right.
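
In rough C++ terms, the "do it right" version looks something like this (an unordered_map stands in for the real hash table; all names hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of the layered lookup: try the current algorithm first, fall back
// to previously used algorithms for old entries, and insert new entries
// under the current algorithm only.
using hash_fn = uint64_t (*)(const void *, size_t);
using Addr = uint64_t;                            // stand-in for a block address

static std::unordered_map<uint64_t, Addr> table;  // stand-in for the hash table

bool find_or_insert(const void *block, size_t len, Addr addr,
                    hash_fn current, const std::vector<hash_fn> &previous,
                    Addr &match_out)
{
    auto hit = table.find(current(block, len));
    if (hit != table.end()) { match_out = hit->second; return true; }

    for (hash_fn old_fn : previous) {             // old entries, old algorithms
        hit = table.find(old_fn(block, len));
        if (hit != table.end()) { match_out = hit->second; return true; }
    }

    table[current(block, len)] = addr;            // miss: store under current algo
    return false;
}
```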

For a quick change that can be used in the interim, it's sufficient to just set the hash function at startup. Any data in the hash table that uses a different algorithm won't match at any significant rate, and the new hashed data will push out the old data over time. The user will have to reset beescrawl.dat and start over with a new full-filesystem scan using the new algorithm.

If we want to put it in beeshome and have bees do a bunch of things when it changes, then it overlaps with other hash table format changes I want to do (e.g. different bucket sizes, hash lengths to save space, resize the hash table).

Zygo avatar Feb 28 '19 05:02 Zygo

I was thinking the hash algo can be set on first run (defaulting to crc64 for now) and subsequent runs can just set it based on what the hash table uses. We can revisit after you do your changes.

ps. do you mind if I upgrade the build system to use cmake? It would be much simpler than using makefiles.

PeeJay avatar Mar 01 '19 10:03 PeeJay

@PeeJay Why not meson/ninja? It seems even simpler from what I've seen, plus it seems to have better dependency tracking.

kakra avatar Mar 01 '19 21:03 kakra

I'm not opposed to replacing makefiles (FWIW I think meson/ninja might be saner than cmake), but it's off-topic for this issue. Please create a new one.

Zygo avatar Mar 01 '19 23:03 Zygo

Am I right in understanding that bees:

  • Uses fixed-sized (after initial size is set) block-based dedupe
  • Would not dedupe data stored as both a standalone file and in a tar because the tar header offset would not make any blocks match

HaleTom avatar Aug 14 '19 13:08 HaleTom

@HaleTom bees dedupe offsets can only be as fine-grained as the block size, so it's unlikely to match anything inside and outside a tar file unless your tar headers happen to be a multiple of the block size.

But I wonder if a rolling hash algorithm could improve the hit rate (Rabin-Karp or something like that). It won't help here, tho: Rabin-Karp (and similar rolling hash implementations) are mostly alignment-agnostic, so they couldn't be used directly with bees, which only matches block-aligned data.
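
For context, the "rolling" part is just a constant-time window update, something like this (a generic sketch, nothing bees-specific; constants are arbitrary):

```cpp
#include <cstddef>
#include <cstdint>

// Generic polynomial rolling hash over a fixed-size window: sliding the
// window forward by one byte is O(1). Arithmetic wraps mod 2^64.
struct RollingHash {
    static constexpr uint64_t BASE = 257;
    size_t   window;      // window length in bytes
    uint64_t base_pow;    // BASE^(window-1), precomputed
    uint64_t value = 0;

    explicit RollingHash(size_t w) : window(w), base_pow(1)
    {
        for (size_t i = 1; i < w; ++i)
            base_pow *= BASE;
    }
    // Feed the first `window` bytes with push(), then slide with roll().
    void push(uint8_t in)              { value = value * BASE + in; }
    void roll(uint8_t out, uint8_t in) { value = (value - out * base_pow) * BASE + in; }
};
```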

I'm pretty sure @Zygo has already investigated that and come up with a good answer. :-)

If you want to dedupe/archive data using rolling hashes, I'd rather recommend looking into borg-backup (or similar solutions): it achieves a much higher deduplication rate than archiving tar files on btrfs ever could.

kakra avatar Aug 14 '19 14:08 kakra

BTW: Thinking about it, variable chunking (instead of static/block-based chunking) should enable bees to work around the boundary-shifting problem introduced by the tar files you mentioned. It should be possible to create an extent just as big as the tar header, which would shift the following data back to an extent boundary suitable for proper dedupe. I'm not sure how easy such a thing would be to implement, tho. And it would create slack space from extents not filling an FS block fully; with a lot of those, you may waste more space than you'd save.
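
To illustrate the idea: with content-defined chunking you cut wherever the hash hits a fixed bit pattern, so identical content gets identical boundaries no matter how far a tar header shifted it (a simplified generic sketch, nothing btrfs-specific):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified content-defined chunking: declare a boundary wherever the low
// bits of a running hash are zero. Average chunk size is roughly 2^MASK_BITS
// bytes. Real implementations use a fixed sliding window (Rabin fingerprint,
// buzhash); this only shows why the chunks are unaligned and variable-sized,
// which is what clashes with btrfs' block-aligned dedupe.
std::vector<size_t> chunk_boundaries(const uint8_t *data, size_t len)
{
    constexpr int MASK_BITS = 13;                   // ~8 KiB average chunks
    constexpr uint64_t MASK = (1ULL << MASK_BITS) - 1;
    std::vector<size_t> cuts;
    uint64_t h = 0;
    for (size_t i = 0; i < len; ++i) {
        h = h * 257 + data[i];
        if ((h & MASK) == 0) {                      // boundary depends only on content
            cuts.push_back(i + 1);
            h = 0;
        }
    }
    return cuts;
}
```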

kakra avatar Aug 14 '19 14:08 kakra

Uses fixed-sized (after initial size is set) block-based dedupe

Other dedupe agents use a fixed block size with a variable hash table size. bees uses a fixed hash table size with a variable effective block size.

The hash block size is always 4K, but bees will not store every single 4K block in the hash table, only a sample of them. When a matching pair of blocks is found, bees searches adjacent blocks in the files for more matching data. So if you have two copies of a 1MB file, bees only needs a hash table entry for one 4K block out of the 1MB to dedupe the entire file, and the other 255 hash table entries are not necessary. This allows bees to discard hash table entries without adversely affecting dedupe hit rates. When the fixed-size hash table fills up, bees discards the least useful 4K block hashes (preferring new hashes over old ones, duplicate hashes over unique ones, with random sampling to break ties).

The "effective" block size is the average distance between the 4K blocks stored in the bees hash table. In other words, it is the amount of data that the 4K block hash represents--the 4K block itself, and its neighbors whose entries were removed from the hash table. This concept is only useful when trying to compare bees to things like ZFS or duperemove which use a fixed block size and don't delete hash table entries, so a 128K block hash always represents exactly 128K of data aligned to a 128K boundary. Even when the effective block sizes are equal, bees gets a higher dedupe hit rate for a given block size because the bees alignment constraint is always 4K no matter how large the effective block size is.

Would not dedupe data stored as both a standalone file and in a tar because the tar header offset would not make any blocks match

(uncompressed) tar files are block aligned. Tape drives use specific block sizes in hardware, and tar was originally intended to write to them directly; however, the default tar block size is 512 bytes, not 4K as in btrfs. Still, there's a 1 in 8 chance that a file's tar header lines up to a 4K block at random, and if the file is also 4K or larger, bees can dedupe it (maybe not the EOF block, but all the other blocks). Same for ISO9660 CD/DVD images, where the block size is 2K and there's a 50% chance of 4K alignment. You can also tell tar to write files with a 4K block size.

bees sometimes gets a dedupe hit in email attachments in mbox files. tar files are easy. ;)

But I wonder if a rolling hash algorithm could improve the hit rate

There's no point if btrfs doesn't support unaligned dedupe. Aligned hashes are much easier and there are various shortcuts and optimizations possible that can't be done without the alignment constraint. If anyone implements unaligned dedupe in btrfs, I'll consider it, but not before. ;)

One thing that can improve the hit rate is to select hashes to discard based on how unique they are. e.g. in the 1MB duplicate file example, maybe there's a 4K block like a header that appears in every file, and another 4K block that occurs only in completely identical files (this is fairly common for media files like JPEGs where there's a big data table at the start that is constant for all images taken by the camera, and then unique image data after that). We want to discard the 4K blocks that are part of shorter matches in favor of 4K blocks that are part of longer matches. I haven't found a good way to do that, though, and the benefits seem to be very small.

Zygo avatar Aug 14 '19 16:08 Zygo

You can also tell tar to write files with a 4K block size.

Or...not? Maybe I'm thinking of cpio for the block size parameter and tar for the padding alignment.

Zygo avatar Aug 14 '19 16:08 Zygo