borg2: implement new chunker?
#8803 opens the door for a unique opportunity: re-chunking while doing a borg2 transfer (which will be required anyway for transferring archives from borg1 repos to borg2 repos).
So, if borg2 gets a new chunker before it is released, we could use it there and convert relatively painlessly.
Usually one can not easily switch to a new chunker within an existing repo:
- new-chunked chunks of identical files do not deduplicate with old-chunked ones
- thus, space usage doubles as long as old-chunked archives are present (== for a very long time in usual pruning scenarios)
Requirements for new chunker:
- little to no C code, rather Cython, Python. (*)
- better security properties than buzhash, see https://github.com/borgbackup/borg/wiki/CDC-issues-reported-2025
- not too slow, preferably similarly fast or faster than buzhash
- easier-to-maintain code (buzhash is too much C)
- could be a separate project (like borghash, borgstore, now borgchunk(er)?)
(*) that applies to the borg codebase; nothing against an external, well-maintained chunker library with more low-level code.
Existing chunkers in borg:
- `fixed` (fixed block size, relatively simple, fast, Python/Cython, can support sparse files efficiently)
- `buzhash` (variable block size, CDC, complex, hard to maintain C code, no sparse files support)
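For a sense of why the fixed chunker is the simple one, here is a minimal sketch of its core idea (the reader callable and block size are placeholders, and sparse file / fmap handling is omitted):

```python
# Minimal sketch of fixed-size chunking (placeholder interface, sparse file / fmap
# handling omitted): just slice the input stream into equal-sized blocks.
def chunk_fixed(read, block_size=4 * 1024 * 1024):
    while True:
        block = read(block_size)
        if not block:
            break
        yield block
```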
Chunker tickets:
- https://github.com/borgbackup/borg/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22c%3A%20chunker%22
- PR #5070
Speaking of performance impact. My old machine with Core i7-6600U and aes-ni support claims to compute:
- 405 MiB/s of SHA256 per `openssl speed -evp SHA256 -bytes 262144`
- 190 MiB/s of SHA256 per `sha256sum /dev/shm/urandom.2GiB`
- ~~77 M blocks/s of AES-128 per `openssl speed -evp AES-128-CBC -bytes 262144`~~
- 321 M block/s of AES-128 per `openssl speed -evp AES-128-OCB -bytes 262144`
- 224 MiB/s of SipHash12_1u64(Gear64(Last-level-cache))
- 1,344 MiB/s of Buzhash(Last-level-cache)
While SipHash12(Gear64()) is ≈6 times slower than Buzhash, it's still ≈3 times faster than AES(), so it might still be an option to consider as a fingerprinting-resistant chunker with respect to CDC-issues-reported-2025.
Also, SipHash-1-2 has imperfect avalanche properties: SMHasher reports maximal pair-wise bit-bias of 25%, but that's probably good enough given quasi-randomized input.
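To make that construction concrete, here is a rough pure-Python sketch of a SipHash12(Gear64()) style cut decision; the table, key, mask and the blake2b stand-in for SipHash-1-2 are all placeholders, not actual borg parameters:

```python
# Rough sketch of a SipHash12(Gear64()) style chunking decision (placeholders only;
# a real implementation would use an actual SipHash-1-2 and run in C/Cython).
import hashlib, os

GEAR_TABLE = [int.from_bytes(os.urandom(8), "little") for _ in range(256)]  # secret table
KEY = os.urandom(16)          # chunker secret / finalizer key
MASK = (1 << 20) - 1          # placeholder: ~1 MiB average chunk size

def keyed_finalize(key: bytes, h64: int) -> int:
    # Stand-in for SipHash-1-2 over the 8-byte rolling value (blake2b used only so
    # the sketch is runnable with the stdlib).
    d = hashlib.blake2b(h64.to_bytes(8, "little"), key=key, digest_size=8).digest()
    return int.from_bytes(d, "little")

def find_cut(data: bytes, min_size: int = 512 * 1024) -> int:
    h = 0
    for i, byte in enumerate(data):
        # Gear64 compression step: shift and add a secret per-byte constant
        h = ((h << 1) + GEAR_TABLE[byte]) & 0xFFFFFFFFFFFFFFFF
        # the cut decision is based on the keyed finalization of h, not on h itself
        if i + 1 >= min_size and (keyed_finalize(KEY, h) & MASK) == 0:
            return i + 1
    return len(data)
```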
For borg2, we'll use OCB mode, as that is AEAD and very fast:
tw@MacBook ~ % openssl speed -evp AES-256-OCB -bytes 262144
Doing AES-256-OCB ops for 3s on 262144 size blocks: 33707 AES-256-OCB ops in 3.00s
version: 3.5.0
The 'numbers' are in 1000s of bytes per second processed.
type 262144 bytes
AES-256-OCB 2945362.60k
So, almost 3GB/s?
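(Sanity check on that figure: 33707 ops × 262144 bytes ≈ 8.84 GB processed in 3.00 s, i.e. roughly 2.95 GB/s ≈ 2.74 GiB/s, consistent with the 2945362.60k line, so yes, just under 3 GB/s.)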
Please excuse my habit of high-context messages :-) I was speaking assuming the context of chunker parameter extraction attacks. BTW, the "Breaking and Fixing Content-Defined Chunking" paper is now available online, so the wiki page could be updated.
Indeed, AES-128-OCB does 321 M block/s and AES-256-OCB produces 234 M block/s on the very same machine. However, I think that you mean AES-256-OCB being used for data encryption & authentication, not as a part of chunker.
I'm counting AES blocks (not bytes) as a throughput metric because permutation should be done for the hash value of each and every byte of input that is chunked.
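Under that assumption (one 128-bit AES permutation per input byte), 321 M blocks/s corresponds to roughly 321 MB/s ≈ 306 MiB/s of chunked input, even though each AES block is 16 bytes wide.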
My bad, AES-128-CBC was the wrong way to estimate the number of AES blocks per second. I assume that happens due to CBC enforcing a data dependency between blocks, thus breaking pipelining.
It's interesting to see that the quite fast Buzhash-based chunker might still be a limiting factor, given 3.5 GiB/s of encryption throughput for AES-256-OCB and 4.8 GiB/s for AES-128-OCB.
Good news: in #8881 I started to refactor the chunkers and separated file reading, sparse file / fmap support from the chunking.
That shrank ChunkerFixed to something trivial, we can now differentiate between chunking time and reading/seeking time for benchmarking, and hopefully FileReader will also be useful to simplify the buzhash Chunker (and add sparse file / fmap support at the same time).
Chunker refactor is merged, thus it is now easier to add a new chunker.
@darkk please have a look at #8903.
buzhash64 is not using AES, so likely not that cryptographically strong, but using a 64bit hash was suggested by one of these recent CDC security papers. It seems to be about as fast as the 32bit buzhash we use in borg 1.x.
Also, the seed value (aka "chunker secret") can be much larger now, up to 64-256 bits.
I added optional support for encrypting the rolling hash value before the chunking decision is made. About 7x slower when enabled.
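Conceptually, the construction is something like this (a simplified sketch, not the actual implementation; the pyca/cryptography package and the mask value are used here purely for illustration):

```python
# Simplified sketch (not the actual code): compute the 64-bit rolling hash as usual,
# encrypt it with AES-ECB, and base the cut decision on the ciphertext instead of the
# raw hash value. pyca/cryptography and the mask are placeholders for illustration.
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

MASK = (1 << 20) - 1  # placeholder mask for the target average chunk size

def make_cut_decider(key16: bytes):
    enc = Cipher(algorithms.AES(key16), modes.ECB()).encryptor()
    def is_cut(h64: int) -> bool:
        block = h64.to_bytes(8, "little") + b"\0" * 8     # pad the hash to one AES block
        c = enc.update(block)                             # one AES permutation per decision
        return (int.from_bytes(c[:8], "little") & MASK) == 0
    return is_cut
```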
using a 64bit hash was suggested by one of these recent CDC security papers
If I understand the "Chunking Attacks on File Backup Services using Content-Defined Chunking" paper correctly, that effectively means that they need 781 "clashes" instead of 391 in this case. So, as far as I understand, the increase does not make the attack much harder; it just means twice as much data is needed to recover the full BuzTable[].
That correlates well with the ≈745 chunks in the Duplicacy paragraph of the "Breaking and Fixing Content-Defined Chunking" paper.
It's also interesting to note that only the second paper relies on the fact that the effective secret was just 32 bits. The first one decided to attack the algorithm as a whole, not a specific implementation.
buzhash64 is not using AES
I feel a bit uneasy about AES-ECB(Buzhash64) per se, as it's a keyed mapping from u64 to u128 (from buzhash value to AES block). It's unclear to me if it's safe to assume that the lower 20 bits are still well-distributed for CDC purposes given skewed input. However, they probably are.
I added optional support for encrypting the rolling hash value before the chunking decision is made. About 7x slower when enabled.
Yep, it's somewhat similar to the numbers I got above. I was recently looking into CDC algorithms as well and I have an idea to try. While SipHash12(Gear64) runs at 15.2 cycles/byte compared to 1.26 cycles/byte for pure Gear64 on the aforementioned 10-year-old CPU, i.e. about 12x slower... there is still room for vectorization!
Wild speculation follows:
Imagine either Gear-like or Buzhash-like compression function without substitution table, e.g.
hash32_next = (hash32 << 1) ^ u8_add
hash32_next = (hash32 <<< i) ^ u8_add ^ (u8_remove <<< f(i, win))
Imagine HalfSipHash() being the keyed finalization function. Add-Rotate-Xor functions are easy to vectorize, AVX2 is quite available, it operates on 256-bit vectors and it's capable of 8x u32 operations, so slowdown might be not that big...
However, I've not benchmarked that yet.
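A rough scalar (unvectorized) illustration of the idea, with placeholder rotation constants and a generic ARX mix standing in for HalfSipHash:

```python
# Rough scalar illustration of the idea above: a table-less, Buzhash-like compression
# step plus a keyed ARX finalization standing in for HalfSipHash. Rotation amounts,
# round count and key handling are placeholders, not a benchmarked proposal.
M32 = 0xFFFFFFFF

def rotl32(x: int, r: int) -> int:
    r %= 32
    return ((x << r) | (x >> (32 - r))) & M32

def roll(h: int, add_byte: int, remove_byte: int, win: int) -> int:
    # Buzhash-like step without a substitution table: rotate the state, xor in the
    # byte entering the window, xor out the byte that left it win steps ago.
    return rotl32(h, 1) ^ add_byte ^ rotl32(remove_byte, win)

def finalize(h: int, k0: int, k1: int) -> int:
    # Keyed add-rotate-xor finalization (HalfSipHash-style placeholder) mixing the
    # rolling value with the key before the chunking decision is made.
    v0, v1 = (h + k0) & M32, (h ^ k1) & M32
    for r in (5, 13, 7, 16):
        v0 = (v0 + v1) & M32
        v1 = rotl32(v1, r) ^ v0
    return (v0 ^ v1) & M32
```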
The biggest question that concerns me is the following: is a Poly-hash-then-Encrypt (or similar) construction even worth the implementation & maintenance effort, given that "sets of small files" are still quite fingerprintable? What do you think?
please have a look at https://github.com/borgbackup/borg/pull/8903
Will do.
"Chunking Attacks on File Backup Services using Content-Defined Chunking"
- section 3.2: N (window length) has a default size of 4095, but is not "hard-coded", can be set via `--chunker-params` to a range of uneven values. Guess that increases the search space?
- "Attack 3 will not work if compression is enabled, since any ambiguity factor in compression would be multiplicative across the c clashes." (from page 10)
- So, doubling the hash size makes it only slightly harder to attack, but enabling compression makes the attack too expensive. So users can just enable compression to protect against attack 3.
- section 3.5, page 15: "If we only worry about the case of compression enabled, then the only attack we have available is Attack 4. In this case, simply changing hashes to having 64-bit codomain instead of 32-bit would change the (2^10)^2 search in the algorithm to (2^42)^2, which should help protect against the attack." (so, if "having a 64bit codomain" means "using 64bit hashes", this is solved by buzhash64 chunker).
Yes, these "chunker attack" defenses don't solve the problem of fingerprinting sets of (un-chunked) small files - but we already have obfuscation for these (and also for chunked bigger files). I recently added the padme algorithm.
The new "buzhash64" chunker is now released in 2.0.0b18 in its basic form:
- algorithm very similar to the old "buzhash", but computing 64bit hash values
- the buzhash64 table is now perfectly balanced and initialized using a CSPRNG (sketched below)
- chunking decision is directly based on the buzhash values still (sum & mask)
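One way such a balanced table can be built (a sketch of the general idea, not necessarily borg's exact derivation): give every one of the 64 bit positions exactly 128 ones across the 256 entries, with a CSPRNG doing the shuffling:

```python
# Sketch of a "perfectly balanced" 64-bit buzhash table: every one of the 64 bit
# positions is set in exactly 128 of the 256 entries, shuffled by a CSPRNG.
# Illustrates the idea only; borg's actual table derivation may differ.
import secrets

def balanced_buzhash64_table() -> list[int]:
    rng = secrets.SystemRandom()
    table = [0] * 256
    for bit in range(64):
        column = [1] * 128 + [0] * 128      # exactly half ones per bit position
        rng.shuffle(column)
        for i, b in enumerate(column):
            table[i] |= b << bit
    return table
```

As far as I understand, the point of balancing is that no output bit of the table entries is biased toward 0 or 1, so the rolling hash bits stay well distributed even for skewed input.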
Ongoing work:
#8920 tries to encrypt the buzhash values before the chunking decision, but it is hard to make this fast (even with hw accelerated AES).
I recently added the padme algorithm
That's a beautiful design. It adds ≈3% overhead for megabyte-sized chunks while having the guarantees well described in the paper: "Reducing Metadata Leakage from Encrypted Files and Communication with PURBs" from PETS.
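For reference, the padded-size function from that paper is tiny (a direct transcription of the paper's pseudocode, not borg's code):

```python
# Padmé padded-size computation, transcribed from the PURBs paper (not borg's code).
def padme(L: int) -> int:
    if L < 2:
        return L
    E = L.bit_length() - 1        # floor(log2(L))
    S = E.bit_length()            # floor(log2(E)) + 1
    last_bits = E - S             # number of low bits forced to zero in the padded size
    bit_mask = (1 << last_bits) - 1
    return (L + bit_mask) & ~bit_mask

# e.g. padme(1_100_000) == 1_114_112: padded up to a multiple of 32 KiB,
# ~1.3% overhead here, ~3% in the worst case at this size.
```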
Do you think `obfuscate,250,lz4` makes sense as a default setting for 2.x, or do you consider it overkill?
@darkk Not sure if that should be default, can you start a github discussion about it? Doesn't really fit into the "new chunker" issue.