unusual blake2b usage, adopt blake3?

Open ThomasWaldmann opened this issue 7 months ago • 7 comments

This is from the master branch (same in 1.4-maint):

def blake2b_256(key, data):
    return hashlib.blake2b(key+data, digest_size=32).digest()

It is one of the chunkid hashes borg uses.

This usage is unusual; the Python docs suggest passing the key separately instead:

def blake2b_256(key, data):
    return hashlib.blake2b(data, key=key, digest_size=32).digest()

ThomasWaldmann, May 21 '25 22:05

This is used for the key:

def random_blake2b_256_key():
    # This might look a bit curious, but is the same construction used in the keyed mode of BLAKE2b.
    # Why limit the key to 64 bytes and pad it with 64 nulls nonetheless? The answer is that BLAKE2b
    # has a 128 byte block size, but only 64 bytes of internal state (this is also referred to as a
    # "local wide pipe" design, because the compression function transforms (block, state) => state,
    # and len(block) >= len(state), hence wide.)
    # In other words, a key longer than 64 bytes would have simply no advantage, since the function
    # has no way of propagating more than 64 bytes of entropy internally.
    # It's padded to a full block so that the key is never buffered internally by blake2b_update, ie.
    # it remains in a single memory location that can be tracked and could be erased securely, if we
    # wanted to.
    return os.urandom(64) + bytes(64)
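
The size argument in the code comment can be checked against hashlib directly. This is a minimal sketch, assuming Python's `hashlib.blake2b`; the helper name `prefix_mac` is hypothetical and only mirrors the `key+data` construction quoted at the top of this issue:

```python
import hashlib
import os

def random_blake2b_256_key():
    # 64 random bytes followed by 64 zero bytes, as in the function above.
    return os.urandom(64) + bytes(64)

def prefix_mac(key, data):
    # Hypothetical name: the key+data prefix construction from this issue.
    return hashlib.blake2b(key + data, digest_size=32).digest()

key = random_blake2b_256_key()

# The padded key fills exactly one BLAKE2b input block (128 bytes) ...
assert len(key) == hashlib.blake2b().block_size == 128
# ... while hashlib's keyed mode accepts at most 64 key bytes.
assert hashlib.blake2b.MAX_KEY_SIZE == 64

digest = prefix_mac(key, b"some chunk data")
assert len(digest) == 32
```

So the 128-byte key can only be used with the prefix construction as-is; for the keyed mode it would have to be cut down to its first 64 bytes.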

ThomasWaldmann, May 21 '25 22:05

>>> from hashlib import blake2b as b2
>>> key = b"x"*64
>>> padding = b"\0"*64
>>> data = b"fwefergegegwgewrgfqewfqe"

>>> b2(data, key=key, digest_size=32).hexdigest()
'e0aa425d342c56646a4f580e533d909819efc0d3462d27b84a878e27c65e7e02'

>>> b2(data, key=key+padding, digest_size=32).hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: maximum key length is 64 bytes

>>> b2(key+padding+data, digest_size=32).hexdigest()
'023009c0325e12fcadc54c1d911e7d39806b0dc6e790299f160e8093017d23e8'

>>> b2(key+data, digest_size=32).hexdigest()
'1a1621ce5f623fd940009eaa1188f78cc5773965105a7d099951edae433b8fb8'
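
The digests differ because keyed BLAKE2b is not simply a hash over the padded key followed by the data: per the BLAKE2 specification, the key length is also encoded in the parameter block, which changes the initial state. The sketch below repeats the comparison with the same inputs as a script and only asserts the relationships between the results; the variable names are illustrative:

```python
from hashlib import blake2b

key = b"x" * 64
padding = b"\0" * 64
data = b"fwefergegegwgewrgfqewfqe"

# Keyed mode, as suggested by the Python docs.
keyed = blake2b(data, key=key, digest_size=32).hexdigest()
# Prefix construction with the key padded to a full 128-byte block.
prefix_padded = blake2b(key + padding + data, digest_size=32).hexdigest()
# Prefix construction with the bare 64-byte key.
prefix_plain = blake2b(key + data, digest_size=32).hexdigest()

# All three constructions give different digests for the same key and data.
assert len({keyed, prefix_padded, prefix_plain}) == 3

# hashlib rejects keys longer than 64 bytes in keyed mode.
try:
    blake2b(data, key=key + padding, digest_size=32)
except ValueError as exc:
    print("rejected:", exc)
```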

ThomasWaldmann, May 21 '25 22:05

As this is the chunkid hash as well as the authentication MAC, we can't easily change it.

If there is a need to change it, the breaking borg2 release might be the only good point in time.

borg transfer would freshly encrypt/authenticate anyway, and, at least when re-chunking, it would also re-compute the chunkid.

ThomasWaldmann, May 21 '25 22:05

For borg transfer from 1.x blake2b_legacy repos to 2.0 blake2b repos, we'll need to support both the old way of computing the digest (for authenticating existing data) and the new way.

Also, transferring such repos must involve re-chunking, so that the chunk IDs are re-computed using the new way.
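
Supporting both ways side by side could look roughly like the sketch below. It is not borg's transfer code: the function names are hypothetical, it assumes the keyed variant would use only the 64 random bytes of the existing 128-byte key, and it treats the legacy digest as a MAC over the plain chunk data for simplicity:

```python
import hashlib
import hmac

def blake2b_256_legacy(key, data):
    # Old way (borg 1.x), as quoted at the top of this issue: key prepended to the data.
    return hashlib.blake2b(key + data, digest_size=32).digest()

def blake2b_256_keyed(key, data):
    # Hypothetical new way: BLAKE2b keyed mode.
    # Assumption: only the first 64 (random) key bytes are used, since
    # hashlib limits keys to 64 bytes.
    return hashlib.blake2b(data, key=key[:64], digest_size=32).digest()

def transfer_chunk(key, data, legacy_mac):
    # Hypothetical helper: authenticate with the old construction,
    # then compute the new chunk ID with the keyed construction.
    expected = blake2b_256_legacy(key, data)
    if not hmac.compare_digest(expected, legacy_mac):
        raise ValueError("legacy MAC mismatch")
    return blake2b_256_keyed(key, data)

# Usage with a fixed demo key (64 key bytes + 64 zero bytes of padding).
demo_key = bytes(range(64)) + bytes(64)
chunk = b"example chunk"
old_id = blake2b_256_legacy(demo_key, chunk)
new_id = transfer_chunk(demo_key, chunk, old_id)
assert new_id != old_id
```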

Just realized that if we change the ID hash and thus need re-chunking anyway for all blake2 stuff, we could also just drop blake2 completely and use the even faster blake3 (see discussion in #45).

ThomasWaldmann, May 28 '25 21:05

In the refactoring to use hashlib, the comment from the original code explaining the construction was lost: https://github.com/borgbackup/borg/pull/1819/files#diff-a69ff4c61022ef88812e2838be647dc0191849f0995a248e8a094dc9001af2c8R235

The linked PR also gives the rationale for not using the keyed mode directly: it was something of a novelty back then and not necessarily exposed by libraries (OpenSSL is cited).

enkore, May 30 '25 16:05

@enkore Thanks for digging this out.

From today's perspective, what do you think would be good for borg2?

  • Keep blake2b hashing compatible with borg 1.x, so chunk IDs stay compatible (but keep using that unusual code).
  • Use the blake2b API to pass the key separately and invoke the special keyed mode (IDs not compatible, but no big issue, as borg transfer could re-chunk / re-ID everything).
  • Remove blake2b from borg2 (except for reading borg 1.x repos) and adopt blake3 (which is a lot faster, but has a Rust dependency when building from source). This would also require re-chunking / re-IDing.

ThomasWaldmann, May 30 '25 21:05

ping?

ThomasWaldmann, Jun 09 '25 19:06