
wip: hash consistent sorted trees

Open mikeal opened this issue 3 years ago • 7 comments

Still in a very early draft state, but I’ve written enough in this web editor that I’m no longer comfortable not having it committed to a branch :)

mikeal avatar Nov 11 '20 00:11 mikeal

Now that I’m staring at it all written out I’m realizing that there are some even better approaches we can take.

I’ve been using these simple tail addresses, which are great because they’re 100% consistent, so they don’t have much churn on branch merging. The problem is that they’re super vulnerable to attack. @mikolalysenko solved this by using a floating fingerprint, which is where all these ideas originated.

But once you spell out the problem the chunker is solving it’s a lot simpler than this. You’re already feeding it fingerprints, so you have a randomized distribution you can convert to an integer. Since the chunker is a state machine, all you need to do is introduce a few hard limits on max chunk sizes and reset the state machine whenever you close a chunk. You’d get the consistency benefits of my approach and the safety of the entropy you get from the fingerprinter.
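
A minimal sketch of that state machine, assuming the fingerprint is just a hash digest and that we close when its tail Uint32 falls under a threshold; the names, the threshold rule, and the hard-limit constant here are illustrative, not settled:

```js
// Illustrative sketch only: the chunker as a state machine fed one fingerprint
// (hash digest) per entry; closes on a tail-Uint32 threshold or a hard max size.
const MAX_UINT32 = 0xffffffff

const createChunker = (targetSize, maxSize = targetSize * 4) => {
  const threshold = Math.ceil(MAX_UINT32 / targetSize) // ~1/targetSize close probability
  let length = 0
  return (digest /* Uint8Array */) => {
    length++
    // interpret the last 4 bytes of the digest as an unsigned 32-bit integer
    const tail = new DataView(digest.buffer, digest.byteOffset + digest.byteLength - 4, 4).getUint32(0)
    const close = tail < threshold || length >= maxSize // hard limit on max chunk size
    if (close) length = 0 // reset the state machine whenever we close a chunk
    return close
  }
}
```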

mikeal avatar Nov 11 '20 05:11 mikeal

@vmx I just rewrote the whole doc. You can hold off on reading it for a bit; there’s a lot of churn right now.

mikeal avatar Nov 11 '20 20:11 mikeal

@ribasushi what do you think about this:

  • We drop all of the chunker’s settings except for a TARGET_SIZE and base all of our rules on that.
  • We always pull a Uint32 from the tail of the hash to keep things simple.
  • When a chunk reaches twice the target size we start applying overflow protection.

In overflow protection we still close on the same entries as we would otherwise, but we also compute a *sequence identity*. This is calculated by applying the hash function to the prior TARGET_SIZE hashes and the current hash. This gives us a unique identifier for each entry based on its placement. We convert the tail of this new hash to a Uint32 and compare it against the OVERFLOW_LIMIT.

The OVERFLOW_LIMIT is an integer that increases by an equal amount from 0 on every entry until it reaches MAX_UINT32. The increase in OVERFLOW_LIMIT on each entry is `Math.ceil(MAX_UINT32 / TARGET_SIZE)`.

This makes it highly difficult to generate sequential data that will keep the chunk open, given a sufficient TARGET_SIZE, while still producing relatively (although not completely) consistent chunk splits in overflow protection.
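
Roughly, in code (a sketch of the rule above; the helper names, the sha2-256 choice, the direction of the comparison, and starting the OVERFLOW_LIMIT at the 2x overflow point are assumptions on my part):

```js
// Sketch of the overflow-protection rule described above; names are illustrative.
const { createHash } = require('crypto')
const MAX_UINT32 = 0xffffffff

const tailUint32 = (digest) =>
  new DataView(digest.buffer, digest.byteOffset + digest.byteLength - 4, 4).getUint32(0)

// sequence identity: hash of the prior TARGET_SIZE digests plus the current one,
// which ties the value to the entry's placement rather than just its content
const sequenceIdentity = (digests) => {
  const hasher = createHash('sha256')
  for (const d of digests) hasher.update(d)
  return hasher.digest()
}

// returns true when overflow protection forces the chunk closed
const overflowClose = (digests, targetSize) => {
  const length = digests.length
  if (length < targetSize * 2) return false // overflow protection not active yet
  // OVERFLOW_LIMIT grows by an equal step on every entry, from 0 toward MAX_UINT32
  const step = Math.ceil(MAX_UINT32 / targetSize)
  const limit = Math.min((length - targetSize * 2) * step, MAX_UINT32)
  const window = digests.slice(-(targetSize + 1)) // prior TARGET_SIZE hashes + current
  return tailUint32(sequenceIdentity(window)) < limit
}
```

The normal close rule still applies throughout; this only adds an extra, increasingly likely close condition once a chunk has grown past twice the target size.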

mikeal avatar Nov 12 '20 02:11 mikeal

UPDATE: the note below was based on a misunderstanding; see the second point below.

Do we need the complication of OVERFLOW_LIMIT?

  • We always deduplicate (entries are unique)
  • ~We always sort~ (detail I was missing: we do sort, but in a user-defined way, NOT based on the content hash)
  • We use cryptographic hashing

This means that inserting two entries next to each other is exceedingly difficult (nearly preimage-attack level, albeit with lowered bit-ness).

Therefore: keep track of TARGET_SIZE, and whenever we cross it, instead of just taking the original hash of the node, byte-append the hash of the previous node and take the hash of that (cheap, since you can reinit state in sha2). You then get a new unpredictable hash for the same leaf, and for the leaves after it, until you break out of the attack, and things go back to normal.
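
In code, something like this (a sketch only; the sha2-256 choice, the append order, and the helper names are assumptions, not a fixed API):

```js
// Sketch of the hash-chaining idea above; names and append order are illustrative.
const { createHash } = require('crypto')
const MAX_UINT32 = 0xffffffff

const tailUint32 = (digest) =>
  new DataView(digest.buffer, digest.byteOffset + digest.byteLength - 4, 4).getUint32(0)

// past TARGET_SIZE, stop using the node's original hash directly: byte-append the
// previous node's hash and re-hash, so the value being tested is no longer one the
// writer of this entry can choose in isolation
const boundaryDigest = (entryDigest, prevDigest, chunkLength, targetSize) => {
  if (chunkLength <= targetSize) return entryDigest
  return createHash('sha256').update(entryDigest).update(prevDigest).digest()
}

const isBoundary = (digest, targetSize) =>
  tailUint32(digest) < Math.ceil(MAX_UINT32 / targetSize)
```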

ribasushi avatar Nov 12 '20 02:11 ribasushi

You have to keep in mind how large the address space is for not matching. If the target size is 1000, you have a 99.9% chance of not closing under the defined limit. It’s actually not impossible to find a sequence in which every rolling hash stays within these boundaries, especially if the target size is low, which we do want to support. Increasing the threshold for a match slowly closes the available address space for an attack, and since even a sequence that entirely matches would be evenly distributed, it still makes matches fairly consistent after a small number of inserts. A large number of inserts is less of a concern because the overflow into the next node will push new data into the sequence and make any prior sequential attacks highly likely to be cut off by new overflow matches at the front of the sequence.
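
For rough numbers behind that (my own back-of-the-envelope arithmetic, assuming a per-entry close probability of about 1/TARGET_SIZE under the normal rule and the linearly growing OVERFLOW_LIMIT described earlier):

```js
// probability a chunk stays open for `entries` entries under the normal rule
const pStayOpen = (targetSize, entries) => (1 - 1 / targetSize) ** entries

// with overflow protection, the close probability at overflow step i is roughly i / targetSize
const pStayOpenOverflow = (targetSize, entries) => {
  let p = 1
  for (let i = 1; i <= entries; i++) p *= Math.max(0, 1 - i / targetSize)
  return p
}

console.log(pStayOpen(1000, 1))           // ~0.999, the 99.9% per-entry figure
console.log(pStayOpen(1000, 5000))        // ~0.007, so long matching runs are not impossible
console.log(pStayOpenOverflow(1000, 100)) // ~0.005, collapses quickly once the limit grows
```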

mikeal avatar Nov 12 '20 02:11 mikeal

What's the path forward on this?

If it's a tractable amount of work / there's time for it, can it be finished soon (i.e. in the next week or two)?

If it's not something that is likely to get carried much further forward at this time (or in the next couple of weeks)... can we mark it very clearly as a draft (with a comment about when it's likely to be expanded on), and merge it that way?

warpfork avatar Feb 22 '21 16:02 warpfork

Ping: can we make this mergeable?

warpfork avatar Apr 11 '21 11:04 warpfork