
On-chain incremental Merkle tree append with reduced gas costs


The incremental merkle tree implementation in MerkleTree.sol requires a large amount of gas to perform an update, because all of the tree's nodes are stored on-chain.

I have two proposals which significantly reduce the cost of appending an item to the incremental merkle tree, by storing only the root on-chain and none of the intermediate levels or leaves.

Clients monitoring the contract would have to re-create the merkle tree from events emitted as each leaf is added. e.g. upon each append the contract emits a LeafAppend(offset, data) event; clients then retrieve every LeafAppend event and re-construct the merkle tree locally.
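
A minimal client-side sketch of that reconstruction, assuming the LeafAppend events have already been fetched (e.g. over RPC) and sorted by offset; the zero padding and SHA-256 here are placeholders for whatever rule and hash the contract actually uses:

```python
import hashlib

def hash_pair(left: bytes, right: bytes) -> bytes:
    # Placeholder pairwise hash; the real contract/circuit might use
    # SHA-256, LongsightF/MiMC, etc.
    return hashlib.sha256(left + right).digest()

def rebuild_tree(leaves: list) -> bytes:
    """Rebuild the merkle root from the ordered list of appended leaf hashes.

    `leaves` is the data from every LeafAppend(offset, data) event, sorted
    by offset. Odd-length levels are padded with a zero placeholder here;
    a real client would use the same 'synthetic item' rule as the contract
    (see the note on 'unique items' below).
    """
    assert leaves, "tree is empty"
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(b'\x00' * 32)  # placeholder / synthetic sibling
        level = [hash_pair(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```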

It would also be possible for light clients to delegate storage of the merkle tree to a verifiably honest third party, which would monitor the contract and provide proofs for any item; those proofs can be verified using the on-chain contract.

  1. Pre-compute the whole merkle tree, using 'unique elements' as placeholders for everything, and allow updates in-sequence by providing the element's proof and the new leaf hash.

  2. Require a proof of the last element in the tree

Note on 'unique items'

To use the structure as an incremental merkle tree, every position which doesn't yet hold a real item needs a 'synthetic and unique item' which can be computed quickly; these placeholders make it possible to create proofs involving items which don't exist yet, and they balance out the tree.

e.g. if you have a tree with one leaf (L), the second leaf is one of these synthetic items; it's a unique placeholder (X):

   r
 /   \
L     X

When you add another leaf, the synthetic item disappears and the real item (o) is used in its place:

   r
 /   \
L     o

These synthetic items allow you to make a merkle tree with a fixed depth (required for the prover circuit). With a fixed depth of 3 and a single real item, two synthetic items are required to balance out the tree:

     r
   /   \
  n     X
 /  \
L    X
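
One way such unique placeholders could be derived (an assumption for illustration, not necessarily the scheme MerkleTree.sol uses) is to hash the node's level and index, so every empty position gets a distinct value that anyone can recompute on demand:

```python
import hashlib

def synthetic_node(level: int, index: int) -> bytes:
    # Hypothetical derivation: each empty position gets a distinct value
    # derived from its (level, index), recomputable without being stored.
    return hashlib.sha256(b'synthetic:%d:%d' % (level, index)).digest()
```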

Precomputation

Precomputing the whole tree would need to store (2*N)-1 items for N leaves, e.g. 2<<31 leaves would require (2<<32)-1 total items in storage. However, this would only need to be computed once to find the root node, as all other items are predictable.

In this tree every leaf is synthetic, but every node is real (as it's derived from values, albeit synthetic ones), and the root must be real, e.g.

         r
     /       \
    n         n
  /   \     /   \
 X     X   X     X

The root and every node can be created using a deterministic number of items in-memory at any point in time; all unnecessary items (those which will never be used again) can be discarded, since they can be re-computed.
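
As a sketch of that (reusing the hypothetical synthetic_node and hash_pair helpers above), the root of the fully synthetic tree can be computed while holding at most one pending node per level in memory, discarding everything else as it goes:

```python
def precompute_root(depth: int) -> bytes:
    """Root of a tree whose 2**depth leaves are all synthetic, holding at
    most one pending node per level in memory (time is still O(2**depth))."""
    pending = {}  # level -> left-hand node waiting for its right sibling
    for index in range(2 ** depth):
        node, level = synthetic_node(0, index), 0
        while level in pending:
            # A left sibling is waiting at this level: combine the pair
            # and carry the parent up one level.
            node = hash_pair(pending.pop(level), node)
            level += 1
        pending[level] = node
    return pending[depth]
```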

The problem comes when verifying against the root, which requires recomputation of everything in the path and its dependencies; to avoid that cost you would have to store ((2*N)-1) * HashSize bytes on disk, and with 2<<27 items (268m) at a hash size of 32 bytes this requires 8GB of storage.

This also requires the stored tree to be updated whenever items are added, otherwise subsequent proofs would be based on incorrect data; however it could be updated incrementally, changing only one leaf and then recomputing the nodes on the upward path to the root.
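
A sketch of that incremental update, assuming the pre-computed tree is stored as a mapping keyed by (level, index) and reusing hash_pair from above:

```python
def update_leaf(nodes: dict, depth: int, index: int, value: bytes) -> bytes:
    """Set leaf `index` to `value` and recompute only its path to the root,
    assuming the whole tree is stored as nodes[(level, index)]."""
    nodes[(0, index)] = value
    for level in range(depth):
        index //= 2
        parent = hash_pair(nodes[(level, 2 * index)],
                           nodes[(level, 2 * index + 1)])
        nodes[(level + 1, index)] = parent
    return nodes[(depth, 0)]  # the new root
```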

The amount of storage vs the cost of re-computing the tree can be balanced by caching/storing the top levels of the tree and re-computing the remaining levels of the sub-tree below on demand. e.g. where C is a cached node, the number of items which need recomputing below it is 2^k, where k is the number of uncached levels; in the example below k is one, as only one level needs recomputing.

e.g. if the height is 3 and the number of cached levels is 2, then the number recomputed is 2^(3-2):

         r
     /       \
    n         C
  /   \
 X     X 

e.g. if the height is 29 and 15 levels are cached, the total size of the cache (for 32-byte items) is 1 MiB (32768 items) and the number of items requiring recomputation is 16384. With SHA2-256 this looks like a reasonable trade-off; however with LongsightF the field operations may make this many times more expensive.
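
A small helper to explore that trade-off (plain arithmetic, assuming 32-byte hashes):

```python
def cache_tradeoff(height: int, cached_levels: int, hash_size: int = 32):
    """Return (cache size in bytes, items recomputed per proof) when the top
    `cached_levels` levels of a height-`height` tree are stored."""
    cache_items = 2 ** cached_levels           # nodes on the cached boundary
    recompute = 2 ** (height - cached_levels)  # items below one cached node
    return cache_items * hash_size, recompute

# cache_tradeoff(29, 15) == (1048576, 16384), i.e. a 1 MiB cache and
# 16384 items to recompute, matching the figures above.
```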

The on-chain component knows the number of 'real' leaves, and requires a merkle proof of the next item (which doesn't exist yet) along with the new leaf value in order to re-compute the root; it doesn't need to store anything other than the root and the sequence (i.e. the count of real leaves).
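
A sketch of that on-chain logic, written in Python for readability rather than Solidity; the class and method names are illustrative, and hash_pair / synthetic_node are the hypothetical helpers sketched earlier:

```python
class RootOnlyAppender:
    """Model of a contract storing only the root and the count of real leaves."""

    def __init__(self, depth: int, precomputed_root: bytes):
        self.depth = depth
        self.root = precomputed_root  # root of the fully synthetic tree
        self.count = 0                # number of real leaves appended so far

    def append(self, new_leaf: bytes, path: list) -> bytes:
        index = self.count
        bits = [(index >> i) & 1 for i in range(self.depth)]

        # 1. The caller must prove the *synthetic* leaf at `index` against
        #    the current root, using the sibling path they supply.
        node = synthetic_node(0, index)
        for bit, sibling in zip(bits, path):
            node = hash_pair(sibling, node) if bit else hash_pair(node, sibling)
        assert node == self.root, "invalid proof of next (synthetic) leaf"

        # 2. The same path then yields the new root with the real leaf in place.
        node = new_leaf
        for bit, sibling in zip(bits, path):
            node = hash_pair(sibling, node) if bit else hash_pair(node, sibling)

        self.root = node
        self.count += 1
        return self.root
```

A client would call append(new_leaf, path) with the path taken from its locally reconstructed (or cached) copy of the tree.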

Append via Proof of End

This follows the incremental merkle tree model more closely and requires no precomputation; however the on-chain contract must be aware of which nodes are synthetic and which aren't, so that it can enforce the synthetic nodes' values and re-compute the ones from the previous proof which have changed as a result.

e.g. if there is a tree with 1 item and you want to add another, you provide a proof of the first item, from which the proof of the 2nd item, and therefore the new root, can be derived.

The advantage of this, versus the pre-computed tree, is that knowing the previous valid merkle path, the sequence number and how the synthetic nodes are created, you can add an item to the tree without knowing all of the other values. This makes it very efficient for 'light clients'.

See diagrams below for reference.

TODO: find simple algorithm to determine the common pairs from two merkle paths to verify that one was added immediately after the other. They must share at least 1 common node.

        r
     /     \
    n1      Y
  /   \   
 L1    X 

Proof of L1 is [0,0] [X,Y]

        r
     /     \
    n1      Y
  /   \  
 L1   L2

Proof of L2 is [1,0] [L1,Y]

        r
     /     \
    n1      n2
  /   \    /  \
 L1   L2  L3   Z

Proof of L3 is [0,1], [Z,n1]

        r
     /     \
    n1      n2
  /   \    /  \
 L1   L2  L3  L4

Proof of L4 is [1,1], [L3,n1]
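
For reference, the proofs above use an [address bits], [path] convention with the least-significant bit first; a generic checker (reusing the placeholder hash_pair from earlier) would verify e.g. the L3 example as verify_path(L3, [0, 1], [Z, n1], r):

```python
def verify_path(leaf: bytes, address_bits: list, path: list, root: bytes) -> bool:
    """Check a proof given as address bits plus sibling path: bit i is 1 when
    the node at level i is a right-hand child, i.e. its sibling path[i] sits
    on the left."""
    node = leaf
    for bit, sibling in zip(address_bits, path):
        node = hash_pair(sibling, node) if bit else hash_pair(node, sibling)
    return node == root
```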

Reducing size of proofs, by being 'synthetic aware'

Any synthetic node included in the path can be omitted if there is a deterministic algorithm to identify which nodes in the path are synthetic using only the sequence (total number of items) of the merkle tree.

e.g. with the following tree, providing a proof of L requires X and Y, where the address bits are [0, 0] and the path is [X, Y]. However, because the leaf X and node Y are synthetic the path can be empty and all that is required are the address bits - then the on-chain contract can re-create the synthetic parts. It would need to verify the synthetic items in the path anyway, otherwise tree proofs could potentially be faked.

       r
     /  \
    n    Y
  /   \
 L     X 

Because the synthetic items need to be verified anyway, the only question is whether the gas saved by excluding them from the msg.data of the function parameters outweighs the cost of keeping track of which items in the path are synthetic and which are real (e.g. if the first item is real and the second is synthetic, the supplied path would be [X, Z] with Y skipped).
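
One possible deterministic rule (an assumption for illustration, since the exact algorithm is left open here): with leaves appended strictly left to right, a path sibling is synthetic exactly when the sub-tree it covers contains no real leaf, which can be decided from the leaf count alone:

```python
def sibling_is_synthetic(count: int, leaf_index: int, level: int) -> bool:
    """True if the path sibling at `level` for leaf `leaf_index` is synthetic,
    given `count` real leaves (indices 0..count-1) appended left to right."""
    node_index = leaf_index >> level      # index of the path node at this level
    sibling_index = node_index ^ 1        # its sibling at the same level
    first_leaf_covered = sibling_index << level
    return first_leaf_covered >= count    # covers no real leaves => synthetic
```

In the tree above (count = 1), both siblings on the path of L are reported synthetic, so the supplied path could indeed be empty.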

Problems with both approaches

The problem with both approaches is that they only allow one insert into the merkle tree per block, as there will be a conflict if two people try to append at the same time (the second proof is built against a root which has already changed).

Unfortunately the way to fix this is to store the merkle tree on-chain.

It may be possible to use a side-chain which aggregates updates and submits them to the main-chain in one go; however IMO this adds an unnecessary component which can be interfered with, etc.

HarryR · Aug 19 '18 09:08