Investigating Tree Hashing in Lodestar
Current Hashing Approach
Lodestar's architecture relies heavily on maintaining a full merkle tree of the beacon state. We represent the tree as a linked data structure where each node is 1. immutable and 2. lazily computes, then caches, the hash of its children.
This allows us to minimize the number of hashes we need to perform during the state transition. Since all hashes of a pre-state are maintained, only the paths through the tree to the "diff" need to be re-hashed, which reduces the computational cost of hashing.
It also allows us to perform structural sharing: beacon states with shared subtrees share the underlying memory. For example, between epochs in a sync period, where sync committees remain constant, if we maintain references to two beacon states, both states will share the same underlying subtrees for the sync committees. This reduces the memory cost of maintaining several related states.
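A minimal sketch of the idea (illustrative only: the real implementation lives in `@chainsafe/persistent-merkle-tree` and differs in detail; `digest64` is assumed here to be as-sha256's 64-byte-input hash):

```ts
import {digest64} from "@chainsafe/as-sha256";

// Sketch: immutable nodes that lazily compute and cache their 32-byte root.
abstract class Node {
  abstract root(): Uint8Array;
}

class LeafNode extends Node {
  constructor(private readonly data: Uint8Array) {
    super();
  }
  root(): Uint8Array {
    return this.data; // a leaf's root is its 32 bytes of data
  }
}

class BranchNode extends Node {
  private cachedRoot: Uint8Array | null = null;
  constructor(readonly left: Node, readonly right: Node) {
    super();
  }
  root(): Uint8Array {
    if (this.cachedRoot === null) {
      // a parent's hash input is always the two 32-byte child roots = 64 bytes
      const input = new Uint8Array(64);
      input.set(this.left.root(), 0);
      input.set(this.right.root(), 32);
      this.cachedRoot = digest64(input);
    }
    return this.cachedRoot;
  }
}

// "Modifying" the tree never mutates nodes: only the path from the changed
// child up to the root is rebuilt, and every untouched subtree is shared.
function withLeft(parent: BranchNode, newLeft: Node): BranchNode {
  return new BranchNode(newLeft, parent.right);
}
```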
Hashing Function
We use as-sha256 with several optimizations:
- We rely on hash inputs always being 64 bytes, which allows us to precompute part of the sha256 internals for a decent gain
  - https://github.com/ChainSafe/as-sha256/pull/51
  - https://github.com/ChainSafe/as-sha256/pull/59
- We avoid allocations inside the library, only using fixed input/output buffers
- Related hashing perf analysis: https://github.com/ChainSafe/lodestar/issues/2206
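To make the 64-byte assumption concrete, here is a sketch of why part of the work can be precomputed (illustrative only, not the library's actual code):

```ts
// For a fixed 64-byte input, SHA-256 always processes exactly two 512-bit blocks:
//   block 1: the 64 input bytes (the two child roots)
//   block 2: constant padding: 0x80, zeros, then the message bit length (512)
//            big-endian in the last 8 bytes
// Because block 2 never changes, its expanded message schedule can be computed
// once up front and re-used for every hash.
const PADDING_BLOCK = new Uint8Array(64);
PADDING_BLOCK[0] = 0x80;
// 512 bits = 0x0000000000000200, big-endian in bytes 56..63
PADDING_BLOCK[62] = 0x02;
PADDING_BLOCK[63] = 0x00;
```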
Hash Cache Representation
Coming from a systems-language intuition, the memory usage of each type of object in JavaScript is not very efficient. We do NOT store hash objects as 32-byte Uint8Arrays: a 32-byte Uint8Array takes 223 bytes in total, because a bunch of pointers and additional bookkeeping is stored behind the scenes.
Instead, we store hash objects as plain objects with 8 uint32 numbers, e.g. {h0: 0, h1: 0, ..., h7: 0}. This takes somewhere between 88 and 216 bytes, depending on the sizes of the individual component numbers: smaller numbers are represented as Smi (small integer) immediate values, while larger numbers are stored on the heap. In practice, this happens TODO.
- https://github.com/ChainSafe/as-sha256/pull/58
- https://github.com/ChainSafe/lodestar/issues/2885
- https://github.com/ChainSafe/ssz/pull/219
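To make the layout concrete, here is a minimal sketch of the hash object shape and the byte conversions. as-sha256 ships similar helpers, but the function names, field packing, and endianness below are illustrative assumptions, not the library's exact API:

```ts
// Hash stored as 8 uint32 fields instead of a 32-byte Uint8Array.
interface HashObject {
  h0: number; h1: number; h2: number; h3: number;
  h4: number; h5: number; h6: number; h7: number;
}

function bytesToHashObject(bytes: Uint8Array): HashObject {
  const view = new DataView(bytes.buffer, bytes.byteOffset, 32);
  return {
    h0: view.getUint32(0, true),
    h1: view.getUint32(4, true),
    h2: view.getUint32(8, true),
    h3: view.getUint32(12, true),
    h4: view.getUint32(16, true),
    h5: view.getUint32(20, true),
    h6: view.getUint32(24, true),
    h7: view.getUint32(28, true),
  };
}

function hashObjectToBytes(obj: HashObject, out: Uint8Array): void {
  const view = new DataView(out.buffer, out.byteOffset, 32);
  view.setUint32(0, obj.h0, true);
  view.setUint32(4, obj.h1, true);
  view.setUint32(8, obj.h2, true);
  view.setUint32(12, obj.h3, true);
  view.setUint32(16, obj.h4, true);
  view.setUint32(20, obj.h5, true);
  view.setUint32(24, obj.h6, true);
  view.setUint32(28, obj.h7, true);
}
```

Note that any field whose value is at or above the Smi range ends up as a heap-allocated number, which is why the per-object size varies between 88 and 216 bytes.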
How to improve?
The hashing speed in lodestar is quite low compared to other implementations.
- Investigate the hashtree implementation, which performs multiple hashes at once.
- Investigate using a systems-language implementation, e.g. using napi-rs. This could be tried with popular libraries, which likely attempt some hardware acceleration. It could also be tried with a transliteration of as-sha256.
The hashes held by a lodestar node account for a large share of a running beacon node's memory, and our memory usage, measured per hash, is still very large compared to systems languages.
- Investigate using a systems-language implementation, e.g. using napi-rs.
- Investigate using an alternative memory layout in JavaScript.
Results
Some experiments were made; results are below.
cayman/hash-object
- This branch has some experiments using napi-rs to store hash objects "in rust" and using napi-rs for hashing
- `hashtree` does exceptionally well operating on large `Uint8Array`s, not as well on rust `HashObject`s (see the `hashtree uint8array` row). (The `hashtree` code has since been pulled into a repo here: `@chainsafe/hashtree-js`)
- Using rust libraries for hashing was slower (see the `rust` row); a rust port of as-sha256 was also slower (see the `rust object rs-sha256` row)
node -r ts-node/register node_modules/.bin/benchmark packages/ssz/test/perf/hash.test.ts
✓ hashing - hashtreeOne 1643.022 ops/s 608.6346 us/op - 45 runs 3.24 s
✓ hashing - hashtree 1564.410 ops/s 639.2188 us/op - 25 runs 2.13 s
✓ hashing - hashtree uint8array 46076.71 ops/s 21.70294 us/op - 468 runs 1.52 s
✓ hashing - rust 1789.679 ops/s 558.7596 us/op - 27 runs 2.03 s
✓ hashing - rust object rs-sha256 1438.692 ops/s 695.0758 us/op - 21 runs 2.01 s
✓ hashing - rust object as-sha256 550.4754 ops/s 1.816612 ms/op - 7 runs 1.87 s
✓ hashing - as-sha256 2276.654 ops/s 439.2412 us/op - 10 runs 0.959 s
Unfortunately, a rust `HashObject` is more expensive, memory-wise, than the status quo.
node -r ts-node/register --expose-gc packages/ssz/test/memory/hash.test.ts
hash-object - rust - 1 - 101.7 bytes / instance
hash-object - js - 1 - 91.8 bytes / instance
cayman/hash-cache
- This branch replaces hash objects ({h0, h1, ..., h7}) with a hash cache and hash ids (a Smi reference to a memory offset); the idea is sketched below
- Unfortunately it is slower and heavier :(.
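A rough sketch of the hash cache idea (a reconstruction from the description above, not code from the branch; the class, growth strategy, and endianness are illustrative):

```ts
// Sketch: instead of one {h0..h7} object per node, all hashes live in one flat
// Uint32Array (the "hash cache"), and each node only stores a small integer id
// (a Smi) that maps to an offset into that array.
class HashCache {
  private data = new Uint32Array(8 * 1024); // 8 uint32 words per hash, grows as needed
  private next = 0;

  alloc(): number {
    if (this.next * 8 >= this.data.length) {
      const bigger = new Uint32Array(this.data.length * 2);
      bigger.set(this.data);
      this.data = bigger;
    }
    return this.next++; // the "hash id"
  }

  // write a 32-byte root into slot `id`
  set(id: number, root: Uint8Array): void {
    const view = new DataView(root.buffer, root.byteOffset, 32);
    for (let i = 0; i < 8; i++) {
      this.data[id * 8 + i] = view.getUint32(i * 4, true);
    }
  }

  // read slot `id` back out as 32 bytes
  get(id: number, out: Uint8Array): void {
    const view = new DataView(out.buffer, out.byteOffset, 32);
    for (let i = 0; i < 8; i++) {
      view.setUint32(i * 4, this.data[id * 8 + i], true);
    }
  }
}
```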
node -r ts-node/register node_modules/.bin/benchmark packages/ssz/test/perf/eth2/hashTreeRoot.test.ts
hash cache
✓ hashTreeRoot Attestation - struct 37929.07 ops/s 26.36500 us/op - 16753 runs 0.505 s
✓ hashTreeRoot Attestation - tree 46307.02 ops/s 21.59500 us/op - 22688 runs 0.912 s
✓ hashTreeRoot SignedAggregateAndProof - struct 25187.65 ops/s 39.70200 us/op - 13416 runs 0.606 s
✓ hashTreeRoot SignedAggregateAndProof - tree 30281.92 ops/s 33.02300 us/op - 14954 runs 1.01 s
✓ hashTreeRoot SyncCommitteeMessage - struct 96609.02 ops/s 10.35100 us/op - 72446 runs 0.808 s
✓ hashTreeRoot SyncCommitteeMessage - tree 136072.9 ops/s 7.349000 us/op - 43560 runs 0.611 s
✓ hashTreeRoot SignedContributionAndProof - struct 35928.57 ops/s 27.83300 us/op - 26315 runs 0.808 s
✓ hashTreeRoot SignedContributionAndProof - tree 45607.95 ops/s 21.92600 us/op - 12233 runs 0.505 s
✓ hashTreeRoot SignedBeaconBlock - struct 374.1080 ops/s 2.673025 ms/op - 223 runs 1.13 s
✓ hashTreeRoot SignedBeaconBlock - tree 510.8943 ops/s 1.957352 ms/op - 82 runs 1.42 s
✓ hashTreeRoot Validator - struct 58159.82 ops/s 17.19400 us/op - 43131 runs 0.808 s
✓ hashTreeRoot Validator - tree 54773.51 ops/s 18.25700 us/op - 54961 runs 1.11 s
master
✓ hashTreeRoot Attestation - struct 37378.99 ops/s 26.75300 us/op - 16472 runs 0.505 s
✓ hashTreeRoot Attestation - tree 48546.05 ops/s 20.59900 us/op - 14022 runs 0.404 s
✓ hashTreeRoot SignedAggregateAndProof - struct 24892.96 ops/s 40.17200 us/op - 11015 runs 0.505 s
✓ hashTreeRoot SignedAggregateAndProof - tree 32836.41 ops/s 30.45400 us/op - 10084 runs 0.404 s
✓ hashTreeRoot SyncCommitteeMessage - struct 102343.7 ops/s 9.771000 us/op - 56987 runs 0.606 s
✓ hashTreeRoot SyncCommitteeMessage - tree 148104.3 ops/s 6.752000 us/op - 42104 runs 0.404 s
✓ hashTreeRoot SignedContributionAndProof - struct 36937.17 ops/s 27.07300 us/op - 9497 runs 0.303 s
✓ hashTreeRoot SignedContributionAndProof - tree 47938.64 ops/s 20.86000 us/op - 11127 runs 0.303 s
✓ hashTreeRoot SignedBeaconBlock - struct 383.4154 ops/s 2.608137 ms/op - 268 runs 1.24 s
✓ hashTreeRoot SignedBeaconBlock - tree 530.1876 ops/s 1.886125 ms/op - 283 runs 1.20 s
✓ hashTreeRoot Validator - struct 59245.22 ops/s 16.87900 us/op - 78104 runs 1.41 s
✓ hashTreeRoot Validator - tree 85623.77 ops/s 11.67900 us/op - 55251 runs 0.707 s
node -r ts-node/register --expose-gc packages/ssz/test/memory/eth2Objects.test.ts
hash cache
Attestation struct - 1421.0 bytes / instance
Attestation tree - 5272.3 bytes / instance
SignedAggregateAndProof struct - 2077.4 bytes / instance
SignedAggregateAndProof tree - 8099.1 bytes / instance
AggregationBits struct - 255.9 bytes / instance
AggregationBits tree - 1383.0 bytes / instance
SignedBeaconBlockPhase0 struct - 131438.2 bytes / instance
SignedBeaconBlockPhase0 tree - 484786.4 bytes / instance
master
Attestation struct - 1421.2 bytes / instance
Attestation tree - 3028.2 bytes / instance
SignedAggregateAndProof struct - 2077.7 bytes / instance
SignedAggregateAndProof tree - 4679.8 bytes / instance
AggregationBits struct - 265.5 bytes / instance
AggregationBits tree - 941.2 bytes / instance
SignedBeaconBlockPhase0 struct - 131435.5 bytes / instance
SignedBeaconBlockPhase0 tree - 277766.9 bytes / instance
cayman/napi-merkle-node
- This branch pulls "BranchNode" and "LeafNode" into rust and exposes a napi wrapper `Node` to javascript. This design allows most of the tree to exist in rust, with napi pointers into the tree as navigation demands.
- Current testing shows this to be extremely slow, with the beacon node unable to stay synced. Have not had time to diagnose.
An interesting talk about sha256 batch hashing for merkle trees: https://www.youtube.com/watch?v=NfK4np15E64
There are 2 implementations for now:
- SIMD, which is 2x the current as-sha256: https://github.com/ChainSafe/ssz/issues/356
- hashtree, which is 20x the current as-sha256, as shown above
There are 2 ways we can do the batch hash:
- We can traverse from the root, grouping hash computations at the same level, and then execute them in batch from the bottom level up (see the execution sketch after this list):
type HashComputation = {
  src0: Node;
  src1: Node;
  dest: Node;
};

// On BranchNode: recursively collect pending hash computations, grouped by level.
getHashComputation(level: number, hashCompsByLevel: Map<number, HashComputation[]>): void {
  if (this.h0 === null) {
    // this node's hash is not cached yet: record a computation for it
    let hashComputations = hashCompsByLevel.get(level);
    if (hashComputations === undefined) {
      hashComputations = [];
      hashCompsByLevel.set(level, hashComputations);
    }
    hashComputations.push({src0: this.left, src1: this.right, dest: this});
    if (!this.left.isLeaf()) {
      (this.left as BranchNode).getHashComputation(level + 1, hashCompsByLevel);
    }
    if (!this.right.isLeaf()) {
      (this.right as BranchNode).getHashComputation(level + 1, hashCompsByLevel);
    }
    return;
  }
  // else stop the recursion, LeafNode should have h0
}
This traversal takes <10% of our current hash time.
- Or we can do it during `ViewDU.commit()`, which has the rebind inside. Need to take a closer look at this, but this approach has more potential because we don't waste another traversal there.
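As a sketch of how the grouped computations from the first option could then be executed: run them deepest-level-first, so every batch only consumes hashes produced by earlier batches. `batchHashInto` is a hypothetical stand-in for a SIMD/hashtree-backed batch hasher, not an existing API:

```ts
// Hypothetical batch hasher: hashes each (src0, src1) pair and writes the result
// into dest. In practice this would be backed by hashtree/SIMD.
declare function batchHashInto(computations: HashComputation[]): void;

function executeByLevel(hashCompsByLevel: Map<number, HashComputation[]>): void {
  // Higher level number = deeper in the tree, so hash the deepest level first;
  // every dest produced at level N+1 is then ready to be used as a src at level N.
  const levels = [...hashCompsByLevel.keys()].sort((a, b) => b - a);
  for (const level of levels) {
    batchHashInto(hashCompsByLevel.get(level)!);
  }
}
```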
In-progress POC to implement batch hashing in ssz/lodestar: https://hackmd.io/zj9N5RIqQfCqYz8Y1Xc_hA?view
With this branch, te/hashtree_hasher (https://github.com/ChainSafe/lodestar/blob/63d3b6a022527f9ee202cf76aaaa778fadbb8316/packages/cli/src/hashtree.ts#L18), it consumes ~1GB more heap memory in lodestar.
cc @wemeetagain @matthewkeil
update: resolved by not extracting uint32 values from the Uint32Array, which was what caused the issue
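My reading of that fix, as a sketch (an assumption about the mechanism, consistent with the Smi/heap-number behaviour described in the hash cache representation section above; the real code path is in the branch linked above):

```ts
// Values read out of a Uint32Array become plain JS numbers; once stored in an
// object property or regular array, anything at or above the Smi range is
// boxed as a heap number - roughly half of all hash words.
function toHashObjectFields(cache: Uint32Array, offset: number): {h0: number; h1: number} {
  return {h0: cache[offset], h1: cache[offset + 1]}; // large values get heap-allocated
}

// Keeping the words inside the typed array and passing views/offsets around
// avoids materializing them as boxed numbers at all.
function rootView(cache: Uint32Array, offset: number): Uint32Array {
  return cache.subarray(offset, offset + 8); // view into the same buffer, no copy, no boxing
}
```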
Note to improve BeaconState.hashTreeRoot()
In order to execute this `HashComputation[][]` data structure from the bottom up, validators need to be hashed first, because they are represented as structs rather than trees, and we do that in batch to improve performance.
To do that there are 2 ways:
- When `commit()` also batch-hashes validators and stores them to `BranchNodeStructs`. This is a little bit unconventional, but it keeps the design of `commit()`: we can do multiple `commit()` calls and do `hashTreeRoot()` at the end. This is implemented in https://github.com/ChainSafe/lodestar/pull/6939
- After `commit()` we come up with a data structure like `{byLevel: HashComputation[][]; bottomNodes: Node[]}`, then the BeaconState type needs to implement `executeHashComputations()` by hashing `bottomNodes` in batch (see the sketch after this list). This is more straightforward but has some drawbacks:
  - After the `commit()` we need to compute roots for `bottomNodes` first. Then if consumers have multiple `commit()` calls, each `commit()` should be followed by `executeHashComputations()`
  - Or after the `commit()`, just save `bottomNodes` without computing their roots. Then either in the next `commit()` call or in `hashTreeRoot()` we have to compute roots for `bottomNodes`. This has the same drawbacks as in 1)
  - Or for each flow, we should have a single `commit()`. This is a big constraint for us in the future; tried this approach in `te/commit_not_compute_root`
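A sketch of the shape described in option 2 and the execution order it implies (a reconstruction of the description above, with hypothetical names; not code from either branch):

```ts
// Option 2: commit() returns the pending work instead of finishing it.
type PendingHashing = {
  byLevel: HashComputation[][]; // branch-node computations grouped by depth (0 = root)
  bottomNodes: Node[];          // struct-backed validator nodes whose roots are not hashed yet
};

// hypothetical batch hashers, standing in for hashtree/SIMD-backed implementations
declare function batchHashBottomNodes(nodes: Node[]): void;
declare function batchHashInto(computations: HashComputation[]): void;

function executeHashComputations(pending: PendingHashing): void {
  // 1. hash the struct-backed bottom nodes first (in batch), so every leaf
  //    referenced by byLevel has a root available
  batchHashBottomNodes(pending.bottomNodes);
  // 2. then run the grouped computations from the deepest level up to the root
  for (let level = pending.byLevel.length - 1; level >= 0; level--) {
    batchHashInto(pending.byLevel[level]);
  }
}
```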