Investigating Tree Hashing in Lodestar
Current Hashing Approach
Lodestar's architecture relies heavily on maintaining a full merkle tree of the beacon state. We represent the tree as a linked data structure where each node is 1. immutable and 2. lazily computes, then caches, the hash of its children.
This allows us to minimize the number of hashes we need to perform during the state transition. Since all hashes of a pre-state are maintained, only the paths through the tree to the "diff" need to be re-hashed, which reduces the computational cost of hashing.
It also allows us to perform structural sharing: beacon states with shared subtrees share the underlying memory. For example, between epochs in a sync period, where sync committees remain constant, if we maintain references to two beacon states, both states will share the same underlying subtrees for the sync committees. This reduces the memory cost of maintaining several related states.
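A minimal sketch of the idea (illustrative only: the real implementation lives in `@chainsafe/persistent-merkle-tree` and differs in detail; `digest64` is assumed here to be as-sha256's 64-byte-input hash):

```ts
import {digest64} from "@chainsafe/as-sha256";

// Sketch: immutable nodes that lazily compute and cache their 32-byte root.
abstract class Node {
  abstract root(): Uint8Array;
}

class LeafNode extends Node {
  constructor(private readonly data: Uint8Array) {
    super();
  }
  root(): Uint8Array {
    return this.data; // a leaf's root is its 32 bytes of data
  }
}

class BranchNode extends Node {
  private cachedRoot: Uint8Array | null = null;
  constructor(readonly left: Node, readonly right: Node) {
    super();
  }
  root(): Uint8Array {
    if (this.cachedRoot === null) {
      // a parent's hash input is always the two 32-byte child roots = 64 bytes
      const input = new Uint8Array(64);
      input.set(this.left.root(), 0);
      input.set(this.right.root(), 32);
      this.cachedRoot = digest64(input);
    }
    return this.cachedRoot;
  }
}

// "Modifying" the tree never mutates nodes: only the path from the changed
// child up to the root is rebuilt, and every untouched subtree is shared.
function withLeft(parent: BranchNode, newLeft: Node): BranchNode {
  return new BranchNode(newLeft, parent.right);
}
```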
Hashing Function
We use as-sha256 with several optimizations:
- We rely on hash inputs always being 64 bytes, which allows us to precompute part of the sha256 internals for a decent gain
  - https://github.com/ChainSafe/as-sha256/pull/51
  - https://github.com/ChainSafe/as-sha256/pull/59
- We avoid allocations inside the library, only using fixed input/output buffers
- Related hashing perf analysis: https://github.com/ChainSafe/lodestar/issues/2206
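To make the 64-byte assumption concrete, here is a sketch of why part of the work can be precomputed (illustrative only, not the library's actual code):

```ts
// For a fixed 64-byte input, SHA-256 always processes exactly two 512-bit blocks:
//   block 1: the 64 input bytes (the two child roots)
//   block 2: constant padding: 0x80, zeros, then the message bit length (512)
//            big-endian in the last 8 bytes
// Because block 2 never changes, its expanded message schedule can be computed
// once up front and re-used for every hash.
const PADDING_BLOCK = new Uint8Array(64);
PADDING_BLOCK[0] = 0x80;
// 512 bits = 0x0000000000000200, big-endian in bytes 56..63
PADDING_BLOCK[62] = 0x02;
PADDING_BLOCK[63] = 0x00;
```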
Hash Cache Representation
Coming from a systems-language intuition, the memory usage of each type of object in JavaScript is not very efficient. We do NOT store hash objects as 32-byte Uint8Arrays: a 32-byte Uint8Array takes 223 bytes in total, because a bunch of pointers and additional bookkeeping is stored behind the scenes.
Instead, we store hash objects as plain objects with 8 uint32 numbers, e.g. {h0: 0, h1: 0, ..., h7: 0}. This takes somewhere between 88 and 216 bytes, depending on the sizes of the individual component numbers: smaller numbers are represented as Smi (small integer) immediate values, while larger numbers are stored on the heap. In practice, this happens TODO.
- https://github.com/ChainSafe/as-sha256/pull/58
- https://github.com/ChainSafe/lodestar/issues/2885
- https://github.com/ChainSafe/ssz/pull/219
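To make the layout concrete, here is a minimal sketch of the hash object shape and the byte conversions. as-sha256 ships similar helpers, but the function names, field packing, and endianness below are illustrative assumptions, not the library's exact API:

```ts
// Hash stored as 8 uint32 fields instead of a 32-byte Uint8Array.
interface HashObject {
  h0: number; h1: number; h2: number; h3: number;
  h4: number; h5: number; h6: number; h7: number;
}

function bytesToHashObject(bytes: Uint8Array): HashObject {
  const view = new DataView(bytes.buffer, bytes.byteOffset, 32);
  return {
    h0: view.getUint32(0, true),
    h1: view.getUint32(4, true),
    h2: view.getUint32(8, true),
    h3: view.getUint32(12, true),
    h4: view.getUint32(16, true),
    h5: view.getUint32(20, true),
    h6: view.getUint32(24, true),
    h7: view.getUint32(28, true),
  };
}

function hashObjectToBytes(obj: HashObject, out: Uint8Array): void {
  const view = new DataView(out.buffer, out.byteOffset, 32);
  view.setUint32(0, obj.h0, true);
  view.setUint32(4, obj.h1, true);
  view.setUint32(8, obj.h2, true);
  view.setUint32(12, obj.h3, true);
  view.setUint32(16, obj.h4, true);
  view.setUint32(20, obj.h5, true);
  view.setUint32(24, obj.h6, true);
  view.setUint32(28, obj.h7, true);
}
```

Note that any field whose value is at or above the Smi range ends up as a heap-allocated number, which is why the per-object size varies between 88 and 216 bytes.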
How to improve?
The hashing speed in lodestar is quite low compared to other implementations.
- Investigate the hashtree implementation, which performs multiple hashes at once.
- Investigate using a systems-language implementation, e.g. using napi-rs. This could be tried with popular libraries, which likely attempt some hardware acceleration. It could also be tried with a transliteration of as-sha256.
The hashes held by a lodestar node account for a large share of a running beacon node's memory, and our memory usage, measured per hash, is still very large compared to systems languages.
- Investigate using a systems-language implementation, e.g. using napi-rs.
- Investigate using an alternative memory layout in JavaScript.
Results
Some experiments were made; results are below.
cayman/hash-object
- This branch has some experiments using napi-rs to store hash objects "in rust" and using napi-rs for hashing
- `hashtree` does exceptionally well operating on large `Uint8Array`s, not as well on rust `HashObject`s (see the `hashtree uint8array` row). (The `hashtree` code has since been pulled into a repo here: `@chainsafe/hashtree-js`)
- Using rust libraries for hashing was slower (see the `rust` row); a rust port of as-sha256 was also slower (see the `rust object rs-sha256` row)
node -r ts-node/register node_modules/.bin/benchmark packages/ssz/test/perf/hash.test.ts
✓ hashing - hashtreeOne 1643.022 ops/s 608.6346 us/op - 45 runs 3.24 s
✓ hashing - hashtree 1564.410 ops/s 639.2188 us/op - 25 runs 2.13 s
✓ hashing - hashtree uint8array 46076.71 ops/s 21.70294 us/op - 468 runs 1.52 s
✓ hashing - rust 1789.679 ops/s 558.7596 us/op - 27 runs 2.03 s
✓ hashing - rust object rs-sha256 1438.692 ops/s 695.0758 us/op - 21 runs 2.01 s
✓ hashing - rust object as-sha256 550.4754 ops/s 1.816612 ms/op - 7 runs 1.87 s
✓ hashing - as-sha256 2276.654 ops/s 439.2412 us/op - 10 runs 0.959 s
Unfortunately, a rust `HashObject` is more expensive, memory-wise, than the status quo.
node -r ts-node/register --expose-gc packages/ssz/test/memory/hash.test.ts
hash-object - rust - 1 - 101.7 bytes / instance
hash-object - js - 1 - 91.8 bytes / instance
cayman/hash-cache
- This branch replaces hash objects ({h0, h1, ..., h7}) with a hash cache and hash ids (a Smi reference to a memory offset); the idea is sketched below
- Unfortunately it is slower and heavier :(.
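A rough sketch of the hash cache idea (a reconstruction from the description above, not code from the branch; the class, growth strategy, and endianness are illustrative):

```ts
// Sketch: instead of one {h0..h7} object per node, all hashes live in one flat
// Uint32Array (the "hash cache"), and each node only stores a small integer id
// (a Smi) that maps to an offset into that array.
class HashCache {
  private data = new Uint32Array(8 * 1024); // 8 uint32 words per hash, grows as needed
  private next = 0;

  alloc(): number {
    if (this.next * 8 >= this.data.length) {
      const bigger = new Uint32Array(this.data.length * 2);
      bigger.set(this.data);
      this.data = bigger;
    }
    return this.next++; // the "hash id"
  }

  // write a 32-byte root into slot `id`
  set(id: number, root: Uint8Array): void {
    const view = new DataView(root.buffer, root.byteOffset, 32);
    for (let i = 0; i < 8; i++) {
      this.data[id * 8 + i] = view.getUint32(i * 4, true);
    }
  }

  // read slot `id` back out as 32 bytes
  get(id: number, out: Uint8Array): void {
    const view = new DataView(out.buffer, out.byteOffset, 32);
    for (let i = 0; i < 8; i++) {
      view.setUint32(i * 4, this.data[id * 8 + i], true);
    }
  }
}
```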
node -r ts-node/register node_modules/.bin/benchmark packages/ssz/test/perf/eth2/hashTreeRoot.test.ts
hash cache
✓ hashTreeRoot Attestation - struct 37929.07 ops/s 26.36500 us/op - 16753 runs 0.505 s
✓ hashTreeRoot Attestation - tree 46307.02 ops/s 21.59500 us/op - 22688 runs 0.912 s
✓ hashTreeRoot SignedAggregateAndProof - struct 25187.65 ops/s 39.70200 us/op - 13416 runs 0.606 s
✓ hashTreeRoot SignedAggregateAndProof - tree 30281.92 ops/s 33.02300 us/op - 14954 runs 1.01 s
✓ hashTreeRoot SyncCommitteeMessage - struct 96609.02 ops/s 10.35100 us/op - 72446 runs 0.808 s
✓ hashTreeRoot SyncCommitteeMessage - tree 136072.9 ops/s 7.349000 us/op - 43560 runs 0.611 s
✓ hashTreeRoot SignedContributionAndProof - struct 35928.57 ops/s 27.83300 us/op - 26315 runs 0.808 s
✓ hashTreeRoot SignedContributionAndProof - tree 45607.95 ops/s 21.92600 us/op - 12233 runs 0.505 s
✓ hashTreeRoot SignedBeaconBlock - struct 374.1080 ops/s 2.673025 ms/op - 223 runs 1.13 s
✓ hashTreeRoot SignedBeaconBlock - tree 510.8943 ops/s 1.957352 ms/op - 82 runs 1.42 s
✓ hashTreeRoot Validator - struct 58159.82 ops/s 17.19400 us/op - 43131 runs 0.808 s
✓ hashTreeRoot Validator - tree 54773.51 ops/s 18.25700 us/op - 54961 runs 1.11 s
master
✓ hashTreeRoot Attestation - struct 37378.99 ops/s 26.75300 us/op - 16472 runs 0.505 s
✓ hashTreeRoot Attestation - tree 48546.05 ops/s 20.59900 us/op - 14022 runs 0.404 s
✓ hashTreeRoot SignedAggregateAndProof - struct 24892.96 ops/s 40.17200 us/op - 11015 runs 0.505 s
✓ hashTreeRoot SignedAggregateAndProof - tree 32836.41 ops/s 30.45400 us/op - 10084 runs 0.404 s
✓ hashTreeRoot SyncCommitteeMessage - struct 102343.7 ops/s 9.771000 us/op - 56987 runs 0.606 s
✓ hashTreeRoot SyncCommitteeMessage - tree 148104.3 ops/s 6.752000 us/op - 42104 runs 0.404 s
✓ hashTreeRoot SignedContributionAndProof - struct 36937.17 ops/s 27.07300 us/op - 9497 runs 0.303 s
✓ hashTreeRoot SignedContributionAndProof - tree 47938.64 ops/s 20.86000 us/op - 11127 runs 0.303 s
✓ hashTreeRoot SignedBeaconBlock - struct 383.4154 ops/s 2.608137 ms/op - 268 runs 1.24 s
✓ hashTreeRoot SignedBeaconBlock - tree 530.1876 ops/s 1.886125 ms/op - 283 runs 1.20 s
✓ hashTreeRoot Validator - struct 59245.22 ops/s 16.87900 us/op - 78104 runs 1.41 s
✓ hashTreeRoot Validator - tree 85623.77 ops/s 11.67900 us/op - 55251 runs 0.707 s
node -r ts-node/register --expose-gc packages/ssz/test/memory/eth2Objects.test.ts
hash cache
Attestation struct - 1421.0 bytes / instance
Attestation tree - 5272.3 bytes / instance
SignedAggregateAndProof struct - 2077.4 bytes / instance
SignedAggregateAndProof tree - 8099.1 bytes / instance
AggregationBits struct - 255.9 bytes / instance
AggregationBits tree - 1383.0 bytes / instance
SignedBeaconBlockPhase0 struct - 131438.2 bytes / instance
SignedBeaconBlockPhase0 tree - 484786.4 bytes / instance
master
Attestation struct - 1421.2 bytes / instance
Attestation tree - 3028.2 bytes / instance
SignedAggregateAndProof struct - 2077.7 bytes / instance
SignedAggregateAndProof tree - 4679.8 bytes / instance
AggregationBits struct - 265.5 bytes / instance
AggregationBits tree - 941.2 bytes / instance
SignedBeaconBlockPhase0 struct - 131435.5 bytes / instance
SignedBeaconBlockPhase0 tree - 277766.9 bytes / instance
cayman/napi-merkle-node
- This branch pulls "BranchNode" and "LeafNode" into rust and exposes a napi wrapper `Node` to javascript. This design allows most of the tree to exist in rust, with napi pointers into the tree as navigation demands.
- Current testing shows this to be extremely slow, with the beacon node unable to stay synced. Have not had time to diagnose.
An interesting talk about sha256 batch hashing for merkle trees: https://www.youtube.com/watch?v=NfK4np15E64
There are 2 implementations for now:
- SIMD, which is 2x the current as-sha256: https://github.com/ChainSafe/ssz/issues/356
- hashtree, which is 20x the current as-sha256, as shown above
There are 2 ways we can do the batch hash:
- We can traverse from the root, grouping hash computations at the same level, and then execute them in batch from the bottom level up (see the execution sketch after this list):
type HashComputation = {
  src0: Node;
  src1: Node;
  dest: Node;
};

// On BranchNode: recursively collect pending hash computations, grouped by level.
getHashComputation(level: number, hashCompsByLevel: Map<number, HashComputation[]>): void {
  if (this.h0 === null) {
    // this node's hash is not cached yet: record a computation for it
    let hashComputations = hashCompsByLevel.get(level);
    if (hashComputations === undefined) {
      hashComputations = [];
      hashCompsByLevel.set(level, hashComputations);
    }
    hashComputations.push({src0: this.left, src1: this.right, dest: this});
    if (!this.left.isLeaf()) {
      (this.left as BranchNode).getHashComputation(level + 1, hashCompsByLevel);
    }
    if (!this.right.isLeaf()) {
      (this.right as BranchNode).getHashComputation(level + 1, hashCompsByLevel);
    }
    return;
  }
  // else stop the recursion, LeafNode should have h0
}
This traversal takes <10% of our current hash time.
- Or we can do it during `ViewDU.commit()`, which has the rebind inside. Need to take a closer look at this, but this approach has more potential because we don't waste another traversal there.
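As a sketch of how the grouped computations from the first option could then be executed: run them deepest-level-first, so every batch only consumes hashes produced by earlier batches. `batchHashInto` is a hypothetical stand-in for a SIMD/hashtree-backed batch hasher, not an existing API:

```ts
// Hypothetical batch hasher: hashes each (src0, src1) pair and writes the result
// into dest. In practice this would be backed by hashtree/SIMD.
declare function batchHashInto(computations: HashComputation[]): void;

function executeByLevel(hashCompsByLevel: Map<number, HashComputation[]>): void {
  // Higher level number = deeper in the tree, so hash the deepest level first;
  // every dest produced at level N+1 is then ready to be used as a src at level N.
  const levels = [...hashCompsByLevel.keys()].sort((a, b) => b - a);
  for (const level of levels) {
    batchHashInto(hashCompsByLevel.get(level)!);
  }
}
```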
In-progress POC to implement batch hashing in ssz/lodestar: https://hackmd.io/zj9N5RIqQfCqYz8Y1Xc_hA?view
With this branch, te/hashtree_hasher (https://github.com/ChainSafe/lodestar/blob/63d3b6a022527f9ee202cf76aaaa778fadbb8316/packages/cli/src/hashtree.ts#L18), it consumes ~1GB more heap memory in lodestar.
cc @wemeetagain @matthewkeil
update: resolved by not extracting uint32 values from the Uint32Array, which was what caused the issue
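My reading of that fix, as a sketch (an assumption about the mechanism, consistent with the Smi/heap-number behaviour described in the hash cache representation section above; the real code path is in the branch linked above):

```ts
// Values read out of a Uint32Array become plain JS numbers; once stored in an
// object property or regular array, anything at or above the Smi range is
// boxed as a heap number - roughly half of all hash words.
function toHashObjectFields(cache: Uint32Array, offset: number): {h0: number; h1: number} {
  return {h0: cache[offset], h1: cache[offset + 1]}; // large values get heap-allocated
}

// Keeping the words inside the typed array and passing views/offsets around
// avoids materializing them as boxed numbers at all.
function rootView(cache: Uint32Array, offset: number): Uint32Array {
  return cache.subarray(offset, offset + 8); // view into the same buffer, no copy, no boxing
}
```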
Note to improve BeaconState.hashTreeRoot()
In order to execute this `HashComputation[][]` data structure from the bottom up, validators need to be hashed first, because they are represented as structs rather than trees, and we do that in batch to improve performance.
To do that there are 2 ways:
- When `commit()` also batch-hashes validators and stores them to `BranchNodeStructs`. This is a little bit unconventional, but it keeps the design of `commit()`: we can do multiple `commit()` calls and do `hashTreeRoot()` at the end. This is implemented in https://github.com/ChainSafe/lodestar/pull/6939
- After `commit()` we come up with a data structure like `{byLevel: HashComputation[][]; bottomNodes: Node[]}`, then the BeaconState type needs to implement `executeHashComputations()` by hashing `bottomNodes` in batch (see the sketch after this list). This is more straightforward but has some drawbacks:
  - After the `commit()` we need to compute roots for `bottomNodes` first. Then if consumers have multiple `commit()` calls, each `commit()` should be followed by `executeHashComputations()`
  - Or after the `commit()`, just save `bottomNodes` without computing their roots. Then either in the next `commit()` call or in `hashTreeRoot()` we have to compute roots for `bottomNodes`. This has the same drawbacks as in 1)
  - Or for each flow, we should have a single `commit()`. This is a big constraint for us in the future; tried this approach in `te/commit_not_compute_root`
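A sketch of the shape described in option 2 and the execution order it implies (a reconstruction of the description above, with hypothetical names; not code from either branch):

```ts
// Option 2: commit() returns the pending work instead of finishing it.
type PendingHashing = {
  byLevel: HashComputation[][]; // branch-node computations grouped by depth (0 = root)
  bottomNodes: Node[];          // struct-backed validator nodes whose roots are not hashed yet
};

// hypothetical batch hashers, standing in for hashtree/SIMD-backed implementations
declare function batchHashBottomNodes(nodes: Node[]): void;
declare function batchHashInto(computations: HashComputation[]): void;

function executeHashComputations(pending: PendingHashing): void {
  // 1. hash the struct-backed bottom nodes first (in batch), so every leaf
  //    referenced by byLevel has a root available
  batchHashBottomNodes(pending.bottomNodes);
  // 2. then run the grouped computations from the deepest level up to the root
  for (let level = pending.byLevel.length - 1; level >= 0; level--) {
    batchHashInto(pending.byLevel[level]);
  }
}
```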