erigon-lib
erigon-lib copied to clipboard
Create a binary version of hex_patricia_hashed - bin_patricia_hashes
hex_patricia_hashed
in the commitment
package is the implementation of the commitment that is currently used in Ethereum - Hexadecimal Patricia Merkle tree with pre-hashes keys.
hex
means that the Merkle tree is hexadecimal (rather than binary), i.e. each node has up to 16 children
patricia
means it follows PATRICIA design (Practical Algorithm to Retrieve Information Coded in Alphanumeric). Notable feature is compressing the runs of nodes having just 1 child, into a single node (extension and leaf nodes).
hashed
means that state keys are not directly used as paths in the tree (as it would be in a conventional radix tree), but they are first pre-processed by Keccak hash function, to resist the creation of specific structures of node to trigger worst-case performances of radix trees.
We need to create a variant of this called bin_patricia_hashed
, with difference that instead of 16, any node may have up to 2 children. This also means that instead of using 4 bits of the hashed key per level of the tree, we need to use 1 bit per level.
Integrating this into prototype and running two variants alongside each other, would give us information about the overhead of binary variant in terms of number of nodes, and consequently, size of commitment files.
Finished implementation of bin patricia trie.
Goals of that implementation are following:
- [x] obtain same hashes as bintrie implementation at go-ethereum
- [x] ability to produce aggregation artifacts as it does by HexPatriciaHashed
- [ ] compare aggregation files sizes from bin and hex trie
First issue were resolved simply by using same hash approach as used in go-eth implementation . That approach uses node paths (here and after node path is a prefix you need to follow from root to that node) as well as node values, which requires more hash re-evaluations on insertion. E.g. in case when we insert new node which splits existing node path - we should re-evaluate all hashes from that node to it's leaves, then re-evaluate hashes for all parental nodes up to the root. IMO, that approach is discussable, but for now, we use it.
Second issue required a bit more complex approach. To produce artifacts, aggregator uses binary branch updates, rewrites them on the fly (replacing hashed keys with stored murmur3
hashes and so on).
I tried to be as much closer to hex updates implementation as i could.
- Added new flag for branch nodes
BRANCH_PART
. - branches (node with L and R child) are encoded as [LSize, LPadding, LPrefix, RSize, RPadding, RPrefix]
- leaves encoded as [PlainKeySize, PlainKey, EncodedAccountSize, EncodedAccount]
- added functions for extracting plainkeys from binary branch updates and their processing.
- use of touchMap and afterMap, but using least two bits. 3 represents branch, 0 represents leaf
During insertion i do not update any hashes, they should be evaluated after ProcessUpdate
is finished. That lead to first deviation from hex updates - branchUpdates does not provide any node hashes (there are no valid hashes at the moment of insertion).
Next, hex branch updates with same prefix could be merged with each other, for binary trie i decided instead of merge just use new version for provided prefix.
That implementation currently does not support:
- deletions
- any storage updates, only balance\nonce updates
- partial hash evaluation instead of full re-hashing on call of
RootHash()
- we could not run erigon2 against chaindata directory processed for binary trie - while reading files we don't know, hex or bin update we met (maybe we should use fallback decode methods and stick to the worked one).
- Implemented deletions and storage updates
- get rid of usage of branch length in node hash
- reimplemented node and branch encoding, get rid of
BRANCH_PART
constant. Now we encode account\storage plain keys, node hashes. Branch node produces just branch hash, which led to correct branch merging after plain keys are replaced with murmur3 hash. - overall test coverage increased, get rid of obsolete tests, fixed update builder (delete update now encoded correctly)
My humble test on goerli
chain up to block 1.200.000:
==> bin <==
index files total
5.1M total
data files total
316M total
==> hex <==
index files total
4.9M total
data files total
203M total
Met an issue with decompressor (out out bound error) both on hex\bin impelmentation, so decided to rebase branch on erigon devel and erigon-lib main branches.
After update, hex commitment seems to proceed correctly, while bin implementation seems quite incorrectly performing valTransform:
panic: runtime error: slice bounds out of range [20:2]
goroutine 50 [running]:
github.com/ledgerwatch/erigon-lib/compress.(*Getter).Next(0x840d894c050, {0x0, 0x0, 0x0})
github.com/ledgerwatch/[email protected]/compress/decompress.go:390 +0x450
github.com/ledgerwatch/erigon-lib/aggregator.(*CommitmentValTransform).commitmentBinValTransform(0x84095657ed8, {0x840afa06720, 0x2b, 0x60}, {0x840cce2cba0, 0x0, 0x60})
github.com/ledgerwatch/[email protected]/aggregator/aggregator.go:1573 +0x378
github.com/ledgerwatch/erigon-lib/aggregator.(*Aggregator).mergeIntoStateFile(0x140001ca000, 0x840aef35890, 0x0, 0x3, 0x0, 0xf423f, {0x140002074d0, 0x25}, 0x84095657de8, 0x104439828, ...)
github.com/ledgerwatch/[email protected]/aggregator/aggregator.go:3344 +0x9c8
github.com/ledgerwatch/erigon-lib/aggregator.(*Aggregator).computeAggregation(0x140001ca000, 0x3, {0x840e8ee3180, 0xb, 0x10}, 0x0, 0xf423f, 0x8408a6f9de8, 0x104439828, 0x1, ...)
github.com/ledgerwatch/[email protected]/aggregator/aggregator.go:3122 +0x380
github.com/ledgerwatch/erigon-lib/aggregator.(*Aggregator).backgroundMerge(0x140001ca000)
github.com/ledgerwatch/[email protected]/aggregator/aggregator.go:1657 +0x668
created by github.com/ledgerwatch/erigon-lib/aggregator.NewAggregator
github.com/ledgerwatch/[email protected]/aggregator/aggregator.go:1123 +0xa18