[MARF] Investigate sparse trie representations
@kantai and myself suspect that the bulk of redundant data in a MARF index comes from one of a handful of places:
-
When performing a copy-on-write to produce a new trie, most of the nodes on the trie path will not change very much between two subsequent tries.
-
When performing a copy-on-write, most trie backpointers will point to the same nodes they did when they were created
-
A substantial number of leaves will have the (same) empty hash, since their keys are unmapped
-
We don't use varint encodings for integers, even though most integers fit into fewer than four bytes
I need to figure out how much of a potential space savings we can get by addressing some or all of the above.
Discoveries so far:
-
Not storing the
back_blockfield in aTriePtrfor non-backptr nodes saves 10-15% of space in tests. This value is always0for non-backptr nodes -
Not storing the intermediate hashes of each node (which are only needed on Merkle proof construction) saves about 5% of space in tests. These can be regenerated on-the-fly.
More discoveries:
-
Compressing sparse lists of
TriePtr(whose contents are largely empty pointers) produces another 5-10% space savings. We instead store a bitmap of whichTriePtrslots are non-empty, and only store the non-emptyTriePtrs. -
Storing only the difference in child pointers between a node in an ancestor trie and the copy of the node pulled forward into a new trie via copy-on-write (COW) produces a whopping 30-50% space savings. We limit the "patch depth" to 16 in this test; if a node would need to be patched 17 or more times, we instead store the COW node directly (in order to limit the read amplification of scanning through a sequence of patches across ancestor tries).
I haven't tried this on chainstate yet, but in microbencharks I'm seeing 50%-75% overall size reduction with my patch set.