stacks-core [MARF] Investigate sparse trie representations

@kantai and myself suspect that the bulk of redundant data in a MARF index comes from one of a handful of places:

When performing a copy-on-write to produce a new trie, most of the nodes on the trie path will not change very much between two subsequent tries.
When performing a copy-on-write, most trie backpointers will point to the same nodes they did when they were created
A substantial number of leaves will have the (same) empty hash, since their keys are unmapped
We don't use varint encodings for integers, even though most integers fit into fewer than four bytes

I need to figure out how much of a potential space savings we can get by addressing some or all of the above.

Oct 13 '25 18:10 jcnelson

Discoveries so far:

Not storing the back_block field in a TriePtr for non-backptr nodes saves 10-15% of space in tests. This value is always 0 for non-backptr nodes
Not storing the intermediate hashes of each node (which are only needed on Merkle proof construction) saves about 5% of space in tests. These can be regenerated on-the-fly.

Oct 16 '25 02:10 jcnelson

More discoveries:

Compressing sparse lists of TriePtr (whose contents are largely empty pointers) produces another 5-10% space savings. We instead store a bitmap of which TriePtr slots are non-empty, and only store the non-empty TriePtrs.
Storing only the difference in child pointers between a node in an ancestor trie and the copy of the node pulled forward into a new trie via copy-on-write (COW) produces a whopping 30-50% space savings. We limit the "patch depth" to 16 in this test; if a node would need to be patched 17 or more times, we instead store the COW node directly (in order to limit the read amplification of scanning through a sequence of patches across ancestor tries).

I haven't tried this on chainstate yet, but in microbencharks I'm seeing 50%-75% overall size reduction with my patch set.

Nov 01 '25 02:11 jcnelson