Feat/marf compression
This PR implements work discovered in #6593. Specifically, it makes the following changes to the on-disk representation of the MARF:
-
If a
TriePtris not a back-block pointer, then theback_blockfield is not stored since it's always 0. -
If a node's children
TriePtrlist is sparse, then instead of storingTriePtr::default()for empty pointers, the system now stores a bitmap of which pointers are non-empty, and then only stores the non-emptyTriePtrs. It only does this if this actually saves space over just storing the entire list (storing the entire list is cheaper if the list is nearly full). -
Instead of copying a trie node from one trie to the next as part of a copy-on-write, the system will only store a patch from the old node to the new node. It will create a list of up to 16 patches across 16 tries before storing a full copy of the node.
In my (very small scale) benchmarks, this saves over 50% of space.
This PR is still a draft and will likely remain so for some time. It needs a lot more unit tests, and it would benefit significantly from fuzz and property testing against the current implementation. In addition, it will need a lot of performance tuning, since the act of reading a list of patch nodes will slow down reads.
Bitmap flag (aka 0xff): Could we use a bit of the Node ID to indicate whether a bitmap is present? This could save one byte per node and rely on the node format described by the Node ID itself.
The need for the sparse TriePtr bitmap flag is to ensure that this byte cannot be interpreted as a TrieNodeID, so I don't think so.
Compression at target: The current code correctly handles both compressed and uncompressed MARF scenarios. Once compression becomes the standard, could we plan on simplify the code to follow a single “linear” path, keeping only the compression-related logic?
I don't think so. We have no way to compel existing node operators to re-compress their chainstate, and they could continue to use uncompressed chainstate even across hard forks.
Node ID as a type: Currently, TriePtr id is represented as u8 (not introduced by this PR). Since we frequently interact with it, checking flags and node types, it might be useful to introduce a dedicated type. This could encapsulate convenient behaviors and avoid repeated conversions and manual flag interpretation. (In that case, this change could be addressed in a separate PR)
That's fine in principle, but that kind of refactoring would change potentially many lines. Let's first make sure the functional test coverage is sufficiently adequate that we feel comfortable shipping this, and then we can worry about the refactoring in a follow-up PR.
Alright, all CI tests pass @federico-stacks
Codecov Report
:x: Patch coverage is 73.45845% with 396 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 63.31%. Comparing base (c9003aa) to head (fdc3e60).
:x: Your project check has failed because the head coverage (63.31%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.
:exclamation: There is a different number of reports uploaded between BASE (c9003aa) and HEAD (fdc3e60). Click for more details.
HEAD has 10 uploads less than BASE
Flag BASE (c9003aa) HEAD (fdc3e60) 73 63
Additional details and impacted files
@@ Coverage Diff @@
## develop #6654 +/- ##
============================================
- Coverage 77.90% 63.31% -14.59%
============================================
Files 580 580
Lines 361187 362181 +994
============================================
- Hits 281374 229307 -52067
- Misses 79813 132874 +53061
| Files with missing lines | Coverage Δ | |
|---|---|---|
| stackslib/src/chainstate/stacks/index/cache.rs | 85.08% <ø> (-11.15%) |
:arrow_down: |
| stackslib/src/chainstate/stacks/index/test/file.rs | 57.89% <ø> (-38.95%) |
:arrow_down: |
| stackslib/src/chainstate/stacks/index/test/mod.rs | 86.27% <100.00%> (-7.76%) |
:arrow_down: |
| stackslib/src/chainstate/stacks/index/trie_sql.rs | 85.36% <100.00%> (ø) |
|
| stackslib/src/chainstate/stacks/index/trie.rs | 84.59% <66.66%> (ø) |
|
| stackslib/src/chainstate/stacks/index/file.rs | 87.45% <60.00%> (-2.58%) |
:arrow_down: |
| stacks-common/src/codec/mod.rs | 80.17% <62.50%> (-4.09%) |
:arrow_down: |
| stackslib/src/chainstate/stacks/index/mod.rs | 46.57% <0.00%> (-1.32%) |
:arrow_down: |
| stackslib/src/chainstate/stacks/index/marf.rs | 78.94% <82.14%> (-3.53%) |
:arrow_down: |
| stackslib/src/chainstate/stacks/index/test/marf.rs | 31.19% <88.40%> (-12.29%) |
:arrow_down: |
| ... and 4 more |
... and 370 files with indirect coverage changes
Continue to review full report in Codecov by Sentry.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update c9003aa...fdc3e60. Read the comment docs.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
- :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.
@federico-stacks Please, do not force-push changes. It makes it very difficult to see what has changed in-between reviews. Instead, please merge the two branches and create a merge commit. Thanks!
@federico-stacks Please, do not force-push changes. It makes it very difficult to see what has changed in-between reviews. Instead, please merge the two branches and create a merge commit. Thanks!
Clear. In this case, I only force-pushed the last commit (the Node patch fix) because I had missed a small related change. I did it that way to keep everything contained in a single commit and make it easier to reference in the related comment (and possibly simplify the review). No commit history was lost.