[Refactor] Refactor of Clarity storage (iteration 1)
❗❗❗ This PR draft is currently being wrapped-up and is NOT READY FOR CONSIDERATION ❗❗❗
Remaining Work:
- [ ] More mainnet sync tests
- [ ] Some additional unit tests are needed, particularly surrounding the Clarity cache
- [ ] Finish DB migration implementation and tests
- [ ] Performance benchmarking (w/hyperfine)
- [ ] Storage benchmarking (size-on-disk)
- [x] Create issues for changes
👋 Introduction
This work was born from a combination of the A/B-tester I have recently been working on and the open umbrella issue #4316. When testing the A/B-tester it became clear that replaying blocks up to a recent chain tip between each bugfix takes far too much time, so I started investigating ways to optimize the Clarity VM specifically, with a focus on the oracle-v1 contracts. Needless to say, there are a number of optimizations which can be made. While other initiatives have been looking for low-hanging fruit, I decided to try tackling some deeper optimizations.
The primary goal of this PR is to lay some groundwork for refactoring & optimizing the persistence layer of the Stacks node, with focus on the Clarity database. I decided to start with two rather large data-structures which are, today, being serialized using serde_json and written/read to SQLite as plain-text JSON. These structures are read + deserialized a number of times during contract execution.
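To make the intent concrete, here is a minimal sketch (not code from this PR -- `ContractMeta` is a made-up stand-in for the real `ContractContext`/`ContractAnalysis` types) contrasting the two storage paths: plain-text JSON via `serde_json` versus binary `speedy` serialization followed by `lz4_flex` compression:

```rust
use lz4_flex::{compress_prepend_size, decompress_size_prepended};
use serde::{Deserialize, Serialize};
use speedy::{Readable, Writable};

// Hypothetical stand-in for the large structures this PR targets.
#[derive(Serialize, Deserialize, Readable, Writable, PartialEq, Debug)]
struct ContractMeta {
    name: String,
    source_len: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let meta = ContractMeta { name: "oracle-v1".into(), source_len: 42_000 };

    // Current approach: plain-text JSON stored directly in SQLite.
    let json = serde_json::to_string(&meta)?;

    // Proposed approach: compact binary encoding, then LZ4 compression.
    let binary = meta.write_to_vec()?;
    let compressed = compress_prepend_size(&binary);

    // Round-trip back to the struct to show nothing is lost.
    let decompressed = decompress_size_prepended(&compressed)?;
    let restored = ContractMeta::read_from_buffer(&decompressed)?;
    assert_eq!(restored, meta);

    println!("json: {} bytes, binary+lz4: {} bytes", json.len(), compressed.len());
    Ok(())
}
```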
As seems to be usual for my core refactoring attempts, this started out as a "small and isolated" change which quickly cascaded into a much larger surface area. The PR in itself is quite large, but I hope that the changes are clear - I did make an effort on documentation :)
⛓ Consensus
While this PR does potentially affect Clarity consensus-critical storage, no behavior should have been changed (i.e. consensus keys, serialization, etc.). Therefore this PR should be seen as non-consensus-critical -- however, an extra eye on the consensus-critical code paths is warranted, just in case I missed something that the tests aren't catching.
✅️ Issues Resolved in this PR
- #4444
- #4449
  - 📌 This issue touches a lot of files, but mostly due to adding `speedy`'s `Readable` and `Writable` derives to types to be (de)serialized.
- #4450
- #4448
- [trivial] #4446
  - 📌 This issue touches a lot of files, but mostly due to adding `fake`'s `Dummy` derive to types to be faked.
- [trivial] #4447
- [trivial] #4440
- [trivial] #4445
- [trivial] #4441
- [trivial] #4442
- [trivial] #4443
🚩 Additional Changes & Improvements
- The `AnalysisDatabase` type has been removed and baked into the `ClarityDatabase` struct instead. This is motivated by the fact that these are two different interfaces working against the same database and tables, and thus moving such logic together helps to reduce the perceived complexity in preparation for a larger refactor. It would have been another story IMO if we were talking about a significant number of analysis-related methods -- but since it's only a few, the additional abstraction only adds complexity and confusion.
  - 📌 This change touches a lot of files, but primarily due to the renaming of `AnalysisDatabase` to `ClarityDatabase` and `analysis_db` to `clarity_db`.
- This PR makes it an error to insert a contract into the backing store more than once, even at different block hashes. Previously there was no check for this, allowing duplicate contracts to be deployed at different block heights, potentially without the caller realizing it. This situation now needs to be handled explicitly by the caller (a toy illustration follows below).
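To illustrate the behavioral change only -- this is a self-contained toy store, not the actual `ClarityDatabase` API -- callers now need to handle a duplicate-insert error along these lines:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the backing store; the real API differs.
#[derive(Debug, PartialEq)]
enum StoreError {
    ContractAlreadyExists(String),
}

#[derive(Default)]
struct ContractStore {
    contracts: HashMap<String, String>, // contract id -> source
}

impl ContractStore {
    fn insert_contract(&mut self, id: &str, source: &str) -> Result<(), StoreError> {
        if self.contracts.contains_key(id) {
            // Previously a second deployment at a different block hash
            // slipped through silently; now the caller must handle it.
            return Err(StoreError::ContractAlreadyExists(id.to_string()));
        }
        self.contracts.insert(id.to_string(), source.to_string());
        Ok(())
    }
}

fn main() {
    let mut store = ContractStore::default();
    assert!(store.insert_contract("SP000.oracle-v1", "(define-read-only ...)").is_ok());
    assert_eq!(
        store.insert_contract("SP000.oracle-v1", "(define-read-only ...)"),
        Err(StoreError::ContractAlreadyExists("SP000.oracle-v1".into()))
    );
}
```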
🧪 Testing
📌 A lot of `/tests/*.rs` etc. files have been touched due to renames and test tweaking.
- Unit tests have been written for new code.
  - ℹ️ Some areas are still uncovered -- I'm working on more tests for them.
- Added new tests for existing code, particularly surrounding the `RollbackWrapper`.
- Existing tests have been adapted to work with the new paradigm, with a few "gotchas":
  - The Clarity cache (#4444) is disabled by default in `test` and `dev` modes. This is because too many tests that rely on very specific control over what does/doesn't exist in the underlying database would otherwise fail.
  - A lot of tests use methods which now expect the contract and/or its analysis to exist in the underlying database, and which therefore failed due to the use of `ClarityDatabase::test_insert_contract_hash()` -- this method only ensured that the contract hash existed in the MARF, so it has been changed to one of `test_insert_contract()` or `test_insert_contract_with_analysis()` in said tests. These two methods have also been added to tests prior to their "need" of this stored data. Related to #4448.
  - A number of tests expect to be able to insert the same contract multiple times, which is now an error. I've adapted these tests as well.
  - ⚠️ Please put extra focus on reviewing all adapted tests -- it is possible that I misunderstood what a test was trying to accomplish, or simply made a mistake.
- All tests in the workspace are passing.
- A release-mode node has been partially synced to mainnet without issue.
  - ℹ️ I'm working on more testing here.
⏰ Performance
The following benchmark shows a simulated single contract write/read on the current `next` branch vs. this branch. The benchmark was performed on a modern Intel CPU with an NVMe drive. Note that these benchmarks use disk-based databases, and that *optimized* includes both `speedy` serialization and `lz4_flex` compression, whereas *next* uses only `serde_json` serialization. No caching is used.
| | next | optimized |
|---|---|---|
| insert | 672.04 us (✅ 1.00x) | 237.78 us (🚀 2.83x faster) |
| select | 295.95 us (✅ 1.00x) | 117.13 us (🚀 2.53x faster) |
ℹ️ More to come with hyperfine + stacks-inspect replay benchmarking
🚩 New Dependencies
- `fake` - Used to help with filling structs with random (but format-correct) data in tests. This is where the `Dummy` derive littered throughout this PR comes from. Currently added as a regular dependency rather than a dev-dependency because I had issues with conditional compilation in test/dev mode -- but it should be possible to move it into `dev-dependencies` with a little more effort. All code using `fake` is conditional on `cfg!(test)`, so it should be optimized away in a normal build anyway (see the sketch after this list).
- `lru` - Used for global caching of Clarity data, including contracts and analyses. Note that `rusqlite` uses this crate, and specifically its `LruCache` struct, internally.
- `lz4_flex` - A pure-Rust implementation of the LZ4 compression algorithm. Currently set to use safe-only code, but `default-features = false` enables its "performance mode" with unsafe code.
- `speedy` - A very fast binary serializer written in pure Rust. This is where the `Readable` and `Writable` derives littered throughout this PR come from. It is only used for node-specific, non-consensus data (i.e. data which was previously stored in the Clarity `metadata_table` SQLite table):
  - Serialized `ContractContext`
  - Serialized `ContractAnalysis`
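For reference, here is a minimal sketch (not code from this PR -- the struct and field names are made up) of how `fake`'s `Dummy` derive is typically used to fill a struct with random but well-formed data in a test:

```rust
use fake::{Dummy, Fake, Faker};

// Hypothetical struct; in the PR the derive is added to real Clarity types.
#[derive(Debug, Dummy)]
struct BlockHeaderFixture {
    // A faker expression constrains the generated value to a sensible range.
    #[dummy(faker = "1..100_000u64")]
    height: u64,
    miner: String,
}

#[test]
fn generates_format_correct_test_data() {
    // Faker fills every field with random, type-appropriate values.
    let fixture: BlockHeaderFixture = Faker.fake();
    assert!(fixture.height >= 1);
}
```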
💡 Some Potential Next Steps
This PR sets the stage for further refactoring of the Clarity (and other) databases. Some ideas for next steps include (but are not limited to):
- Optimizing how Clarity maps are stored. Maps are currently JSON-serialized in their entirety and written to/read from the `metadata_table` database table. By normalizing this structure we could perform much more efficient, targeted reads and writes for specific map keys (see the sketch after this list).
- Clarity lists are another contender, with the same motivation as above. Lists could be normalized, and operations like `index-of`, `element-at`, `slice` etc. could be delegated to the underlying database engine, which specializes in query logic, selecting only the items needed for a particular operation.
- Clarity functions could be normalized to allow fast lookup and validation of contract functions prior to loading the entire contract for a contract call.
- Ensuring all binary data is stored as binary instead of as its hexadecimal string representation.
- Changing serialization of the remainder of non-consensus-critical, node-local data to use `speedy` instead of `serde_json`.
- Employing compression on stored Clarity values which are large.
- Moving on to optimizing some of the other databases, with a focus on migrating binary data stored as hexadecimal strings to raw bytes.
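To illustrate the map-normalization idea from the first bullet, a rough sketch using `rusqlite` with an in-memory database -- the table layout and names here are hypothetical, not an actual proposed schema:

```rust
use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;

    // Hypothetical normalized layout: one row per map entry instead of one
    // JSON blob per map in metadata_table.
    conn.execute(
        "CREATE TABLE clarity_map_entries (
             contract_id TEXT NOT NULL,
             map_name    TEXT NOT NULL,
             entry_key   BLOB NOT NULL,
             entry_value BLOB NOT NULL,
             PRIMARY KEY (contract_id, map_name, entry_key)
         )",
        [],
    )?;

    conn.execute(
        "INSERT INTO clarity_map_entries VALUES (?1, ?2, ?3, ?4)",
        params!["SP000.oracle-v1", "prices", b"BTC".to_vec(), b"42000".to_vec()],
    )?;

    // A targeted read touches a single row; there is no need to deserialize
    // the entire map just to look up one key.
    let value: Vec<u8> = conn.query_row(
        "SELECT entry_value FROM clarity_map_entries
         WHERE contract_id = ?1 AND map_name = ?2 AND entry_key = ?3",
        params!["SP000.oracle-v1", "prices", b"BTC".to_vec()],
        |row| row.get(0),
    )?;
    assert_eq!(value, b"42000".to_vec());

    Ok(())
}
```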
Codecov Report
❌ Patch coverage is 82.89778% with 517 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.40%. Comparing base (383d586) to head (ed6615e).
⚠️ Report is 670 commits behind head on next.
Additional details and impacted files
@@ Coverage Diff @@
## next #4437 +/- ##
==========================================
- Coverage 83.46% 83.40% -0.06%
==========================================
Files 448 455 +7
Lines 324321 326383 +2062
==========================================
+ Hits 270698 272233 +1535
- Misses 53623 54150 +527
| Files with missing lines | Coverage Δ | |
|---|---|---|
| clarity/src/vm/analysis/arithmetic_checker/mod.rs | 92.75% <ø> (ø) | |
| .../src/vm/analysis/contract_interface_builder/mod.rs | 95.20% <100.00%> (ø) | |
| clarity/src/vm/analysis/mod.rs | 95.23% <100.00%> (+0.14%) | ⬆️ |
| clarity/src/vm/analysis/read_only_checker/mod.rs | 87.16% <100.00%> (ø) | |
| clarity/src/vm/analysis/read_only_checker/tests.rs | 100.00% <100.00%> (ø) | |
| clarity/src/vm/analysis/tests/mod.rs | 100.00% <ø> (ø) | |
| clarity/src/vm/analysis/trait_checker/mod.rs | 100.00% <100.00%> (ø) | |
| clarity/src/vm/analysis/type_checker/contexts.rs | 93.50% <100.00%> (ø) | |
| clarity/src/vm/analysis/type_checker/mod.rs | 91.66% <100.00%> (ø) | |
| ...src/vm/analysis/type_checker/v2_05/tests/assets.rs | 100.00% <100.00%> (ø) | |
| ... and 74 more | | |

... and 23 files with indirect coverage changes
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
I realize that this PR may look intimidating. Now that I've taken the time to wrap my head around all of the changes and box in the different moving parts, it would be easier for me (mentally) to break this into smaller PRs if preferred -- in which case I would break it up as follows:
- The following, in no particular order:
- #4441
- #4442
- #4443
- #4445
- #4446
- These issues together:
- #4448
- #4449
- #4450
- Followed by these, in no particular order:
- #4444
- #4440
#4447 could be applied incrementally alongside the above, depending on the code changes.
I'm looking at the `speedy` docs and it looks like the purpose of this crate is to read/write to a buffer with as little overhead as possible. Does it contain any sort of versioning info or other metadata? What happens if in the future we add/remove/reorder fields in a serializable struct?
EDIT: Have you considered `rkyv`? It's similar (fast, minimalistic binary protocol, not "self-describing", lacks schema evolution), but much more widely used (several popular crates support it, such as `hashbrown`) and has some additional features, like a limited form of validation.
> I'm looking at the `speedy` docs and it looks like the purpose of this crate is to read/write to a buffer with as little overhead as possible. Does it contain any sort of versioning info or other metadata? What happens if in the future we add/remove/reorder fields in a serializable struct?
Great question!
When it comes to `speedy`, you can add and remove fields, but:
- to be able to deserialize a `v1 { name }` into a `v2 { name, phone }`, `phone` (the added field) would need the `#[speedy(default_on_eof)]` attribute to indicate that it's OK if it's empty;
- nothing is needed to deserialize from a `v2 { name, phone }` into a `v1 { name }`;
- however, field ordering must be maintained.

Meaning that if there were any ordering changes, a migration would need to be written.
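For illustration, a minimal sketch of that add-a-field case (hypothetical structs, nothing from this PR):

```rust
use speedy::{Readable, Writable};

// "v1" of a hypothetical stored struct.
#[derive(Readable, Writable)]
struct ContactV1 {
    name: String,
}

// "v2" adds a field. `default_on_eof` tells speedy that if the buffer ends
// before this field, it should fall back to Default::default() instead of
// failing -- which is what lets v1 bytes load as v2.
#[derive(Readable, Writable)]
struct ContactV2 {
    name: String,
    #[speedy(default_on_eof)]
    phone: String,
}

fn main() -> Result<(), speedy::Error> {
    // Bytes written by the old (v1) layout...
    let old_bytes = ContactV1 { name: "alice".into() }.write_to_vec()?;

    // ...still deserialize as v2: the missing `phone` becomes "".
    let upgraded = ContactV2::read_from_buffer(&old_bytes)?;
    assert_eq!(upgraded.name, "alice");
    assert_eq!(upgraded.phone, "");

    Ok(())
}
```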
> EDIT: Have you considered `rkyv`? It's similar (fast, minimalistic binary protocol, not "self-describing", lacks schema evolution), but much more widely used (several popular crates support it, such as `hashbrown`) and has some additional features, like a limited form of validation.
Yeah, I had a look at `rkyv` as well (prior to `speedy`) - it's a bit less straightforward, but I'd be more than happy to add a bench and test suite for adding/removing/reordering like I did with `speedy` 👍
EDIT: There is actually a discussion here regarding `speedy` and forwards-compatibility, including a potential solution. Tbh, it doesn't look like there was really any thought put into forwards-/backwards-compatibility in the existing [Stacks] code -- though, then again, field names are serialized... In any case, this is a good discussion to have so that we make sure we account for versioning moving forward.
EDIT: Just wanted to toss in that the actual serialization implementation is only a few lines of code, so it doesn't really matter which encoding we end up with; as long as it has a serde-like interface, it's trivial to swap out prior to merging -- the bigger work, i.e. getting binary serialization + storage in place, is done 😄
> I'd be more than happy to add a bench and test suite for adding/removing/reordering
There are some nice benchmarks of Rust serialization frameworks here. Both are similar in performance, and either one would be much faster than JSON, so the choice should depend on other aspects.
> it doesn't look like there was really any thought into forwards-/backwards-compatibility in the existing [Stacks] code, but then again field names are serialized
JSON is "self-describing" in that the field names, types (to some extent), as well as the overall structure is encoded into the output data. The extra metadata gives the application information on what the data is and what to do with it, so even if the application's internal structures change, it can use that information to ignore unrecognized fields, know if a field is missing and use a default value, or fail with a descriptive error
On the other hand, many of these fast binary formats get their speed by omitting the extra metadata. Instead, they assume that the application knows exactly how the data should be structured. This limits the validation you can do, and the portability between languages, CPU architectures, and even different versions of the application. Sometimes the trade-off is worth it, depending on the use case. What I'm specifically concerned about here is:
- If something changes in the binary format...
  - Is there any chance of silent data corruption, like blindly `memcpy`ing binary data into the wrong variables?
  - In the event of an explicit failure, do we get meaningful error messages, like which struct/field it failed at?
- Would this be more brittle than the current JSON encoding? Would we have to rebuild the chainstate for changes that would be backwards compatible if we were using JSON?
There are also some fast binary serialization formats that are portable and support versioning, like Cap'n Proto or FlatBuffers, but they work by generating code from a custom schema language, so there's significant added complexity (much more work than adding a `#[derive(...)]`). Probably not something we want to invest in right now.