[Refactor] Refactor of Clarity storage (iteration 1)
❗❗❗ This PR draft is currently being wrapped-up and is NOT READY FOR CONSIDERATION ❗❗❗
Remaining Work:
- [ ] More mainnet sync tests
- [ ] Some additional unit tests are needed, particularly surrounding the Clarity cache
- [ ] Finish DB migration implementation and tests
- [ ] Performance benchmarking (w/hyperfine)
- [ ] Storage benchmarking (size-on-disk)
- [x] Create issues for changes
👋 Introduction
This work was born from a combination of the A/B-tester I have recently been working on and the open umbrella issue #4316. When testing the A/B-tester it became clear that replaying blocks up to a recent chain tip between each bugfix takes far too much time, so I started investigating ways to optimize the Clarity VM specifically, with a focus on the oracle-v1 contracts. Needless to say, there are a number of optimizations which can be made. While other initiatives have been looking for low-hanging fruit, I decided to try tackling some deeper optimizations.
The primary goal of this PR is to lay some groundwork for refactoring & optimizing the persistence layer of the Stacks node, with focus on the Clarity database. I decided to start with two rather large data-structures which are, today, being serialized using serde_json and written/read to SQLite as plain-text JSON. These structures are read + deserialized a number of times during contract execution.
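To make the intent concrete, here is a minimal sketch (not code from this PR -- `ContractMeta` is a made-up stand-in for the real `ContractContext`/`ContractAnalysis` types) contrasting the two storage paths: plain-text JSON via `serde_json` versus binary `speedy` serialization followed by `lz4_flex` compression:

```rust
use lz4_flex::{compress_prepend_size, decompress_size_prepended};
use serde::{Deserialize, Serialize};
use speedy::{Readable, Writable};

// Hypothetical stand-in for the large structures this PR targets.
#[derive(Serialize, Deserialize, Readable, Writable, PartialEq, Debug)]
struct ContractMeta {
    name: String,
    source_len: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let meta = ContractMeta { name: "oracle-v1".into(), source_len: 42_000 };

    // Current approach: plain-text JSON stored directly in SQLite.
    let json = serde_json::to_string(&meta)?;

    // Proposed approach: compact binary encoding, then LZ4 compression.
    let binary = meta.write_to_vec()?;
    let compressed = compress_prepend_size(&binary);

    // Round-trip back to the struct to show nothing is lost.
    let decompressed = decompress_size_prepended(&compressed)?;
    let restored = ContractMeta::read_from_buffer(&decompressed)?;
    assert_eq!(restored, meta);

    println!("json: {} bytes, binary+lz4: {} bytes", json.len(), compressed.len());
    Ok(())
}
```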
As seems to be usual for my core refactoring attempts, this started out as a "small and isolated" change which quickly cascaded into a much larger surface area. The PR in itself is quite large, but I hope that the changes are clear - I did make an effort on documentation :)
⛓ Consensus
While this PR does potentially affect Clarity consensus-critical storage, no behavior should have been changed (i.e. consensus keys, serialization, etc.). Therefore this PR should be seen as non-consensus-critical -- however, an extra eye on the consensus-critical code paths is warranted, just in case I missed something that the tests aren't catching.
✅️ Issues Resolved in this PR
- #4444
- #4449
  - 📌 This issue touches a lot of files, but mostly due to adding `speedy`'s `Readable` and `Writable` derives to types to be (de)serialized.
- #4450
- #4448
- [trivial] #4446
  - 📌 This issue touches a lot of files, but mostly due to adding `fake`'s `Dummy` derive to types to be faked.
- [trivial] #4447
- [trivial] #4440
- [trivial] #4445
- [trivial] #4441
- [trivial] #4442
- [trivial] #4443
🚩 Additional Changes & Improvements
- The `AnalysisDatabase` type has been removed and baked into the `ClarityDatabase` struct instead. This is motivated by the fact that these are two different interfaces working against the same database and tables, and thus moving such logic together helps to reduce the perceived complexity in preparation for a larger refactor. It would have been another story IMO if we were talking about a significant number of analysis-related methods -- but since it's only a few, the additional abstraction only adds complexity and confusion.
  - 📌 This change touches a lot of files, but primarily due to the renaming of `AnalysisDatabase` to `ClarityDatabase` and `analysis_db` to `clarity_db`.
- This PR makes it an error to insert a contract into the backing store more than once, even at different block hashes. Previously there was no check for this, allowing duplicate contracts to be deployed at different block heights, potentially without the caller realizing it. This situation now needs to be handled explicitly by the caller (a toy illustration follows below).
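To illustrate the behavioral change only -- this is a self-contained toy store, not the actual `ClarityDatabase` API -- callers now need to handle a duplicate-insert error along these lines:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the backing store; the real API differs.
#[derive(Debug, PartialEq)]
enum StoreError {
    ContractAlreadyExists(String),
}

#[derive(Default)]
struct ContractStore {
    contracts: HashMap<String, String>, // contract id -> source
}

impl ContractStore {
    fn insert_contract(&mut self, id: &str, source: &str) -> Result<(), StoreError> {
        if self.contracts.contains_key(id) {
            // Previously a second deployment at a different block hash
            // slipped through silently; now the caller must handle it.
            return Err(StoreError::ContractAlreadyExists(id.to_string()));
        }
        self.contracts.insert(id.to_string(), source.to_string());
        Ok(())
    }
}

fn main() {
    let mut store = ContractStore::default();
    assert!(store.insert_contract("SP000.oracle-v1", "(define-read-only ...)").is_ok());
    assert_eq!(
        store.insert_contract("SP000.oracle-v1", "(define-read-only ...)"),
        Err(StoreError::ContractAlreadyExists("SP000.oracle-v1".into()))
    );
}
```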
🧪 Testing
📌 A lot of `/tests/*.rs` etc. files have been touched due to renames and test tweaking.
- Unit tests have been written for new code.
  - ℹ️ Some areas are still uncovered -- I'm working on more tests for them.
- Added new tests for existing code, particularly surrounding the `RollbackWrapper`.
- Existing tests have been adapted to work with the new paradigm, with a few "gotchas":
  - The Clarity cache (#4444) is disabled by default in `test` and `dev` modes. This is because too many tests that rely on very specific control over what does/doesn't exist in the underlying database would otherwise fail.
  - A lot of tests use methods which now expect the contract and/or its analysis to exist in the underlying database, and which therefore failed due to the use of `ClarityDatabase::test_insert_contract_hash()` -- this method only ensured that the contract hash existed in the MARF, so it has been changed to one of `test_insert_contract()` or `test_insert_contract_with_analysis()` in said tests. These two methods have also been added to tests prior to their "need" of this stored data. Related to #4448.
  - A number of tests expect to be able to insert the same contract multiple times, which is now an error. I've adapted these tests as well.
  - ⚠️ Please put extra focus on reviewing all adapted tests -- it is possible that I misunderstood what a test was trying to accomplish, or simply made a mistake.
- All tests in the workspace are passing.
- A release-mode node has been partially synced to mainnet without issue.
  - ℹ️ I'm working on more testing here.
⏰ Performance
The following benchmark shows a simulated single contract write/read on the current `next` branch vs. this branch. The benchmark was performed on a modern Intel CPU with an NVMe drive. Note that these benchmarks use disk-based databases, and that *optimized* includes both `speedy` serialization and `lz4_flex` compression, whereas *next* uses only `serde_json` serialization. No caching is used.
| | next | optimized |
|---|---|---|
| insert | 672.04 us (✅ 1.00x) | 237.78 us (🚀 2.83x faster) |
| select | 295.95 us (✅ 1.00x) | 117.13 us (🚀 2.53x faster) |
ℹ️ More to come with hyperfine + stacks-inspect replay benchmarking
🚩 New Dependencies
- `fake` - Used to help with filling structs with random (but format-correct) data in tests. This is where the `Dummy` derive littered throughout this PR comes from. Currently added as a regular dependency rather than a dev-dependency because I had issues with conditional compilation in test/dev mode -- but it should be possible to move it into `dev-dependencies` with a little more effort. All code using `fake` is conditional on `cfg!(test)`, so it should be optimized away in a normal build anyway (see the sketch after this list).
- `lru` - Used for global caching of Clarity data, including contracts and analyses. Note that `rusqlite` uses this crate, and specifically its `LruCache` struct, internally.
- `lz4_flex` - A pure-Rust implementation of the LZ4 compression algorithm. Currently set to use safe-only code, but `default-features = false` enables its "performance mode" with unsafe code.
- `speedy` - A very fast binary serializer written in pure Rust. This is where the `Readable` and `Writable` derives littered throughout this PR come from. It is only used for node-specific, non-consensus data (i.e. data which was previously stored in the Clarity `metadata_table` SQLite table):
  - Serialized `ContractContext`
  - Serialized `ContractAnalysis`
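For reference, here is a minimal sketch (not code from this PR -- the struct and field names are made up) of how `fake`'s `Dummy` derive is typically used to fill a struct with random but well-formed data in a test:

```rust
use fake::{Dummy, Fake, Faker};

// Hypothetical struct; in the PR the derive is added to real Clarity types.
#[derive(Debug, Dummy)]
struct BlockHeaderFixture {
    // A faker expression constrains the generated value to a sensible range.
    #[dummy(faker = "1..100_000u64")]
    height: u64,
    miner: String,
}

#[test]
fn generates_format_correct_test_data() {
    // Faker fills every field with random, type-appropriate values.
    let fixture: BlockHeaderFixture = Faker.fake();
    assert!(fixture.height >= 1);
}
```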
💡 Some Potential Next Steps
This PR sets the stage for further refactoring of the Clarity (and other) databases. Some ideas for next steps include (but are not limited to):
- Optimizing how Clarity maps are stored. Maps are currently JSON-serialized in their entirety and written to/read from the `metadata_table` database table. By normalizing this structure we could perform much more efficient, targeted reads and writes for specific map keys (see the sketch after this list).
- Clarity lists are another contender, with the same motivation as above. Lists could be normalized, and operations like `index-of`, `element-at`, `slice` etc. could be delegated to the underlying database engine, which specializes in query logic, selecting only the items needed for a particular operation.
- Clarity functions could be normalized to allow fast lookup and validation of contract functions prior to loading the entire contract for a contract call.
- Ensuring all binary data is stored as binary instead of as its hexadecimal string representation.
- Changing serialization of the remainder of non-consensus-critical, node-local data to use `speedy` instead of `serde_json`.
- Employing compression on stored Clarity values which are large.
- Moving on to optimizing some of the other databases, with a focus on migrating binary data stored as hexadecimal strings to raw bytes.
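To illustrate the map-normalization idea from the first bullet, a rough sketch using `rusqlite` with an in-memory database -- the table layout and names here are hypothetical, not an actual proposed schema:

```rust
use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;

    // Hypothetical normalized layout: one row per map entry instead of one
    // JSON blob per map in metadata_table.
    conn.execute(
        "CREATE TABLE clarity_map_entries (
             contract_id TEXT NOT NULL,
             map_name    TEXT NOT NULL,
             entry_key   BLOB NOT NULL,
             entry_value BLOB NOT NULL,
             PRIMARY KEY (contract_id, map_name, entry_key)
         )",
        [],
    )?;

    conn.execute(
        "INSERT INTO clarity_map_entries VALUES (?1, ?2, ?3, ?4)",
        params!["SP000.oracle-v1", "prices", b"BTC".to_vec(), b"42000".to_vec()],
    )?;

    // A targeted read touches a single row; there is no need to deserialize
    // the entire map just to look up one key.
    let value: Vec<u8> = conn.query_row(
        "SELECT entry_value FROM clarity_map_entries
         WHERE contract_id = ?1 AND map_name = ?2 AND entry_key = ?3",
        params!["SP000.oracle-v1", "prices", b"BTC".to_vec()],
        |row| row.get(0),
    )?;
    assert_eq!(value, b"42000".to_vec());

    Ok(())
}
```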
Codecov Report
❌ Patch coverage is 82.89778% with 517 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.40%. Comparing base (383d586) to head (ed6615e).
⚠️ Report is 670 commits behind head on next.
Additional details and impacted files
@@ Coverage Diff @@
## next #4437 +/- ##
==========================================
- Coverage 83.46% 83.40% -0.06%
==========================================
Files 448 455 +7
Lines 324321 326383 +2062
==========================================
+ Hits 270698 272233 +1535
- Misses 53623 54150 +527
| Files with missing lines | Coverage Δ | |
|---|---|---|
| clarity/src/vm/analysis/arithmetic_checker/mod.rs | 92.75% <ø> (ø) | |
| .../src/vm/analysis/contract_interface_builder/mod.rs | 95.20% <100.00%> (ø) | |
| clarity/src/vm/analysis/mod.rs | 95.23% <100.00%> (+0.14%) | ⬆️ |
| clarity/src/vm/analysis/read_only_checker/mod.rs | 87.16% <100.00%> (ø) | |
| clarity/src/vm/analysis/read_only_checker/tests.rs | 100.00% <100.00%> (ø) | |
| clarity/src/vm/analysis/tests/mod.rs | 100.00% <ø> (ø) | |
| clarity/src/vm/analysis/trait_checker/mod.rs | 100.00% <100.00%> (ø) | |
| clarity/src/vm/analysis/type_checker/contexts.rs | 93.50% <100.00%> (ø) | |
| clarity/src/vm/analysis/type_checker/mod.rs | 91.66% <100.00%> (ø) | |
| ...src/vm/analysis/type_checker/v2_05/tests/assets.rs | 100.00% <100.00%> (ø) | |
| ... and 74 more | | |

... and 23 files with indirect coverage changes
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
I realize that this PR may look intimidating. Now that I've taken the time to wrap my head around all of the changes and box in the different moving parts, it would be easier for me (mentally) to break this into smaller PRs if preferred -- in which case I would break it up as follows:
- The following, in no particular order:
- #4441
- #4442
- #4443
- #4445
- #4446
- These issues together:
- #4448
- #4449
- #4450
- Followed by these, in no particular order:
- #4444
- #4440
#4447 could be applied incrementally alongside the above, depending on the code changes.
I'm looking at the `speedy` docs and it looks like the purpose of this crate is to read/write to a buffer with as little overhead as possible. Does it contain any sort of versioning info or other metadata? What happens if in the future we add/remove/reorder fields in a serializable struct?
EDIT: Have you considered `rkyv`? It's similar (fast, minimalistic binary protocol, not "self-describing", lacks schema evolution), but much more widely used (several popular crates support it, such as `hashbrown`) and has some additional features, like a limited form of validation.
> I'm looking at the `speedy` docs and it looks like the purpose of this crate is to read/write to a buffer with as little overhead as possible. Does it contain any sort of versioning info or other metadata? What happens if in the future we add/remove/reorder fields in a serializable struct?
Great question!
When it comes to `speedy`, you can add and remove fields, but:
- to be able to deserialize a `v1 { name }` into a `v2 { name, phone }`, `phone` (the added field) would need the `#[speedy(default_on_eof)]` attribute to indicate that it's OK if it's empty;
- nothing is needed to deserialize from a `v2 { name, phone }` into a `v1 { name }`;
- however, field ordering must be maintained.

Meaning that if there were any ordering changes, a migration would need to be written.
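For illustration, a minimal sketch of that add-a-field case (hypothetical structs, nothing from this PR):

```rust
use speedy::{Readable, Writable};

// "v1" of a hypothetical stored struct.
#[derive(Readable, Writable)]
struct ContactV1 {
    name: String,
}

// "v2" adds a field. `default_on_eof` tells speedy that if the buffer ends
// before this field, it should fall back to Default::default() instead of
// failing -- which is what lets v1 bytes load as v2.
#[derive(Readable, Writable)]
struct ContactV2 {
    name: String,
    #[speedy(default_on_eof)]
    phone: String,
}

fn main() -> Result<(), speedy::Error> {
    // Bytes written by the old (v1) layout...
    let old_bytes = ContactV1 { name: "alice".into() }.write_to_vec()?;

    // ...still deserialize as v2: the missing `phone` becomes "".
    let upgraded = ContactV2::read_from_buffer(&old_bytes)?;
    assert_eq!(upgraded.name, "alice");
    assert_eq!(upgraded.phone, "");

    Ok(())
}
```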
> EDIT: Have you considered `rkyv`? It's similar (fast, minimalistic binary protocol, not "self-describing", lacks schema evolution), but much more widely used (several popular crates support it, such as `hashbrown`) and has some additional features, like a limited form of validation.
Yeah, I had a look at `rkyv` as well (prior to `speedy`) - it's a bit less straightforward, but I'd be more than happy to add a bench and test suite for adding/removing/reordering like I did with `speedy` 👍
EDIT: There is actually a discussion here regarding `speedy` and forwards-compatibility, including a potential solution. Tbh, it doesn't look like there was really any thought put into forwards-/backwards-compatibility in the existing [Stacks] code -- though, then again, field names are serialized... In any case, this is a good discussion to have so that we make sure we account for versioning moving forward.
EDIT: Just wanted to toss in that the actual serialization implementation is only a few lines of code, so it doesn't really matter which encoding we end up with; as long as it has a serde-like interface, it's trivial to swap out prior to merging -- the bigger work, i.e. getting binary serialization + storage in place, is done 😄
> I'd be more than happy to add a bench and test suite for adding/removing/reordering
There are some nice benchmarks of Rust serialization frameworks here. Both are similar in performance, and either one would be much faster than JSON, so the choice should depend on other aspects.
> it doesn't look like there was really any thought into forwards-/backwards-compatibility in the existing [Stacks] code, but then again field names are serialized
JSON is "self-describing" in that the field names, types (to some extent), as well as the overall structure is encoded into the output data. The extra metadata gives the application information on what the data is and what to do with it, so even if the application's internal structures change, it can use that information to ignore unrecognized fields, know if a field is missing and use a default value, or fail with a descriptive error
On the other hand, many of these fast binary formats get their speed by omitting the extra metadata. Instead, they assume that the application knows exactly how the data should be structured. This limits the validation you can do, and the portability between languages, CPU architectures, and even different versions of the application. Sometimes the trade-off is worth it, depending on the use case. What I'm specifically concerned about here is:
- If something changes in the binary format...
  - Is there any chance of silent data corruption, like blindly `memcpy`ing binary data into the wrong variables?
  - In the event of an explicit failure, do we get meaningful error messages, like which struct/field it failed at?
- Would this be more brittle than the current JSON encoding? Would we have to rebuild the chainstate for changes that would be backwards compatible if we were using JSON?
There are also some fast binary serialization formats that are portable and support versioning, like Cap'n Proto or FlatBuffers, but they work by generating code from a custom schema language, so there's significant added complexity (much more work than adding a `#[derive(...)]`). Probably not something we want to invest in right now.