Optimize checkpoint creation
Currently checkpoint creation can be rather slow and we should optimize it, especially as the state grows.
Optimize Checkpoint Creation Design
Context
A checkpoint is a state dump at a given height that facilitates state sync. It consists of many chunks to enable parallel and partial downloads from the p2p network. Chunks are serialized and compressed Merkle proofs, which prevents malicious nodes from spamming with invalid data.
Creation is currently optional (altruistic). To achieve deterministic output while still allowing some configuration (e.g. chunk size), some parameters are stored as part of the runtime/consensus policy.
Motivation
Creating a checkpoint for the latest height of Sapphire Mainnet with state pruning enabled takes a matter of minutes. With pruning disabled, checkpoint creation takes hours (bigger overall state -> slower reads).
Creation needs to be fast, so that we can enable it on bulkier nodes, possibly with a smaller creation interval.
Requirements
Checkpoint creation should be:
- R1: Nodedb implementation agnostic.
- R2: Deterministic.
- R3: Partial and provable, to facilitate fetching from an untrusted p2p network.
- R4: Interoperable, i.e. the network may consist of peers broadcasting different versions at the same time.
  - Mixing chunks from different versions is of course not allowed.
- R5: Introducing a new version should not introduce availability issues/overload:
  - If we already have only a few peers per checkpoint, further splitting it may be problematic.
Existing Implementation
Currently, chunks are created by iterating over the keyset for the given height. Once the iterator's proof builder exceeds the `chunk_size` limit, a chunk is written to a file. Next, we create a new iterator starting at the previous offset and repeat the procedure until no keys are left.
Internally, this iteration triggers an in-order traversal of the state trie, resulting in repeated calls to Badger's `txn.Get`.
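To make the shape of this loop concrete, below is a minimal Go sketch. The `Tree`, `Iterator` and `ProofBuilder` interfaces and the `createChunks` helper are hypothetical stand-ins for the real mkvs types, not the actual oasis-core code; the point is only to show why each chunk restarts the traversal and why every key resolution ends up in `txn.Get`.

```go
package checkpoint

// Hypothetical, simplified stand-ins for the real mkvs/nodedb types.
type Iterator interface {
	Next() bool
	Key() []byte
	Value() []byte
}

type Tree interface {
	// NewIterator returns an in-order iterator starting strictly after offset
	// (nil means from the beginning). Every Next() descends the state trie and
	// issues repeated txn.Get calls against the underlying Badger store.
	NewIterator(offset []byte) Iterator
}

type ProofBuilder interface {
	Include(key, value []byte)
	Size() int
	Serialize() []byte
}

// createChunks sketches the current sequential approach: iterate the keyset in
// order, cut a new chunk whenever the proof builder exceeds chunkSize, then
// restart a fresh iterator at the last emitted key.
func createChunks(tree Tree, newProof func() ProofBuilder, chunkSize int, write func([]byte) error) error {
	var offset []byte
	for {
		it := tree.NewIterator(offset)
		pb := newProof()

		done := true
		for it.Next() {
			pb.Include(it.Key(), it.Value())
			if pb.Size() >= chunkSize {
				offset = it.Key() // the next chunk resumes after this key
				done = false
				break
			}
		}
		if pb.Size() > 0 {
			if err := write(pb.Serialize()); err != nil {
				return err
			}
		}
		if done {
			return nil // no keys left
		}
	}
}
```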
Benchmarks and Profiling
Reproducibility
- CPU: 12th Gen Intel i7-12800H (20) @ 4.700GHz
- GPU: Intel Alder Lake-P
- Memory: 31760MiB
- Host: ThinkPad P1 Gen 5
- OS: Arch Linux x86_64
Manual Benchmarks
Sapphire Mainnet was synced from genesis (no pruning) up to height ~5.5M. Pathbadger was used as the underlying nodedb implementation.
- Creating a checkpoint manually (node not running) at height ~5.5M took ~105 min.
  - At this height the checkpoint size is ~4.1 GB and corresponds to ~7 GB of state if restored into a fresh db instance.
- Creating a checkpoint for the same height whilst syncing (node running) seems to take ~200-300 min.
- When creating checkpoints whilst syncing, the time to chunk equivalent state keeps increasing.
  - As the state grows, Badger (an LSM tree) has more and bigger SSTs, meaning read operations become slower.
Profiling
- Profiling checkpoint creation showed that ~85-90% of memory and CPU can be attributed to Badger's `txn.Get`.
- Surprisingly, the defensive `checkRootExists` (again `txn.Get`) accounts for ~12% of all allocated memory (~60/450 GB).
  - Commenting out the defensive check or reading in batches reduces allocations, but brings no performance gain.
Proposed Design
Looking at the profiling section it is clear the current bottleneck is `txn.Get`, i.e. we are hitting Badger's sequential read limit.
We may be tempted to use Badger's native `Iterator`; however, neither the badger nor the pathbadger nodedb implementation stores all keys of a given version under the same prefix. Moreover, neither of them keeps keys (nodes) sorted in the trie's in-order traversal order. This is for a good reason, as otherwise every version would need to copy all shared nodes and order them.
We are left with two options:
1. Tune the BadgerDB configuration to make read operations more performant and/or replace it with a faster database.
2. Parallelize checkpoint creation, given this is an IO-heavy problem.
The remaining proposal focuses on 2., as 1. is out of scope and would have wider implications for overall node performance.
Parallel Checkpoint Creation
Challenges
- We cannot assume the state trie is balanced, even for EVM-based paratimes (all EVM keys are 32 bytes):
  - Paratimes place their own prefixes.
  - There is additional state/accounts outside of the EVM.
- Currently `mkvs.Tree` does not support the notion of a `Subtree` or `SubtreeIterator`.
Solution
Instead of relying on the iterator, we may want to do a manual in-order traversal of the nodedb for the given height, starting at the root. A parallel algorithm for such a traversal was prototyped as part of #6204, adhering to R1, R2, and R3, and producing a 5x speed-up.
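To illustrate one possible shape of such a traversal (a sketch only, with hypothetical `NodeDB`/`Node` types; the actual prototype lives in #6204 and may differ): recurse from the root, let spare workers pick up left subtrees concurrently, and concatenate results left-then-right so that the output order stays deterministic.

```go
package checkpoint

import "sync"

// Hypothetical node/nodedb abstractions; the real types live under go/storage/mkvs.
type Hash [32]byte

type Node struct {
	Leaf        bool
	Key, Value  []byte
	Left, Right Hash // child hashes; the zero value means no child
}

type NodeDB interface {
	GetNode(h Hash) (*Node, error) // resolve a node of the checkpointed version by hash
}

// traverse does an in-order traversal of the subtree rooted at h. Left subtrees
// are handed to spare workers (bounded by the sem channel), while results are
// concatenated left-then-right, so the emitted key order -- and therefore the
// produced chunks -- stays deterministic (R2). A real implementation would
// stream key/value pairs into a proof builder instead of accumulating them.
func traverse(db NodeDB, h Hash, sem chan struct{}) ([][2][]byte, error) {
	if h == (Hash{}) {
		return nil, nil
	}
	n, err := db.GetNode(h)
	if err != nil {
		return nil, err
	}
	if n.Leaf {
		return [][2][]byte{{n.Key, n.Value}}, nil
	}

	var (
		wg   sync.WaitGroup
		left [][2][]byte
		lerr error
	)
	wg.Add(1)
	doLeft := func() {
		defer wg.Done()
		left, lerr = traverse(db, n.Left, sem)
	}
	select {
	case sem <- struct{}{}: // spare worker slot: traverse the left subtree concurrently
		go func() {
			defer func() { <-sem }()
			doLeft()
		}()
	default: // no slot free: recurse inline to keep the number of goroutines bounded
		doLeft()
	}

	right, rerr := traverse(db, n.Right, sem) // right subtree in the current goroutine
	wg.Wait()

	if lerr != nil {
		return nil, lerr
	}
	if rerr != nil {
		return nil, rerr
	}
	return append(left, right...), nil
}
```

A caller would invoke it roughly as `traverse(db, root, make(chan struct{}, runtime.NumCPU()))`; wrapping the emitted ranges into provable chunks (R3) is omitted from the sketch.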
Implementation and Validation Plan
The idea is to split this task into three separate issues:
- #6216:
  - Update: likely this will be avoided as we don't need to release a new version: https://github.com/oasisprotocol/oasis-core/pull/6204#issuecomment-2964592470
- #6218:
  - ~#6165~
- #6204
  - Fuzzers are also introduced as part of this PR, to account for the more complex logic. We may get them in earlier.
- #6219, tackling R4 and R5.
Deployment Plan / Testnet Testing
?
The plan seems good to me. Can you create sub-issues for the above?
For deployment/testing, the important thing is that this can be rolled out incrementally without causing problems. E.g. something like the following:
- Add support for restoring from the new version of checkpoints.
- Still create new checkpoints using the old version by default.
- Prioritize the new version when sorting checkpoints to use.
- After most nodes are upgraded to the new version that supports the new checkpoint format, switch to creating new checkpoints using the new version.
- After a while, remove support for the old version.
Add support for restoring from the new version of checkpoints.
I would say restoration is already version agnostic. Furthermore, with the version being part of the checkpoint hash, you already cannot mix chunks from two incompatible versions. The bookkeeping and dir structure, however, are not. (#6216)
Prioritize the new version when sorting checkpoints to use.
From the client perspective, shouldn't we only care that the checkpoint we choose has enough peers serving it? Both v1 and the new v2 produce a checkpoint of the same size (just different chunks/order). Performance-wise it doesn't really matter for the client.
An alternative is that the chunking algorithm version becomes part of the consensus parameters, like chunk size, so that it is configurable in the policy? A lighter approach I was thinking of is to make it configurable via the config file, eventually defaulting to the latest version, like nodedb. This would be useful for e2e tests, as well as for incremental deployment. (issue 3)
Will create issues :)
From the client perspective, shouldn't we only care that the checkpoint we choose has enough peers serving it?
True, this prioritization on the client doesn't really matter.
An alternative is that the chunking algorithm version becomes part of the consensus parameters, like chunk size, so that it is configurable in the policy? A lighter approach I was thinking of is to make it configurable via the config file, eventually defaulting to the latest version, like nodedb. This would be useful for e2e tests, as well as for incremental deployment. (issue 3)
My thinking was that this would be an optional configuration parameter. The default would then change across releases as mentioned above.
For the future, if we want to optimize further, we could take some inspiration from snap sync (repo and blog post).
The idea would be:
1. An additional key value db instance that maintains sorted key value pairs for all leaves of the latest version (no mkvs):
   a. A side service that "applies" the writelog for every finalized version to this db instance -> only leaves for the latest version are kept.
2. Creating checkpoint chunks means pausing 1.a and:
   a. Iterating over the key value db above using the native badger db iterator (an order of magnitude faster).
   b. Limiting the number of leaves (chunk size).
   c. Generating an mkvs subtree in memory using the entries from 2.b -> get the subroot hash, which points to mkvs (internal node)?
   d. Fetching the merkle proof for the subroot from mkvs?
3. The chunk format is thus changed to writelog + merkle proof.
4. The client creates a subtree by applying the writelog and only validates the merkle proof for the subroot for every chunk.
The proposed solution is much simplified compared to the protocol linked above, i.e. we still want to keep checkpointing at a fixed interval.
At a quick look it seems we already have all the building blocks, unless there is a major flaw in the proposed logic above. This would adhere to R1-R3.
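To make the flow above concrete, here is a rough Go sketch of steps 2.a-2.d. All types and helpers (`FlatLeafDB`, `SubtreeBuilder`, `ProofSource`, `createChunksFromLeaves`) are hypothetical, and whether the in-memory subroot always maps onto a node that is provable from the checkpoint root is exactly the open question marked with "?" in steps c and d.

```go
package checkpoint

// Hypothetical abstractions for the sketch above; none of these are existing
// oasis-core APIs.
type KV struct{ Key, Value []byte }

type FlatLeafDB interface {
	// Iterate walks all leaves of the latest version in key order, backed by a
	// native badger iterator. Returning false from fn stops the iteration.
	Iterate(fn func(kv KV) bool) error
}

type SubtreeBuilder interface {
	Apply(writeLog []KV) // build an in-memory mkvs subtree from the entries (2.c)
	RootHash() []byte    // subroot hash of the built subtree
}

type ProofSource interface {
	Proof(subroot []byte) ([]byte, error) // merkle proof for the subroot (2.d)
}

// Chunk is the proposed new format: a writelog plus a merkle proof anchoring
// the corresponding subroot.
type Chunk struct {
	WriteLog []KV
	Proof    []byte
}

func createChunksFromLeaves(leaves FlatLeafDB, newSubtree func() SubtreeBuilder, proofs ProofSource, chunkSize int) ([]Chunk, error) {
	var (
		chunks []Chunk
		batch  []KV
	)
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		st := newSubtree()
		st.Apply(batch)                           // 2.c: in-memory subtree -> subroot
		proof, err := proofs.Proof(st.RootHash()) // 2.d: proof for the subroot
		if err != nil {
			return err
		}
		chunks = append(chunks, Chunk{WriteLog: batch, Proof: proof})
		batch = nil
		return nil
	}

	var flushErr error
	if err := leaves.Iterate(func(kv KV) bool { // 2.a: fast flat iteration
		batch = append(batch, kv)
		if len(batch) >= chunkSize { // 2.b: limit the number of leaves per chunk
			flushErr = flush()
		}
		return flushErr == nil
	}); err != nil {
		return nil, err
	}
	if flushErr != nil {
		return nil, flushErr
	}
	if err := flush(); err != nil { // final partial chunk
		return nil, err
	}
	return chunks, nil
}
```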
Yeah this makes sense to me.
In general you could split the mkvs into two dbs: one for storing the tree nodes (without leaves) needed for quickly building proofs and one for storing the leaves.
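Purely as an illustration of that split (hypothetical interfaces, not an existing oasis-core API), the two stores could look roughly like this:

```go
package checkpoint

// Hypothetical sketch of the two-db split: one store holds internal nodes
// (enough to quickly build proofs), the other holds leaves in key order
// (enough to iterate fast with a native badger iterator).
type Hash [32]byte

type NodeStore interface {
	GetInternalNode(h Hash) ([]byte, error)    // serialized internal node
	PutInternalNode(h Hash, node []byte) error
}

type LeafStore interface {
	Get(key []byte) ([]byte, error)
	Put(key, value []byte) error
	// IterateInOrder walks leaves in key order, e.g. via badger's native iterator.
	IterateInOrder(fn func(key, value []byte) bool) error
}
```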