State sync write db
Description
Currently state sync accumulates all key-values in memory and only writes them to disk at the end.
This increases memory usage and can cause an out-of-memory crash.
Also, re-encoding key-values back into trie nodes may produce an inconsistent trie (e.g. the state V0 to V1 migration).
State sync responses in `no_proof=false` mode contain the original encoded trie nodes,
which can be verified recursively (against the block header state root hash) and written to the database as-is.
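For intuition, the recursive check can be sketched with toy types as below (the real implementation decodes nodes with the trie codec and uses the runtime's 256-bit hasher; `Node`, `node_hash`, and `verify_reachable` here are illustrative only, not the actual sc-network-sync code):

```rust
use std::collections::HashMap;
use std::hash::{Hash as _, Hasher};

type Hash = u64;

// Toy node: the hashes of the children it references plus an optional value.
struct Node {
    children: Vec<Hash>,
    value: Vec<u8>,
}

fn node_hash(node: &Node) -> Hash {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    node.children.hash(&mut h);
    node.value.hash(&mut h);
    h.finish()
}

// Walk from the state root (taken from the verified block header) and check
// that every received node hashes to the reference its parent points at.
// Children missing from `received` are simply not descended into; they are
// expected to arrive in later responses.
fn verify_reachable(root: Hash, received: &HashMap<Hash, Node>, ok: &mut Vec<Hash>) -> bool {
    let Some(node) = received.get(&root) else { return true };
    if node_hash(node) != root {
        return false; // node does not match the reference pointing at it
    }
    ok.push(root);
    node.children.iter().all(|c| verify_reachable(*c, received, ok))
}
```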
Related issues
- #5053
- #5862
- #4
- #5956
Integration
This PR changes the implementation of the StateSync component of the sc-network-sync crate.
Some related interfaces are also changed and propagated to allow StateSync to import partial state into the STATE database column and mark the state as available:
- `sc-client-api`: `Backend::import_partial_state`, `BlockImportOperation::commit_complete_partial_state`
- `sc-consensus`: `BlockImport::import_partial_state`, `ImportedState`
- `sc-network`: `SyncingAction::ImportPartialState`, `ImportResult`
- `sp-trie`: fix `decode_compact`
- trait usage: `cumulus-client-consensus-aura`, `cumulus-client-consensus-common`, `sc-consensus-babe`, `sc-consensus-beefy`, `sc-consensus-grandpa`, `sc-consensus-pow`, `sc-client-db`, `sc-service`
Review Notes
- Responses with `no_proof=false` contain the original encoded trie nodes.
- Write received nodes to the database using `import_partial_state`.
- Mark the state as available, once the database contains all trie nodes, using `commit_complete_partial_state`.
- Fix decoding of a compact trie proof into a prefixed db.
- Fix the in-memory backend: store all nodes in one map, like the db backend does.
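Roughly, the pieces could fit together as in the sketch below; `PartialStateDb`, `on_state_response`, and the signatures are hypothetical stand-ins named after this PR's interfaces, not the actual sc-client-api / sc-network-sync definitions:

```rust
// Hypothetical stand-ins for illustration only.
type Hash = Vec<u8>;
type EncodedNode = Vec<u8>;

trait PartialStateDb {
    /// Write a batch of (already verified) trie nodes into the STATE column.
    fn import_partial_state(&mut self, nodes: Vec<(Hash, EncodedNode)>);
    /// Mark the block's state as available.
    fn commit_complete_partial_state(&mut self);
    /// Is a node with this hash already persisted?
    fn contains(&self, hash: &Hash) -> bool;
}

// Each response is written to the database immediately instead of being
// accumulated in memory. Assuming nodes are persisted children-first (see the
// ordering discussion further down), the root being present implies the whole
// trie is present, so the state can then be marked available.
fn on_state_response<D: PartialStateDb>(
    db: &mut D,
    state_root: &Hash,
    verified_nodes: Vec<(Hash, EncodedNode)>,
) {
    db.import_partial_state(verified_nodes);
    if db.contains(state_root) {
        db.commit_complete_partial_state();
    }
}
```

The completion check relies on the root node being written only after all of its descendants, which is exactly the ordering question discussed in the review thread below.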
Notes
- todo: Code comments and documentation?
- need-help: Not sure about the naming of types/functions/...
- need-help: Should I reuse an existing crate/trait (`ProofProvider`, `HashDb`), or where should I add it?
- assume: State sync `import` can't abort the sync due to an invalid state root or a database write error.
Checklist
- [x] My PR includes a detailed description as outlined in the "Description" and its two subsections above.
- [x] My PR follows the labeling requirements of this project (at minimum one label for `T` required)
- [ ] I have made corresponding changes to the documentation (if applicable)
- [ ] I have added tests that prove my fix is effective or that my feature works (if applicable)
- [x] I tested by state syncing one local network node from another.
- [x] I tested trie with state v1 hashed values.
- [x] I tested a trie with `:child_storage:default:` child tries.
- [x] I tested with responses containing one new key-value, syncing one key at a time to validate transitions.
- [x] I tested by state syncing Astar (astar-collator) with low memory usage
Bot Commands
/cmd label T0-node
Hey, thank you for the pull request. If you want to work on this, please check out my comment; the work for that was already started here.
> add new keys to the same state and recalculate the state root
Rebuilding the trie from key-values may change the trie structure and the root hash.
This has already happened on a live network during the StateVersion V0->V1 migration.
Before the migration, on StateVersion V0, all values were stored as inline values. When the migration began, the runtime API already reported StateVersion V1, meaning that values longer than 32 bytes should be hashed and stored in separate nodes.
But old values were still stored as inline values.
So rebuilding `"key" = "long ... value"` from the V0 node `{"prefix": "key", "value": "long ... value"}` into the V1 node `{"prefix": "key", "value": {"hash": "..."}}` would change the root hash.
The migration script iterated keys in lexicographic order and wrote them back as V1.
But other runtime pallets could insert/overwrite their keys as V1 during the migration.
So the trie was inconsistent: a mix of V0 and V1 nodes.
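To make this concrete, here is a toy sketch (invented leaf encoding and hash function, not the real sp-trie node codec) of why re-encoding the same long value under V1 produces different node bytes, and therefore a different root hash, than the V0 node still stored on disk:

```rust
use std::hash::{Hash, Hasher};

// Stand-in for a real trie hasher such as blake2-256; illustration only.
fn toy_hash(data: &[u8]) -> [u8; 8] {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    data.hash(&mut h);
    h.finish().to_le_bytes()
}

fn main() {
    let value = vec![7u8; 64]; // longer than 32 bytes

    // V0: the value bytes are inlined in the leaf node.
    let v0_leaf = [b"leaf:key:".as_slice(), value.as_slice()].concat();

    // V1: values longer than 32 bytes are replaced by their hash; the value
    // bytes live in a separate node.
    let value_hash = toy_hash(&value);
    let v1_leaf = [b"leaf:key:".as_slice(), value_hash.as_slice()].concat();

    // Different leaf bytes mean a different leaf hash, which propagates up
    // to a different root hash.
    assert_ne!(toy_hash(&v0_leaf), toy_hash(&v1_leaf));
}
```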
State sync response proofs consist of the originally encoded trie nodes with matching hashes.
Reusing them instead of rebuilding would allow syncing even an inconsistent trie.
Skipping nodes which are already stored in the db makes state sync incremental and may speed it up.
Storing received nodes of an incomplete trie in the db allows resuming sync after a process restart.
Child nodes are stored first and the root node last, to ensure that a node's dependencies already exist in the db.
During sync, some child nodes don't yet have a parent node referencing them in the db.
Their hashes may be stored in the db for garbage collection.
This PR was backported and tested on Astar, which had an issue with OOM during parachain state sync.
Logs and RAM usage were collected, and the following plot suggests that the modified state sync doesn't increase memory usage.
> State sync response proofs consist of the originally encoded trie nodes with matching hashes. Reusing them instead of rebuilding would allow syncing even an inconsistent trie.
Not sure what you are saying here?
The state proofs are already "proofs", meaning all the nodes from the storage root down to the leaves that contain the actual data. If we take these nodes and stick them directly into the db, we don't need to recalculate anything, because we get exactly the original nodes.
> This has already happened on a live network during the StateVersion V0->V1 migration.
Not sure how this is related here, as we download the nodes directly.
> Not sure what you are saying here? take these nodes and stick them directly into the db, we don't need to recalculate
Yes
> Not sure how this is related here, as we download the nodes directly.
Currently polkadot-sdk:
- Requests a proof consisting of encoded trie nodes.
- Checks node hashes with `verify_range_proof`.
- Converts nodes to key-values with `verify_range_proof`.
- Rebuilds the trie from key-values with `reset_storage`.

So the received nodes are not used directly, but rebuilt/re-encoded from key-values, which can cause problems.
> So the received nodes are not used directly, but rebuilt/re-encoded from key-values, which can cause problems.
My point being that we directly forward these trie nodes to the db, as it was already started here: https://github.com/paritytech/polkadot-sdk/pull/5956
Thanks, checked #5956 again.
I don't see the complete changes there yet:
`StateImporter` is not used yet. `import_state` still accepts `Storage` (key-values).
Your review comment suggests forwarding the proof's `PrefixedMemoryDB` to `import_state`.
But this batch shouldn't be merged into the db completely.
Also, nodes should be inserted in reverse topological order.
Example:
// Root node referencing two leaves
root1 -> leaf1, leaf2
// No nodes in db
db == []
// Request first leaf
response1 == [root1, leaf1]
// Derive next request prefix
a. Insert whole proof into db
// Insert root node and first leaf
db == [root1, leaf1]
a1. Sync continues
// Request second leaf
response2 == [root1, leaf2]
// Insert root node (duplicate) and second leaf
db == [root1, leaf1, leaf2]
// All nodes in db, sync complete
a2. Process restarts, sync restarts
// Root node is already in database,
// so sync is considered complete,
// but db is missing second leaf.
db == [root1, leaf1, (leaf2)]
b. Insert node if its dependencies are already in db
// First leaf doesn't depend on anything, insert
db == [leaf1]
// Root node still depends on second leaf, don't insert yet
b1. Sync continues
// Continue sync
b2. Process restarts, sync restarts
// Root node is not in database,
// so sync should resume
// Request log(N) branches, will skip prefixes of nodes already in db
// May cache stack of not yet inserted nodes to reduce requests after restart
response1 == [root1, leaf1]
...
// Request second leaf
response2 == [root1, leaf2]
// Second leaf doesn't depend on anything, insert
db == [leaf1, leaf2]
// Now all root dependencies are in db, insert
db == [leaf1, leaf2, root1]
// All nodes in db, sync complete
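For illustration, a minimal sketch of option (b) with toy types (`Node`, `insert_available`, and the in-memory `db` are stand-ins, not the PR's actual code): a node is written only after every node it references is persisted, so the root reaching the db implies the whole trie is in the db, even across restarts:

```rust
use std::collections::{HashMap, HashSet};

type Hash = u64;

// Toy node: just the hashes of the children it references (empty for leaves).
struct Node {
    children: Vec<Hash>,
}

// Buffer a received node and flush every pending node whose dependencies are
// met. `pending` holds nodes whose children are not all persisted yet; in a
// real implementation this buffer could be cached or re-requested after a
// restart.
fn insert_available(db: &mut HashSet<Hash>, pending: &mut HashMap<Hash, Node>, hash: Hash, node: Node) {
    pending.insert(hash, node);
    loop {
        let ready: Vec<Hash> = pending
            .iter()
            .filter(|(_, n)| n.children.iter().all(|c| db.contains(c)))
            .map(|(h, _)| *h)
            .collect();
        if ready.is_empty() {
            break;
        }
        for h in ready {
            pending.remove(&h);
            db.insert(h);
        }
    }
}

fn main() {
    let (leaf1, leaf2, root1) = (1, 2, 3);
    let mut db = HashSet::new();
    let mut pending = HashMap::new();

    // response1 == [root1, leaf1]: leaf1 is written, root1 stays pending.
    insert_available(&mut db, &mut pending, leaf1, Node { children: vec![] });
    insert_available(&mut db, &mut pending, root1, Node { children: vec![leaf1, leaf2] });
    assert!(db.contains(&leaf1) && !db.contains(&root1));

    // response2 == [root1, leaf2]: leaf2 completes root1's dependencies,
    // so root1 is flushed too and the sync can be considered complete.
    insert_available(&mut db, &mut pending, leaf2, Node { children: vec![] });
    assert!(db.contains(&root1));
}
```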
If there are problems with syncing child storage,
they may be related to a missing child_storage_root_hash prefix in the db key.
db[nibble_prefix + node_hash] = node
db[child_storage_root_hash + nibble_prefix + node_hash] = child_storage_node
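For illustration only, the key layout above could be built like this (the actual sc-client-db layout and prefix derivation may differ):

```rust
// Illustrative only: concatenating the prefixes described above into db keys.
fn top_trie_key(nibble_prefix: &[u8], node_hash: &[u8]) -> Vec<u8> {
    [nibble_prefix, node_hash].concat()
}

fn child_trie_key(child_storage_root_hash: &[u8], nibble_prefix: &[u8], node_hash: &[u8]) -> Vec<u8> {
    // The extra keyspace prefix keeps identical nodes of different child
    // tries from colliding in the same column.
    [child_storage_root_hash, nibble_prefix, node_hash].concat()
}
```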
> But this batch shouldn't be merged into the db completely.
Why? All the nodes that are part of the proof are part of the original trie. Why should we not merge all these nodes into the db?
> Also, nodes should be inserted in reverse topological order.
I don't get why the order is important here. Again, I'm basically just saying that we write the nodes with H(Node) => Node into the backend.
Why not merge all? Why is the order important?
The example shows that merging the whole proof may break the database.
In that example there are three nodes: the root node with hash `root1`, and its two child leaves with hashes `leaf1` and `leaf2`.
When state sync starts, `root1` is not yet in the database, so the sync is not complete.
After receiving the first proof with the `root1` and `leaf1` nodes, they may be merged into the db.
- (whole proof) After merging the whole first proof, the db contains `root1` and `leaf1`. If the process restarts after merging the first proof, it will see that `root1` is already in the db, and consider state sync completed. But it didn't receive `leaf2` yet, so the db doesn't contain the whole trie.
- (order) If the process stops between inserting the `root1` and `leaf1` nodes, the db would contain `root1`, but not `leaf1`. Again, if `root1` is in the db, state sync is considered complete.

I assume that state sync doesn't recurse into a branch if that branch hash is already in the db, i.e. the db already contains the whole subtree under that branch.
> If the process restarts after merging the first proof, it will see that `root1` is already in the db, and consider state sync completed.
We can just store that we did not yet finish the state sync. Right now we don't support a restart anyway.