
State sync write db

Open turuslan opened this issue 5 months ago • 10 comments

Description

Currently state sync accumulates all key-values in memory and only writes them to disk at the end. This increases memory usage and can cause an out-of-memory crash. Re-encoding key-values into trie nodes may also be inconsistent (e.g. during the state v0 to v1 migration). State sync responses in the no_proof=false mode contain the original encoded trie nodes, which can be recursively verified against the state root hash from the block header and written to the database.
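
A minimal sketch of that write path, assuming a simple hash-keyed node column; this is illustrative and not the PR's actual code. Reachability from the block header's state root (the recursive part of the verification) is left out here.

```rust
use std::collections::HashMap;

use sp_core::hashing::blake2_256;

type NodeHash = [u8; 32];

fn import_response_nodes(
    db: &mut HashMap<NodeHash, Vec<u8>>,
    nodes: impl IntoIterator<Item = (NodeHash, Vec<u8>)>,
) {
    for (claimed_hash, encoded) in nodes {
        // Accept a node only if its bytes hash to the claimed hash.
        if blake2_256(&encoded) == claimed_hash {
            // Skipping nodes already in the db keeps the import incremental.
            db.entry(claimed_hash).or_insert(encoded);
        }
    }
}
```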

Related issues

  • #5053
  • #5862
  • #4
  • #5956

Integration

This PR changes the implementation of the StateSync component of the sc-network-sync crate. Some related interfaces are also changed and propagated, to allow StateSync to import partial state into the STATE database column and to mark the state as available (a rough sketch of the resulting interface follows the list):

  • sc-client-api: Backend.import_partial_state, BlockImportOperation.commit_complete_partial_state
  • sc-consensus: BlockImport.import_partial_state, ImportedState
  • sc-network: SyncingAction::ImportPartialState, ImportResult
  • sp-trie: fix decode_compact
  • trait usage: cumulus-client-consensus-aura, cumulus-client-consensus-common, sc-consensus-babe, sc-consensus-beefy, sc-consensus-grandpa, sc-consensus-pow, sc-client-db, sc-service
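
The trait surface below is only a rough sketch with assumed names and simplified signatures; the actual definitions added by this PR live in sc-client-api and sc-consensus and may look different.

```rust
type BlockHash = [u8; 32];
type NodeHash = [u8; 32];

/// Trie nodes exactly as received in state-sync responses.
pub struct PartialState {
    pub block: BlockHash,
    pub nodes: Vec<(NodeHash, Vec<u8>)>,
}

/// Hypothetical shape of the partial-state import surface.
pub trait PartialStateImport {
    type Error;

    /// Write received trie nodes into the STATE column.
    /// The state is not yet usable after this call.
    fn import_partial_state(&self, state: PartialState) -> Result<(), Self::Error>;

    /// Mark the state as available once the database contains every node
    /// of the target trie.
    fn commit_complete_partial_state(&mut self, block: BlockHash) -> Result<(), Self::Error>;
}
```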

Review Notes

  • no_proof=false responses contain the original encoded trie nodes.
  • Write received nodes to the database using import_partial_state.
  • Mark the state as available, once the database contains all trie nodes, using commit_complete_partial_state.
  • Fix decoding a compact trie proof into a prefixed db.
  • Fix the in-memory backend to store all nodes in one map, like the db backend does.

Notes

  • todo: Code comments and documentation?
  • need-help: Not sure about the naming of types/functions/...
  • need-help: Should I reuse an existing crate/trait for ProofProviderHashDb, or where should I add it?
  • assume: State sync import can't abort the sync due to an invalid state root or a database write error.

Checklist

  • [x] My PR includes a detailed description as outlined in the "Description" and its two subsections above.
  • [x] My PR follows the labeling requirements of this project (at minimum one label for T required)
  • [ ] I have made corresponding changes to the documentation (if applicable)
  • [ ] I have added tests that prove my fix is effective or that my feature works (if applicable)
    • [x] I tested by state syncing one local network node from another.
      • [x] I tested trie with state v1 hashed values.
      • [x] I tested trie with :child_storage:default:.
      • [x] I tested with responses containing one new key-value, syncing one key at a time to validate transitions.
    • [x] I tested by state syncing Astar (astar-collator) with low memory usage

Bot Commands

/cmd label T0-node

turuslan avatar Jul 17 '25 11:07 turuslan

Hey, thank you for the pull request. If you want to work on this, please check out my comment and the work for that was already started here.

bkchr avatar Jul 18 '25 10:07 bkchr

Hey, thank you for the pull request. If you want to work on this, please check out my comment and the work for that was already started here.

add new keys to the same state and recalculate the state root

Rebuilding the trie from key-values may change the trie structure and the root hash.

This has already happened on a live network, during the StateVersion V0->V1 migration. Before the migration, on StateVersion V0, all values were stored inline. Once the migration began, the runtime API already reported StateVersion V1, meaning that values longer than 32 bytes should be hashed and stored in separate nodes, but old values were still stored inline. So rebuilding "key"="long ... value" from the V0 node {"prefix":"key","value":"long ... value"} into the V1 node {"prefix":"key","value":{"hash":"..."}} would change the root hash. The migration script iterated keys in lexicographic order and wrote them back as V1, but other runtime pallets could insert or overwrite their keys as V1 during the migration, so the trie was inconsistent.
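
A toy illustration of the V1 rule, assuming the 32-byte threshold mentioned above (this is not sp-trie code): a value longer than 32 bytes is replaced in the leaf by its hash, so re-encoding an old V0 leaf under V1 produces different bytes and therefore a different root.

```rust
use sp_core::hashing::blake2_256;

/// How a leaf's value part is represented.
enum LeafValue {
    /// StateVersion V0, or short values under V1: value bytes are inline.
    Inline(Vec<u8>),
    /// StateVersion V1, values longer than 32 bytes: only the hash is kept
    /// in the leaf, the value itself lives in a separate node.
    Hashed([u8; 32]),
}

fn leaf_value_v1(value: Vec<u8>) -> LeafValue {
    if value.len() > 32 {
        LeafValue::Hashed(blake2_256(&value))
    } else {
        LeafValue::Inline(value)
    }
}
```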

State sync response proofs are the originally encoded trie nodes, with matching hashes. Reusing them instead of rebuilding would allow syncing even an inconsistent trie.

Ignoring nodes that are already stored in the db makes state sync incremental and may speed it up. Storing received nodes of an incomplete trie in the db allows resuming the sync after the process restarts. Child nodes are stored first and the root node last, to ensure that a node's dependencies already exist in the db when it is written.

During sync, some child nodes don't yet have a parent node referencing them in the db. Their hashes may be stored in the db for garbage collection.
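
A minimal sketch of that ordering rule, assuming a hypothetical `children_of` decoder (passed in as a closure here; it is not part of the PR) that returns the child hashes an encoded node references:

```rust
use std::collections::HashMap;

type NodeHash = [u8; 32];

/// Illustrative only: write a node once everything it references is already
/// in the db, so the root goes in last and its presence implies the whole
/// subtree is present. Returns whether the subtree rooted at `node` is now
/// fully stored.
fn insert_children_first(
    db: &mut HashMap<NodeHash, Vec<u8>>,
    proof: &HashMap<NodeHash, Vec<u8>>,
    children_of: &impl Fn(&[u8]) -> Vec<NodeHash>,
    node: NodeHash,
) -> bool {
    if db.contains_key(&node) {
        return true; // already stored, so the whole subtree is assumed present
    }
    let Some(encoded) = proof.get(&node) else {
        return false; // not received yet, the parent has to wait
    };
    let mut complete = true;
    for child in children_of(encoded.as_slice()) {
        if !insert_children_first(db, proof, children_of, child) {
            complete = false;
        }
    }
    if complete {
        db.insert(node, encoded.clone());
    }
    complete
}
```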

turuslan avatar Sep 11 '25 11:09 turuslan

This PR was backported and tested on Astar, which had an issue with OOM during parachain state sync.
Logs and RAM usage were collected, and the following plot suggests that the modified state sync doesn't increase memory usage.

(plot: RAM usage during Astar state sync)

turuslan avatar Sep 11 '25 12:09 turuslan

State sync response proofs are the originally encoded trie nodes, with matching hashes. Reusing them instead of rebuilding would allow syncing even an inconsistent trie.

Not sure what you are saying here?

The state proofs are already "proofs", meaning all the nodes from the storage root down to the leaves that contain the actual data. If we take these nodes and stick them directly into the db, we don't need to recalculate or do anything else, because we get exactly these nodes.

This has already happened on a live network, during the StateVersion V0->V1 migration.

Not sure how this is related here, as we download the nodes directly.

bkchr avatar Sep 11 '25 13:09 bkchr

Not sure what you are saying here? take these nodes and stick them directly into the db, we don't need to recalculate

Yes

Not sure how this is related here, as we download the nodes directly.

Currently polkadot-sdk:

  1. Requests proof consisting of encoded trie nodes.
  2. Checks node hashes with verify_range_proof.
  3. Converts nodes to key-value with verify_range_proof.
  4. Rebuilds trie from key-values with reset_storage.

So the received nodes are not used, but rebuilt/re-encoded from key-values, which can cause problems.

turuslan avatar Sep 11 '25 14:09 turuslan

So the received nodes are not used, but rebuilt/re-encoded from key-values, which can cause problems.

My point being that we directly forward these trie nodes to the db, as it was already started here: https://github.com/paritytech/polkadot-sdk/pull/5956

bkchr avatar Sep 11 '25 20:09 bkchr

Thanks, I checked #5956 again.

I don't see the complete changes there yet:

  • StateImporter is not used yet.
  • import_state still accepts Storage (key-value).

Your review comment suggests forwarding the proof PrefixedMemoryDB to import_state. But this batch shouldn't be merged into the db completely. Also, nodes should be inserted in reverse topological order. Example:

// Root node referencing two leaves
root1 -> leaf1, leaf2
// No nodes in db
db == []

// Request first leaf
response1 == [root1, leaf1]
// Derive next request prefix

a. Insert whole proof into db
  // Insert root node and first leaf
  db == [root1, leaf1]

  a1. Sync continues
    // Request second leaf
    response2 == [root1, leaf2]

    // Insert root node (duplicate) and second leaf
    db == [root1, leaf1, leaf2]

    // All nodes in db, sync complete

  a2. Process restarts, sync restarts
    // Root node is already in database,
    // so sync is considered complete,
    // but db is missing second leaf.
    db == [root1, leaf1, (leaf2)]

b. Insert node if its dependencies are already in db
  // First leaf doesn't depend on anything, insert
  db == [leaf1]
  // Root node still depends on second leaf, don't insert yet

  b1. Sync continues
    // Continue sync

  b2. Process restarts, sync restarts
    // Root node is not in database,
    // so sync should resume

    // Request log(N) branches, will skip prefixes of nodes already in db
    // May cache stack of not yet inserted nodes to reduce requests after restart
    response1 == [root1, leaf1]
    ...

  // Request second leaf
  response2 == [root1, leaf2]

  // Second leaf doesn't depend on anything, insert
  db == [leaf1, leaf2]
  // Now all root dependencies are in db, insert
  db == [leaf1, leaf2, root1]

  // All nodes in db, sync complete

If there are problems with syncing child storage, they may be related to a missing child_storage_root_hash prefix in the db key.

db[nibble_prefix + node_hash] = node
db[child_storage_root_hash + nibble_prefix + node_hash] = child_storage_node
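
A sketch of that key layout, under the assumption that the child storage root is simply an extra outer prefix; the exact sc-client-db layout may differ.

```rust
type NodeHash = [u8; 32];

/// Top-level trie node key: position prefix followed by the node hash.
fn prefixed_key(nibble_prefix: &[u8], node_hash: &NodeHash) -> Vec<u8> {
    let mut key = Vec::with_capacity(nibble_prefix.len() + node_hash.len());
    key.extend_from_slice(nibble_prefix);
    key.extend_from_slice(node_hash);
    key
}

/// Child-trie node key: the child storage root is prepended so child nodes
/// don't collide with top-level nodes that share a position and hash.
fn child_prefixed_key(
    child_storage_root: &NodeHash,
    nibble_prefix: &[u8],
    node_hash: &NodeHash,
) -> Vec<u8> {
    let mut key =
        Vec::with_capacity(child_storage_root.len() + nibble_prefix.len() + node_hash.len());
    key.extend_from_slice(child_storage_root);
    key.extend_from_slice(nibble_prefix);
    key.extend_from_slice(node_hash);
    key
}
```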

turuslan avatar Sep 12 '25 06:09 turuslan

But this batch shouldn't be merged into the db completely.

Why? All the nodes that are part of the proof are part of the original trie. Why should we not merge all these nodes into the db?

Also nodes should be inserted in reverse topological order.

I don't get why the order is important here. Again, I'm basically just saying that we write the nodes with H(Node) => Node into the backend.

bkchr avatar Sep 12 '25 07:09 bkchr

Why not merge all? Why is the order important?

The example shows that merging the whole proof may break the database.

In that example there are three nodes: the root node with hash root1, and its two child leaves with hashes leaf1 and leaf2. When state sync starts, root1 is not yet in the database, so the sync is not complete. After receiving the first proof with the root1 and leaf1 nodes, they may be merged into the db.

  • (whole proof) After merging the whole first proof, the db contains root1 and leaf1. If the process restarts after merging the first proof, it will see that root1 is already in the db and consider the state sync completed. But it hasn't received leaf2 yet, so the db doesn't contain the whole trie.

  • (order) If the process stops between inserting the root1 and leaf1 nodes, the db would contain root1 but not leaf1. Again, if root1 is in the db, the state sync is considered complete.

I assume that state sync doesn't recurse into a branch if that branch's hash is already in the db, i.e. the db contains the whole subtree under that branch.

turuslan avatar Sep 12 '25 08:09 turuslan

  • If the process restarts after merging the first proof, it will see that root1 is already in the db and consider the state sync completed.

We can just store that we did not yet finish the state sync. Right now we don't support a restart anyway.
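
One way to record that, sketched under assumptions (the key name and column are made up, not the actual sc-client-db schema): keep an explicit in-progress marker next to the node data, so a restart can distinguish "root present because the sync finished" from "root present but the trie is incomplete".

```rust
use std::collections::HashMap;

/// Hypothetical meta key; not the real sc-client-db schema.
const STATE_SYNC_IN_PROGRESS: &[u8] = b"state_sync_in_progress";

fn begin_state_sync(meta: &mut HashMap<Vec<u8>, Vec<u8>>, target_root: [u8; 32]) {
    meta.insert(STATE_SYNC_IN_PROGRESS.to_vec(), target_root.to_vec());
}

fn finish_state_sync(meta: &mut HashMap<Vec<u8>, Vec<u8>>) {
    // Only removed once every node of the target trie is in the db.
    meta.remove(STATE_SYNC_IN_PROGRESS);
}
```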

bkchr avatar Sep 12 '25 19:09 bkchr