Optimize runtime iterative state sync

Open martintomazic opened this issue 6 months ago • 2 comments

Currently, it takes days (Sapphire, mainnet) for iterative state sync to finish.

State sync should be benchmarked and optimized.

Possible bottlenecks:

  1. Default fetcher pool has 4 workers (see).
    • Possibly increase this; the number of peers available for diff fetching could also be a bottleneck.
  2. Fetching, applying diffs and finalizing versions (see) are three completely independent processes.
    • Currently this is not implemented optimally and processes may block each other.
    • See original discussion here.

martintomazic, Jun 27 '25 14:06

Update:

I postponed this a bit due to more important work on pruning/badger space reclamation. I will start moving this forward in the coming days, especially the first PR, which is already quite mature and introduces useful fixes.

Plan of attack

  1. #6306
    • Pass context explicitly, which exposed some leaking resources.
    • Fixes: panicking, deadlocking on clean-up, triggering the genesis checkpoint too early.
    • Status: Needs a rebase and minor fixes.
  2. #6308
    • Motivation: SRP (single-responsibility principle), which simplifies parallelization and enables testing in isolation.
    • Design decision:
      • Granular packages or internal workers? See https://github.com/oasisprotocol/oasis-core/pull/6308#issuecomment-3229639539
      • Should we write a test suite for every extracted worker?
      • Maybe PR per extracted worker, especially if we decide to write test suites.
    • Status: WIP, needs rebase, POC for now.
  3. #6307
    • https://github.com/oasisprotocol/oasis-core/pull/6299#discussion_r2292274149
    • Status: Heavy WIP.
  4. #6241 - Optimize.
    • We could start working on this directly after 2.

Misc: Once checkpoint sync is finished, we should close the checkpoint sync p2p client and the manager that keeps tracking peers of the corresponding protocol, as 1. this is a resource leak and 2. it unnecessarily overwhelms the p2p network.

martintomazic, Sep 09 '25 13:09

This task is currently on hold, but for future reference:

  1. https://github.com/oasisprotocol/oasis-core/issues/6356 would probably be the most impactful.
  2. If 1. is not enough, runtime state sync should be benchmarked, the fetcher count possibly increased, and as a last resort we could consider optimizing the syncing part.

https://github.com/oasisprotocol/oasis-core/issues/6307 seems like a prerequisite for both.

martintomazic, Nov 13 '25 23:11