feat(state-keeper): implement asynchronous RocksDB cache

Open itegulov opened this issue 1 year ago • 2 comments

What ❔

This PR adds a stateful storage abstraction, CachedStorage, that encapsulates both RocksDB and Postgres behind short-lived Box<dyn StorageRead> connections. Internally, CachedStorage keeps track of whether RocksDB is behind Postgres and, if so, tries to catch it up in the background.
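To make the idea concrete, here is a minimal sketch of the selection logic described above. It is not the code from this PR: the trait body, the connection types, the block-number fields and the method names are all assumptions made for illustration, and catch-up is modeled as a synchronous step rather than a background task.

```rust
/// Hypothetical stand-in for the read trait named in the description.
pub trait StorageRead {
    /// Reports which backend serves this connection (illustration only).
    fn backend(&self) -> &'static str;
}

/// Hypothetical short-lived connection types.
pub struct PostgresConnection;
pub struct RocksdbConnection;

impl StorageRead for PostgresConnection {
    fn backend(&self) -> &'static str {
        "postgres"
    }
}

impl StorageRead for RocksdbConnection {
    fn backend(&self) -> &'static str {
        "rocksdb"
    }
}

/// Assumed shape: tracks the latest block known to each backend.
pub struct CachedStorage {
    postgres_block: u32,
    rocksdb_block: u32,
}

impl CachedStorage {
    pub fn new(postgres_block: u32, rocksdb_block: u32) -> Self {
        Self { postgres_block, rocksdb_block }
    }

    /// RocksDB is behind if it has seen fewer blocks than Postgres.
    pub fn is_rocksdb_behind(&self) -> bool {
        self.rocksdb_block < self.postgres_block
    }

    /// Applies one missing block to the cache; a no-op once caught up.
    pub fn catch_up_one_block(&mut self) {
        if self.is_rocksdb_behind() {
            self.rocksdb_block += 1;
        }
    }

    /// Hands out a Postgres-backed connection while RocksDB lags,
    /// and a RocksDB-backed one once it has caught up.
    pub fn access_storage(&self) -> Box<dyn StorageRead> {
        if self.is_rocksdb_behind() {
            Box::new(PostgresConnection)
        } else {
            Box::new(RocksdbConnection)
        }
    }
}
```

With `CachedStorage::new(3, 1)`, `access_storage()` returns the Postgres-backed connection until `catch_up_one_block()` has been called twice, after which the RocksDB-backed one is returned.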

Apologies if the code is a little hairy, I tried to preserve the existing invariants as much as possible to limit the impact of this on other parts of the codebase.

Tested this locally by running load tests for 3 hours (~3k miniblocks), deleting RocksDB state and then restarting. Seems to be working fine, but it catches up really fast and I am not sure how to generate more data faster. So I am hoping to test this on testnet/staging with one of the external nodes.

I am also trying to come up with a way to unit test this, so that is WIP for now. In the meantime, any feedback is welcome!

Note: a metric for RocksDB-Postgres lag already exists.

Why ❔

The main design goal here was to ensure liveness of the system by providing a way to get a ReadStorage implementation while avoiding blocking operations as much as possible. If RocksDB ever gets wiped for whatever reason, State Keeper will keep working on Postgres-backed StorageRead connections for a few hours while CachedStorage tries to catch RocksDB up.
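The non-blocking aspect can be sketched as follows. This is only an illustration under stated assumptions, not the PR's implementation: the `CatchUp` type, the `Backend` enum and the use of a plain OS thread with an atomic counter are hypothetical, and a simple loop stands in for the real work of replaying blocks into RocksDB.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Which backend a caller would be served from right now (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Postgres,
    Rocksdb,
}

/// Hypothetical catch-up tracker shared between callers and a worker thread.
pub struct CatchUp {
    rocksdb_block: Arc<AtomicU32>,
    target_block: u32,
}

impl CatchUp {
    pub fn new(rocksdb_block: u32, target_block: u32) -> Self {
        Self {
            rocksdb_block: Arc::new(AtomicU32::new(rocksdb_block)),
            target_block,
        }
    }

    /// Never blocks: only reads the progress counter and decides.
    pub fn current_backend(&self) -> Backend {
        if self.rocksdb_block.load(Ordering::SeqCst) < self.target_block {
            Backend::Postgres
        } else {
            Backend::Rocksdb
        }
    }

    /// Starts the catch-up worker; callers keep using Postgres meanwhile.
    pub fn spawn_catch_up(&self) -> JoinHandle<()> {
        let progress = Arc::clone(&self.rocksdb_block);
        let target = self.target_block;
        thread::spawn(move || {
            // Stand-in for replaying one block at a time into RocksDB.
            while progress.load(Ordering::SeqCst) < target {
                progress.fetch_add(1, Ordering::SeqCst);
            }
        })
    }
}
```

The point of the design is that `current_backend()` returns immediately in both states, so a caller never waits on the catch-up itself.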

Checklist

  • [x] PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • [ ] Tests for the changes have been added / updated.
  • [x] Documentation comments have been added / updated.
  • [x] Code has been formatted via zk fmt and zk lint.
  • [x] Spellcheck has been run via zk spellcheck.
  • [x] Linkcheck has been run via zk linkcheck.

itegulov avatar Feb 27 '24 05:02 itegulov

@RomanBrodetski @slowli I have taken your comments into consideration and refactored the code accordingly. Specifically, I am now operating under the presumption that RocksDB can never fall more than one block behind after catching up, which indeed simplified a lot of things.

I have also made the solution a bit more generic by introducing a trait, ReadStorageFactory, that regulates which ReadStorage implementations can be returned. This is good for testing: I made it so that we test some features specifically on Postgres-based storage or specifically on RocksDB-based storage. It also allowed me to write a test that mimics RocksDB falling behind by executing a bunch of txs on Postgres first. Hopefully it did not make the PR too overengineered :)
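For readers unfamiliar with the pattern, a factory trait of this kind could look roughly like the sketch below. The trait signatures, the in-memory `MapStorage`, and the two factory types are assumptions for illustration only; the actual trait in the PR may differ (for example, it may be async).

```rust
use std::collections::HashMap;

/// Hypothetical minimal read interface.
pub trait ReadStorage {
    fn read_value(&self, key: u64) -> Option<u64>;
    fn backend(&self) -> &'static str;
}

/// Assumed shape of the factory: decides which ReadStorage callers get.
pub trait ReadStorageFactory {
    fn access_storage(&self) -> Box<dyn ReadStorage>;
}

/// In-memory stand-in for a real backend, tagged with a backend name.
struct MapStorage {
    backend: &'static str,
    values: HashMap<u64, u64>,
}

impl ReadStorage for MapStorage {
    fn read_value(&self, key: u64) -> Option<u64> {
        self.values.get(&key).copied()
    }

    fn backend(&self) -> &'static str {
        self.backend
    }
}

/// Test factory that always hands out "Postgres" storage.
pub struct PostgresOnlyFactory {
    pub values: HashMap<u64, u64>,
}

/// Test factory that always hands out "RocksDB" storage.
pub struct RocksdbOnlyFactory {
    pub values: HashMap<u64, u64>,
}

impl ReadStorageFactory for PostgresOnlyFactory {
    fn access_storage(&self) -> Box<dyn ReadStorage> {
        Box::new(MapStorage { backend: "postgres", values: self.values.clone() })
    }
}

impl ReadStorageFactory for RocksdbOnlyFactory {
    fn access_storage(&self) -> Box<dyn ReadStorage> {
        Box::new(MapStorage { backend: "rocksdb", values: self.values.clone() })
    }
}

/// Code under test depends only on the factory, not on a concrete backend.
pub fn read_through(factory: &dyn ReadStorageFactory, key: u64) -> (&'static str, Option<u64>) {
    let storage = factory.access_storage();
    (storage.backend(), storage.read_value(key))
}
```

A test can then run the same assertions against `PostgresOnlyFactory` and `RocksdbOnlyFactory` by passing each to `read_through`, which is the kind of backend-specific testing the comment describes.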

PTAL when you can!

itegulov avatar Mar 06 '24 08:03 itegulov

No performance difference detected (anymore)

github-actions[bot] avatar Mar 26 '24 22:03 github-actions[bot]