zksync-era
feat(state-keeper): implement asynchronous RocksDB cache
What ❔
This PR adds a stateful storage abstraction, `CachedStorage`, that encapsulates both RocksDB and Postgres behind short-lived `Box<dyn ReadStorage>` connections. Internally, `CachedStorage` keeps track of whether RocksDB is behind Postgres and, if so, tries to catch it up in the background.
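To make the idea concrete, here is a minimal synchronous sketch of the behaviour described above. It is not the code from this PR: the reduced `ReadStorage` trait, the `Snapshot`, `Postgres` and `Rocksdb` structs, and the `access_storage` / `catch_up_one_block` methods are all hypothetical stand-ins, and the background catch-up task is modelled as an explicit method call.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily reduced stand-in for the `ReadStorage` trait.
trait ReadStorage {
    fn read_value(&self, key: u64) -> Option<u64>;
    /// Backend name, exposed only to make the fallback visible in this sketch.
    fn backend(&self) -> &'static str;
}

/// Short-lived read-only view handed out to the caller.
struct Snapshot {
    backend: &'static str,
    data: HashMap<u64, u64>,
}

impl ReadStorage for Snapshot {
    fn read_value(&self, key: u64) -> Option<u64> {
        self.data.get(&key).copied()
    }

    fn backend(&self) -> &'static str {
        self.backend
    }
}

/// Source of truth: storage writes grouped by block (index = block number).
struct Postgres {
    blocks: Vec<Vec<(u64, u64)>>,
}

/// Cache that may lag behind: `next_block` is the first block not yet applied.
struct Rocksdb {
    next_block: usize,
    data: HashMap<u64, u64>,
}

struct CachedStorage {
    postgres: Postgres,
    rocksdb: Rocksdb,
}

impl CachedStorage {
    fn rocksdb_is_behind(&self) -> bool {
        self.rocksdb.next_block < self.postgres.blocks.len()
    }

    /// Never waits for RocksDB: falls back to a Postgres-backed view while it lags.
    fn access_storage(&self) -> Box<dyn ReadStorage> {
        if self.rocksdb_is_behind() {
            // Replay all blocks in order so later writes overwrite earlier ones.
            let mut data = HashMap::new();
            for block in &self.postgres.blocks {
                data.extend(block.iter().copied());
            }
            Box::new(Snapshot { backend: "postgres", data })
        } else {
            Box::new(Snapshot { backend: "rocksdb", data: self.rocksdb.data.clone() })
        }
    }

    /// One catch-up step; a real implementation would run this in the background.
    fn catch_up_one_block(&mut self) {
        if let Some(block) = self.postgres.blocks.get(self.rocksdb.next_block) {
            self.rocksdb.data.extend(block.iter().copied());
            self.rocksdb.next_block += 1;
        }
    }
}
```

In this sketch, callers only ever see a `Box<dyn ReadStorage>`, so switching from the Postgres-backed view to the RocksDB-backed one once the cache has caught up requires no change on their side.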
Apologies if the code is a little hairy, I tried to preserve the existing invariants as much as possible to limit the impact of this on other parts of the codebase.
Tested this locally by running load tests for 3 hours (~3k miniblocks), deleting RocksDB state and then restarting. Seems to be working fine, but it catches up really fast and I am not sure how to generate more data faster. So I am hoping to test this on testnet/staging with one of the external nodes.
I am also trying to come up with a way to unit test this, so that part is WIP for now. In the meantime, any feedback is welcome!
Note: metric for RocksDB-Postgres lag already exists
Why ❔
The main design goal here was to ensure liveness of the system by providing a way to get a `ReadStorage` implementation while avoiding blocking operations as much as possible. If RocksDB ever gets wiped for whatever reason, State Keeper will keep working on Postgres-backed `ReadStorage` connections for a few hours while `CachedStorage` is trying to catch RocksDB up.
Checklist
- [x] PR title corresponds to the body of PR (we generate changelog entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.
- [x] Spellcheck has been run via `zk spellcheck`.
- [x] Linkcheck has been run via `zk linkcheck`.
@RomanBrodetski @slowli I have taken your comments into consideration and refactored the code accordingly. Specifically, I am now operating under the assumption that RocksDB can never fall more than one block behind after catching up, which indeed simplified a lot of things.
I have also made the solution a bit more generic by introducing a trait, `ReadStorageFactory`, that regulates which `ReadStorage` implementations can be returned. This is good for testing: I made it so that we test some features specifically on Postgres-based storage or specifically on RocksDB-based storage. It also allowed me to write a test that mimics RocksDB falling behind by executing a bunch of txs on Postgres first. Hopefully it did not make the PR too overengineered :)
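For anyone skimming, the shape of the idea is roughly the following. This is a simplified illustration, not the actual trait from the PR: the one-method `ReadStorage`, the synchronous `access_storage` signature, and the `PostgresFactory`, `RocksdbFactory` and `read_or_default` names are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical, reduced stand-in for the `ReadStorage` trait.
trait ReadStorage {
    fn read_value(&self, key: u64) -> Option<u64>;
}

/// In-memory storage used by both factories in this sketch.
struct MapStorage(HashMap<u64, u64>);

impl ReadStorage for MapStorage {
    fn read_value(&self, key: u64) -> Option<u64> {
        self.0.get(&key).copied()
    }
}

/// Hypothetical shape of the factory: it decides which `ReadStorage` is handed out.
trait ReadStorageFactory {
    fn access_storage(&self) -> Box<dyn ReadStorage>;
}

/// Always hands out Postgres-like storage (assumed test helper).
struct PostgresFactory {
    rows: HashMap<u64, u64>,
}

/// Always hands out RocksDB-like storage (assumed test helper).
struct RocksdbFactory {
    cache: HashMap<u64, u64>,
}

impl ReadStorageFactory for PostgresFactory {
    fn access_storage(&self) -> Box<dyn ReadStorage> {
        Box::new(MapStorage(self.rows.clone()))
    }
}

impl ReadStorageFactory for RocksdbFactory {
    fn access_storage(&self) -> Box<dyn ReadStorage> {
        Box::new(MapStorage(self.cache.clone()))
    }
}

/// Code under test depends only on the factory trait, not on a concrete backend.
fn read_or_default(factory: &dyn ReadStorageFactory, key: u64) -> u64 {
    factory.access_storage().read_value(key).unwrap_or(0)
}
```

A test can then pin the backend by passing one factory or the other, or model a lagging RocksDB by giving `PostgresFactory` writes that `RocksdbFactory` has not seen yet.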
PTAL when you can!