espresso-sequencer
Test framework for restartability
Test framework
Set up some Rust automation for tests that spin up a sequencer network and restart various combinations of nodes, checking that we recover liveness. Instantiate the framework with several combinations of nodes as outlined in https://www.notion.so/espressosys/Persistence-catchup-and-restartability-cf4ddb79df2e41a993e60e3beaa28992.
There are many things left to test here, including:
- Test all nodes restarting at the same time. This is the one case that doesn't work yet: it requires all nodes to store state, which depends on https://github.com/EspressoSystems/hotshot-query-service/issues/664
- Checking integrity of the DA/query service during and after restart
These can be done in follow-up work.
I considered doing this with something more dynamic like Bash or Python scripting, leaning on our existing docker-compose or process-compose infrastructure to spin up a network. I avoided this for a few reasons:
- process-compose is annoying to script and in particular has limited capabilities for shutting down and starting up processes
- both docker-compose and process-compose make it hard to dynamically choose the network topology
- once the basic test infrastructure is out of the way, Rust is far easier to work with for writing new checks and assertions. For example, checking for progress is way easier when we can plug directly into the HotShot event stream, vs subscribing to some stream via HTTP and parsing responses with jq
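To illustrate the last point, here is a minimal sketch of what a liveness check against an in-process event stream could look like. It is not the code in this PR: the `Event` enum and `wait_for_progress` function are hypothetical, and a standard library channel stands in for the HotShot event stream.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a consensus event; not the real HotShot event type.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    /// A block at `height` was decided.
    Decide { height: u64 },
    /// Any other event we do not care about for liveness.
    Other,
}

/// Wait until at least `count` blocks past `start_height` have been decided,
/// or fail once `timeout` has elapsed. Returns the latest decided height.
pub fn wait_for_progress(
    events: &Receiver<Event>,
    start_height: u64,
    count: u64,
    timeout: Duration,
) -> Result<u64, String> {
    let deadline = Instant::now() + timeout;
    let target = start_height + count;
    let mut latest = start_height;
    while latest < target {
        // Time left before the overall deadline (zero if already past it).
        let remaining = deadline.saturating_duration_since(Instant::now());
        match events.recv_timeout(remaining) {
            Ok(Event::Decide { height }) => latest = latest.max(height),
            Ok(Event::Other) => {}
            Err(RecvTimeoutError::Timeout) => {
                return Err(format!("stalled at height {}, wanted {}", latest, target));
            }
            Err(RecvTimeoutError::Disconnected) => {
                return Err("event stream closed".to_string());
            }
        }
    }
    Ok(latest)
}
```

A restart test could record the decided height before shutting nodes down, restart them, and then call something like `wait_for_progress` to assert that the network decides new blocks again.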
Storage refactor
The tests where all nodes restart together exposed an issue. There was a race condition where DA nodes may decide on a block, but shut down before they have updated their query storage. After the restart, they will have lost the corresponding block payload and VID info from the in-memory HotShot data structures, and they will never get it back.
This PR refactors the way query storage is updated: instead of having separate, unsynchronized event handling tasks for updating consensus storage (e.g. storing DA proposals) and updating query storage, we now have just a single task for populating consensus storage, and query storage is populated from consensus storage. This means we no longer rely at all on in-memory payload storage in order to populate our query service. This makes DA certs much more meaningful, since DA votes are conditioned on successfully storing the DA proposal in consensus storage. We can also now remove the PayloadStore from HotShot.
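The idea can be sketched with a toy in-memory model. The `Storage` type and its methods below are hypothetical names for illustration only and do not mirror the actual traits in this PR. The sketch assumes payloads are keyed by view number and that population is idempotent, so it can simply be re-run after a restart.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of a node's persistent storage (survives restarts).
#[derive(Default)]
pub struct Storage {
    /// Consensus storage: DA proposals keyed by view number.
    pub consensus: BTreeMap<u64, Vec<u8>>,
    /// Query storage: payloads of decided views.
    pub query: BTreeMap<u64, Vec<u8>>,
}

impl Storage {
    /// Persist a DA proposal. In this model a node would only vote after
    /// this call has succeeded.
    pub fn store_da_proposal(&mut self, view: u64, payload: Vec<u8>) {
        self.consensus.insert(view, payload);
    }

    /// Copy payloads for the decided views from consensus storage into query
    /// storage. Safe to call repeatedly. Returns the decided views for which
    /// no payload is available in either store.
    pub fn populate_query(&mut self, decided: &[u64]) -> Vec<u64> {
        let mut missing = Vec::new();
        for view in decided {
            match self.consensus.get(view) {
                Some(payload) => {
                    // Keep an existing query entry; otherwise copy from consensus.
                    self.query.entry(*view).or_insert_with(|| payload.clone());
                }
                None => {
                    if !self.query.contains_key(view) {
                        missing.push(*view);
                    }
                }
            }
        }
        missing
    }
}
```

If a node in this model shuts down after deciding but before `populate_query` runs, nothing is lost: the payloads are still in consensus storage, and calling `populate_query` again after the restart fills in the query storage.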
For the SQL backend, query service population is coupled with consensus storage garbage collection, so that we can delete old consensus data and move it to archival storage in a single atomic transaction.
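The all-or-nothing behavior can be illustrated without a database. The sketch below is an assumption-laden toy, not the SQL implementation: `Db` and `gc_and_archive` are hypothetical, a cloned copy plays the role of the transaction, and `max_payload_len` is a made-up constraint that exists only to trigger a failure partway through.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for the SQL database.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Db {
    /// Consensus storage: data keyed by view number.
    pub consensus: BTreeMap<u64, Vec<u8>>,
    /// Archival (query) storage.
    pub archive: BTreeMap<u64, Vec<u8>>,
}

/// Move every entry with view <= `decided_view` from consensus storage to
/// archival storage. Either all entries move or none do. Returns the number
/// of entries moved.
pub fn gc_and_archive(
    db: &mut Db,
    decided_view: u64,
    max_payload_len: usize, // made-up limit, used only to simulate a failure
) -> Result<usize, String> {
    // Stand-in for BEGIN: all changes go to a staged copy.
    let mut staged = db.clone();
    let views: Vec<u64> = staged
        .consensus
        .range(..=decided_view)
        .map(|(view, _)| *view)
        .collect();
    for view in &views {
        let payload = staged
            .consensus
            .remove(view)
            .expect("view was listed from the same map");
        if payload.len() > max_payload_len {
            // Stand-in for ROLLBACK: the staged copy is dropped, `db` is untouched.
            return Err(format!("payload for view {} is too large to archive", view));
        }
        staged.archive.insert(*view, payload);
    }
    // Stand-in for COMMIT: publish all changes at once.
    *db = staged;
    Ok(views.len())
}
```

Because deletion and archival happen in the same unit of work, a failure or shutdown in the middle can never leave data deleted from consensus storage but absent from the archive.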
Key places to review
- New test suite: sequencer/src/restart_tests.rs
- Storage refactor: primarily in types/src/v0/traits.rs, sequencer/src/persistence.rs, sequencer/src/persistence/sql.rs