
Zero-downtime restart and upgrade procedures

Open zah opened this issue 3 years ago • 4 comments

Software updates and planned restarts should be handled with zero downtime. This is a challenge because deploying a new installation takes time, and any restart involves reloading at least some run-time state from the database and reconnecting to the network.

To solve this problem and to address the planned long-term merge of our Eth1 and Eth2 clients, we can introduce a new scheme for our distributed binaries:

  1. There is a single binary called nimbus used to launch and control other processes.
  2. Specialized binaries for different functions, such as beacon_node, validator_client and eth1_node, exist in versioned subfolders.
  3. The upgrade procedure consists of deploying new versioned binaries and launching a hand-off procedure.
  4. The hand-off procedure starts the new binaries and allows them to sync with the network before assigning any validator duties to them.
  5. Once the new nodes are synced, the validator duties are re-assigned through a safe two-phase commit protocol.
  6. The user can roll back to a previous version quickly in case of problems.
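
The hand-off in steps 4 to 6 could be sketched roughly as follows. This is only an illustrative prepare/commit model in Python, not Nimbus code: the `Node` class, its methods and the `hand_off` function are hypothetical names invented for this sketch. The point it tries to show is the ordering: the new node must vote "ready" (synced) before anything changes, and the old node releases its validators before the new node activates them, so the two never sign for the same validator at once.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one versioned set of binaries."""
    version: str
    synced: bool = False
    active: set = field(default_factory=set)  # validators this node may sign for
    staged: set = field(default_factory=set)  # prepared, but not yet signing

    def prepare(self, validators):
        # Phase 1 vote: refuse unless this node is synced with the network.
        if not self.synced:
            return False
        self.staged = set(validators)
        return True

    def commit(self):
        # Phase 2: start signing for everything that was staged.
        self.active |= self.staged
        self.staged = set()

    def abort(self):
        self.staged = set()

    def release(self, validators):
        # Stop signing for the given validators.
        self.active -= set(validators)


def hand_off(old, new):
    """Move all validator duties from `old` to `new`; return True on success."""
    validators = set(old.active)

    # Phase 1: the new node must agree to take over before anything changes.
    if not new.prepare(validators):
        new.abort()
        return False  # the old node keeps its duties untouched

    # Phase 2: release first, then activate, so both never sign at once.
    old.release(validators)
    new.commit()
    return True


if __name__ == "__main__":
    old = Node("v1", synced=True, active={"validator-a", "validator-b"})
    new = Node("v2")
    print(hand_off(old, new))  # new node is not synced yet
    new.synced = True
    print(hand_off(old, new), sorted(new.active))
```

Rolling back in this model is the same call with the roles swapped, `hand_off(new, old)`. A real implementation would also have to carry the slashing protection history across and survive a crash between the release and the commit, which this sketch ignores.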

zah avatar Aug 20 '20 13:08 zah

If we implement it fully, this seems to introduce a lot of complexity for a relatively small benefit, in an area that already has a number of partial solutions. Generally, forwards and backwards compatibility is needed only for a small piece of the software: mainly the slashing protection database and the validator keys.

Most upgrades don't touch the database format and don't need a new sync, thus this complicated infrastructure is only occasionally needed.

It's relatively easy to start a new node, so what's really needed is a way to transfer keys & slashing protection reliably - the rest can already be solved with existing package managers (like nix), or simply by keeping installations in separate folders.
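
As a rough illustration of the "separate folders" approach, one common pattern is to keep each version in its own directory and switch a `current` symlink atomically. The sketch below assumes a POSIX filesystem; the directory layout, the `current` link name and the `activate` function are assumptions made for this example, not something Nimbus ships. It does not cover the transfer of keys and slashing protection data, which would live outside the versioned folders.

```python
import os
import tempfile


def activate(root, version):
    """Atomically point root/current at root/<version> (hypothetical layout)."""
    target = os.path.join(root, version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)

    link = os.path.join(root, "current")
    tmp = os.path.join(root, ".current.tmp")

    # Clean up a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp):
        os.remove(tmp)

    os.symlink(version, tmp)  # relative link, so `root` can be relocated
    os.replace(tmp, link)     # rename is atomic on POSIX filesystems


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    for v in ("v1", "v2"):
        os.mkdir(os.path.join(root, v))

    activate(root, "v1")
    activate(root, "v2")  # upgrade
    activate(root, "v1")  # roll back
    print(os.readlink(os.path.join(root, "current")))
```

A service unit or launch script would then always start the binary through the `current` link, so an upgrade or a rollback is a single `activate` call followed by a restart.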

Writing an orchestrator of this sort can easily balloon into a fully fledged system monitor à la systemd or similar offerings, which is way out of scope for the project.

arnetheduck avatar Aug 20 '20 16:08 arnetheduck

I think, with over one year of production hindsight, the most important things were for Nimbus to restart fast enough and to display the next validator duty time, so that users can choose a safe window.

The only part left is dealing with sync committee duties (#3281).
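
To make the "safe window" idea concrete, here is a small arithmetic sketch. It assumes the mainnet slot time of 12 seconds; the function names, the `restart_seconds` estimate and the default safety margin are hypothetical choices for illustration, and nothing here reads data from a running node. It also ignores sync committee duties, which recur every slot while a validator is on the committee and therefore leave no gap to restart in.

```python
SECONDS_PER_SLOT = 12  # mainnet value; other networks may differ


def seconds_until_duty(current_slot, next_duty_slot):
    """Time left before the next scheduled duty, never negative."""
    return max(0, (next_duty_slot - current_slot) * SECONDS_PER_SLOT)


def safe_to_restart(current_slot, next_duty_slot, restart_seconds, margin_seconds=24):
    """True if a restart of the estimated length fits before the next duty.

    `margin_seconds` is an arbitrary safety buffer chosen for this sketch.
    """
    available = seconds_until_duty(current_slot, next_duty_slot)
    return available >= restart_seconds + margin_seconds


if __name__ == "__main__":
    # Next duty is 10 slots away (120 s); a 60 s restart plus margin fits.
    print(safe_to_restart(100, 110, restart_seconds=60))
    # Next duty is 3 slots away (36 s); the same restart does not fit.
    print(safe_to_restart(100, 103, restart_seconds=60))
```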

mratsim avatar Mar 09 '22 10:03 mratsim

Development of this feature is still part of our GUI-only user experience roadmap. It's about having a simple command that the user can execute without worrying about missed attestations.

zah avatar Mar 09 '22 10:03 zah

https://github.com/status-im/infra-role-beacon-node-linux/commit/558b4069 provides an example of how to do this.

tersec avatar Jun 21 '22 14:06 tersec