
Hydra network failure resilience

Open ch1bo opened this issue 3 years ago • 1 comments

What & Why

The hydra-node currently does not re-submit network messages to other hydra-nodes. As a result, if the connection between two hydra-nodes breaks down while processing transactions, the Head stalls because of missing responses and needs to be restarted.

This feature will address this problem by:

  • Trying to re-establish network connections in a reliable manner
  • Re-submitting past messages (up to some point) once re-connected

As a consequence, short outages can be handled gracefully and the Head can continue to process transactions once the connection is re-established.
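
To make the "re-submit past messages (up to some point)" idea concrete, here is a minimal sketch of a bounded resend buffer. It is not how hydra-node works today: the per-message sequence numbers, the `ResendBuffer` name and the idea that a peer reports the last sequence number it saw are all assumptions made for illustration, and Python is used only to keep the sketch short.

```python
from collections import deque


class ResendBuffer:
    """Bounded log of outgoing messages, replayed after a reconnect.

    Hypothetical helper: sequence numbers are an assumption of this sketch.
    """

    def __init__(self, capacity):
        # Oldest entries are dropped once capacity is reached; this is the
        # "up to some point" bound, picked arbitrarily by the caller.
        self._log = deque(maxlen=capacity)
        self._next_seq = 0

    def record(self, message):
        """Store an outgoing message and return its sequence number."""
        seq = self._next_seq
        self._log.append((seq, message))
        self._next_seq += 1
        return seq

    def replay_after(self, last_seen):
        """Return the messages the peer has not seen yet, oldest first.

        Raises LookupError if some of them were already evicted, i.e. the
        outage was too long to recover from the buffer alone.
        """
        if last_seen + 1 >= self._next_seq:
            return []  # peer is up to date
        oldest = self._log[0][0]
        if last_seen + 1 < oldest:
            raise LookupError("outage exceeded resend buffer")
        return [msg for seq, msg in self._log if seq > last_seen]
```

With a capacity of 3 and five recorded messages, a peer that last saw sequence number 1 gets the three newest messages back, while a peer that last saw 0 hits the `LookupError` case, which corresponds to the longer down times that are out of scope.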

Out of scope: Longer down times (depending on the contestation period Hydra protocol parameter) are not covered!

Requirements

  • The hydra-node tries to re-establish lost connections to other hydra-nodes
  • Network messages are re-sent, such that the Head can progress if it has not been interrupted for longer than TBD (number of messages, time?).
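
One common way to meet the first requirement is to retry with a capped exponential backoff. The sketch below only illustrates that pattern under assumed parameters: `reconnect`, `connect`, the delays and the attempt limit are hypothetical and not taken from hydra-node.

```python
import time


def reconnect(connect, base_delay=0.5, max_delay=30.0, max_attempts=8,
              sleep=time.sleep):
    """Call connect() until it succeeds, backing off between attempts.

    Returns whatever connect() returns. Re-raises the last ConnectionError
    once max_attempts is exhausted. All defaults are illustrative.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up: the outage is longer than we cover
            sleep(delay)
            # Double the wait each time, but never exceed max_delay.
            delay = min(delay * 2, max_delay)
```

The `sleep` parameter exists only so the waiting can be replaced in tests. The attempt limit is where the open "TBD" bound would plug in if it ends up being expressed as time rather than number of messages.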

To be discussed

  • How often do we expect connections to just die away?
  • How long do we re-submit network messages?
  • Should we prevent submitting new transactions while not connected to all (configured?) peers?
  • Doing this in a pull-based manner might be more meaningful
  • What about the HeadLogic, will it loop on old message requests? How long to re-enqueue messages (also applies to Wait outcomes)?
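
To give the pull-based option some shape for discussion, here is a minimal receiver-side sketch: each node tracks the last in-order message it got from a peer and, on seeing a gap, asks for the first missing one instead of waiting for the sender to push it again. The sequence numbers, the `PeerInbox` name and the `"request"` outcome are assumptions of this sketch, not existing hydra-node concepts.

```python
class PeerInbox:
    """Tracks the last in-order sequence number received from one peer.

    Hypothetical helper for discussion; not part of hydra-node.
    """

    def __init__(self):
        self.last_seen = -1  # nothing received yet

    def receive(self, seq, message):
        """Classify an incoming message.

        Returns ("deliver", message) for the next expected message,
        ("duplicate", None) for something already seen, or
        ("request", first_missing) when a gap is detected.
        """
        if seq <= self.last_seen:
            return ("duplicate", None)
        if seq == self.last_seen + 1:
            self.last_seen = seq
            return ("deliver", message)
        # A gap: ask the peer to send again from the first missing message.
        return ("request", self.last_seen + 1)
```

In this shape the receiver decides what to ask for, so the sender does not need to guess how long to keep re-submitting. It does not answer how long the HeadLogic should keep re-enqueueing messages.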

ch1bo avatar Jan 30 '22 17:01 ch1bo

We have not experienced this much (it might be more relevant if we had UDP transport) and have no concrete user requests for this -> prioritize lower, still aiming for 1.0.0 but maybe not.

ch1bo avatar May 31 '22 12:05 ch1bo