
Hydra network failure resilience

Open ch1bo opened this issue 3 years ago • 1 comments


What & Why

The hydra-node currently does not re-submit network messages to other hydra-nodes. As a result, if the connection between two hydra-nodes breaks down while processing transactions, the Head stalls because of missing responses and needs to be restarted.

This feature will address this problem by:

Trying to re-establish network connections in a reliable manner
Re-submitting past messages (up to some point) once re-connected

As a consequence, short outages can be handled gracefully and the Head can continue to process transactions once the connection is re-established.
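The reconnection part of this proposal could look roughly like the following. This is a minimal sketch in Python, not code from hydra-node: the names `backoff_delays` and `reconnect`, the exponential backoff policy, and the default delays and attempt limit are all assumptions made for illustration. The `connect` and `sleep` callables are injected so the retry logic can be exercised without a real network.

```python
from typing import Callable, Iterable, Iterator, Optional


def backoff_delays(base: float = 0.5, cap: float = 30.0) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2


def reconnect(
    connect: Callable[[], bool],
    delays: Iterable[float],
    sleep: Callable[[float], None],
    max_attempts: int = 10,
) -> Optional[int]:
    """Call `connect` until it succeeds.

    Returns the number of the successful attempt (starting at 1),
    or None if all `max_attempts` attempts failed.
    """
    for attempt, delay in zip(range(1, max_attempts + 1), delays):
        if connect():
            return attempt
        sleep(delay)  # wait before the next attempt
    return None
```

In a real node `sleep` would be `time.sleep` (or an async equivalent) and `connect` would open the transport to one peer. Capping the delay keeps a node from waiting arbitrarily long after a short outage.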

Out of scope: Longer downtimes (depending on the contestation period, a Hydra protocol parameter) are not covered!

Requirements

Each hydra-node tries to re-establish lost connections to other hydra-nodes
Network messages are re-sent, such that the Head can progress if it has not been interrupted for longer than TBD (number of messages, time?).
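One way to make the "TBD" bound concrete is to cap the number of retained messages. The sketch below assumes a per-peer buffer of the last `capacity` sent messages, each tagged with a sequence number, and a peer that reports the last sequence number it saw after reconnecting. `ResendBuffer`, `record`, and `since` are hypothetical names, and bounding by message count rather than by time is only one of the options the requirement leaves open.

```python
from collections import deque
from typing import Deque, List, Tuple


class ResendBuffer:
    """Keep the last `capacity` sent messages, tagged with sequence numbers."""

    def __init__(self, capacity: int) -> None:
        # Assumes capacity >= 1; a deque with maxlen evicts the oldest entry.
        self._log: Deque[Tuple[int, str]] = deque(maxlen=capacity)
        self._next_seq = 0

    def record(self, message: str) -> int:
        """Store a sent message and return its sequence number."""
        seq = self._next_seq
        self._log.append((seq, message))
        self._next_seq += 1
        return seq

    def since(self, last_seen: int) -> List[Tuple[int, str]]:
        """Return the messages after `last_seen`, oldest first.

        Raises LookupError if some of them were already evicted, that is,
        the outage lasted longer than the buffer can cover.
        """
        if self._log and last_seen + 1 < self._log[0][0]:
            raise LookupError("peer is too far behind; outage exceeded the buffer")
        return [(seq, msg) for seq, msg in self._log if seq > last_seen]
```

A peer that has seen nothing yet would pass `-1`. The `LookupError` case corresponds to the out-of-scope situation: the gap cannot be closed by re-sending alone.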

To be discussed

How often do we expect connections to just die away?
How long do we re-submit network messages?
Prevent submitting new transactions while not connected to all (configured?) peers
Doing this in a pull-based manner might be more meaningful
What about the HeadLogic, will it loop on old message requests? How long to re-enqueue messages (also applies to Wait outcomes)?
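The open question about refusing new transactions while peers are missing can be illustrated with a small connectivity gate. This is a sketch under assumptions, not hydra-node behaviour: `PeerTracker` and its methods are hypothetical names, and "connected to all configured peers" is only one possible reading of the question above.

```python
from typing import Iterable, List, Set


class PeerTracker:
    """Track which configured peers are currently connected."""

    def __init__(self, configured: Iterable[str]) -> None:
        self.configured: Set[str] = set(configured)
        self.connected: Set[str] = set()

    def on_connected(self, peer: str) -> None:
        # Ignore peers that are not part of the configuration.
        if peer in self.configured:
            self.connected.add(peer)

    def on_disconnected(self, peer: str) -> None:
        self.connected.discard(peer)

    def missing(self) -> List[str]:
        """Configured peers that are currently not connected, sorted."""
        return sorted(self.configured - self.connected)

    def can_submit(self) -> bool:
        """True only while every configured peer is connected."""
        return not self.missing()
```

A client-facing API could consult `can_submit()` before accepting a transaction and report `missing()` in the rejection, so that users see why the Head is not making progress.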

Jan 30 '22 17:01 ch1bo

We have not experienced this much (it might be more relevant if we had UDP transport) and have no concrete user requests for this -> prioritize lower, still aiming for 1.0.0 but maybe not.

May 31 '22 12:05 ch1bo
