Hydra network failure resilience
What & Why
The hydra-node currently does not re-submit network messages to other hydra-nodes. As a result, if the connection between two hydra-nodes breaks down while processing transactions, the Head stalls because of missing responses and needs to be restarted.
This feature will address this problem by:
- Trying to re-establish network connections in a reliable manner
- Re-submitting past messages (up to some point) once re-connected
As a consequence, short outages can be handled gracefully and the Head can continue to process transactions once the connection is re-established.
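To make "re-establish connections in a reliable manner" more concrete, here is a minimal sketch of one possible retry policy: capped exponential backoff. It is written in Python for illustration only and is not the hydra-node implementation. The names `backoff_delays` and `connect_with_retry`, the default delays, and the attempt limit are all assumptions.

```python
from typing import Callable, Iterator


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0) -> Iterator[float]:
    """Yield an unbounded sequence of capped exponential delays in seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_retry(
    connect: Callable[[], object],
    sleep: Callable[[float], None],
    max_attempts: int = 10,
) -> object:
    """Call `connect` until it succeeds, sleeping between failed attempts.

    `connect` and `sleep` are injected (hypothetical hooks) so the policy can
    be exercised without a real socket or real waiting.
    Raises ConnectionError once `max_attempts` attempts have failed.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts:
                raise ConnectionError(f"gave up after {attempt} attempts")
            sleep(next(delays))
    raise ConnectionError("max_attempts must be at least 1")
```

With the assumed defaults, the waits between attempts are 0.5 s, 1 s, 2 s and so on, up to a 30 s ceiling. Whether retries should ever give up is one of the open questions for this feature.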
Out of scope: Longer down times (depending on the contestation period Hydra protocol parameter) are not covered!
Requirements
- The hydra-node tries to re-establish connections to other hydra-nodes.
- Network messages are re-sent, such that the Head can progress if it has not been interrupted for longer than TBD (number of messages, time?).
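The still-open bound ("number of messages, time?") could be enforced on both axes at once. The following sketch shows a hypothetical send-side buffer for a single peer that drops messages once they are acknowledged, too old, or too numerous. The class and field names, the acknowledgement scheme, and the use of sequence numbers are assumptions, not part of the current hydra-node.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class Sent:
    seq: int        # sequence number assigned by the sender
    sent_at: float  # timestamp in seconds, supplied by the caller
    payload: bytes


class ResendBuffer:
    """Keep recently sent messages so they can be replayed after a reconnect."""

    def __init__(self, max_messages: int, max_age: float) -> None:
        self.max_messages = max_messages
        self.max_age = max_age
        self._next_seq = 0
        self._sent: Deque[Sent] = deque()

    def record(self, payload: bytes, now: float) -> int:
        """Store an outgoing message and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._sent.append(Sent(seq, now, payload))
        self._prune(now)
        return seq

    def acknowledge(self, seq: int) -> None:
        """Drop every message up to and including `seq`."""
        while self._sent and self._sent[0].seq <= seq:
            self._sent.popleft()

    def pending(self, now: float) -> List[Sent]:
        """Messages to re-send after a reconnect, oldest first."""
        self._prune(now)
        return list(self._sent)

    def _prune(self, now: float) -> None:
        # Enforce both limits: drop entries that are too old, then trim to size.
        while self._sent and now - self._sent[0].sent_at > self.max_age:
            self._sent.popleft()
        while len(self._sent) > self.max_messages:
            self._sent.popleft()
```

A node would keep one such buffer per peer and replay `pending(now)` after re-connecting. A message that was pruned before the peer came back can no longer be replayed, which is how the sketch models an outage that lasted too long.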
To be discussed
- How often do we expect connections to just die away?
- How long do we re-submit network messages?
- Prevent submitting new transactions while not connected to all (configured?) peers
- Doing this in a pull-based manner might be more meaningful
- What about the HeadLogic, will it loop on old message requests? How long to re-enqueue messages (also applies to Wait outcomes)?
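As an input to the pull-based question above, here is a rough sketch of that variant: each sender keeps an indexed log, and a re-connected receiver asks for everything from the first index it has not yet seen. `MessageLog` and `Puller` are hypothetical names, and the log is left unbounded here, which a real design would have to address.

```python
from typing import Dict, List, Tuple


class MessageLog:
    """Sender side: an append-only log that peers can read from by index."""

    def __init__(self) -> None:
        self._entries: List[bytes] = []

    def append(self, payload: bytes) -> int:
        """Store a message and return its index in the log."""
        self._entries.append(payload)
        return len(self._entries) - 1

    def read_from(self, start: int) -> List[Tuple[int, bytes]]:
        """Return (index, payload) pairs for every entry at or after `start`."""
        return [(i, self._entries[i]) for i in range(start, len(self._entries))]


class Puller:
    """Receiver side: remembers, per peer, the next index it still needs."""

    def __init__(self) -> None:
        self._next_index: Dict[str, int] = {}

    def pull(self, peer: str, log: MessageLog) -> List[bytes]:
        """Fetch all messages from `peer` that have not been seen yet."""
        start = self._next_index.get(peer, 0)
        entries = log.read_from(start)
        if entries:
            self._next_index[peer] = entries[-1][0] + 1
        return [payload for _, payload in entries]
```

In this shape the receiver decides what is missing, so the sender does not need to track acknowledgements or guess how far back to re-send.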
We have not experienced this much (it might be more relevant if we had a UDP transport) and have no concrete user requests for this -> prioritize lower; still aiming for 1.0.0, but maybe not.