texera
texera copied to clipboard
Amber Fault Tolerance: Global Recovery and Detection
This PR finished recovery, with some refactoring changes for the DP Thread and network communication actor:
- Added global recovery manager which manages the workflow recovery state. It notifies the client when recovery starts and completes.
- Uiltized Akka cluster to detect node failures. If Akka says one node is removed from the cluster, we let every controller know about this event and start global recovery. For now, we use the first node in the cluster to deploy respawned node. This need to be resource-aware to prevent overloading the first node.
- This fault tolerance implementation only supports workflow without python UDF operators.