texera icon indicating copy to clipboard operation
texera copied to clipboard

Amber Fault Tolerance: Global Recovery and Detection

Open shengquan-ni opened this issue 2 years ago • 0 comments

This PR finished recovery, with some refactoring changes for the DP Thread and network communication actor:

  1. Added global recovery manager which manages the workflow recovery state. It notifies the client when recovery starts and completes.
  2. Uiltized Akka cluster to detect node failures. If Akka says one node is removed from the cluster, we let every controller know about this event and start global recovery. For now, we use the first node in the cluster to deploy respawned node. This need to be resource-aware to prevent overloading the first node.
  3. This fault tolerance implementation only supports workflow without python UDF operators.

shengquan-ni avatar Sep 29 '22 06:09 shengquan-ni