Amber Fault Tolerance: Local Recovery

Open shengquan-ni opened this issue 2 years ago • 0 comments

This PR mainly adds a replay mechanism to both the controller and the worker. Along with that, I made the following changes:

  1. Refactored logging by extracting a common interface from the logging implementation. This allows an empty logger/storage, so the code no longer needs many checks for whether fault tolerance is enabled.
  2. In the network communication actor, we used to send fire-and-forget messages to register each actorVid with its actor ref. Now the main thread waits for the registration to complete, which avoids possible issues when we want to send a message but the actor ref is not yet registered.
  3. Improved the flow of worker shutdown. It is now a more synchronized process with the following shutdown sequence:
    • the worker actor receives shutdown
    • it waits for the logging thread to shut down, if fault tolerance is enabled
    • it waits for the DP thread to shut down
    • it shuts itself down and asynchronously shuts down its network sender.
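
The first and third points above can be illustrated together. What follows is a minimal Python sketch, not the engine's actual code: `EmptyLogger`, `ThreadedLogger`, and `Worker` are hypothetical names, and plain threads stand in for the logging thread, the DP thread, and the network sender. It shows how a no-op logger behind a common interface lets the shutdown path run without an explicit "is fault tolerance enabled" check.

```python
import queue
import threading


class EmptyLogger:
    """No-op logger used when fault tolerance is disabled (hypothetical name)."""

    def log(self, record):
        pass  # discard the record

    def shutdown(self):
        pass  # no logging thread exists, so there is nothing to wait for


class ThreadedLogger:
    """Logger backed by a background thread (hypothetical stand-in)."""

    _STOP = object()

    def __init__(self):
        self.records = []
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                return
            self.records.append(item)

    def log(self, record):
        self._queue.put(record)

    def shutdown(self):
        # Block until every queued record is written and the thread exits.
        self._queue.put(self._STOP)
        self._thread.join()


class Worker:
    """Toy worker that follows the shutdown sequence listed above."""

    def __init__(self, logger):
        self.logger = logger
        self.events = []  # records the shutdown order for inspection
        self._stop_dp = threading.Event()
        # The "DP thread" here simply idles until it is told to stop.
        self._dp_thread = threading.Thread(target=self._stop_dp.wait)
        self._dp_thread.start()
        self.sender_thread = None

    def shutdown(self):
        self.events.append("received shutdown")
        self.logger.shutdown()  # same call whether or not logging is enabled
        self.events.append("logger stopped")
        self._stop_dp.set()
        self._dp_thread.join()
        self.events.append("dp stopped")
        self.events.append("worker stopped")
        # The sender is stopped asynchronously: started, but not joined here.
        self.sender_thread = threading.Thread(
            target=lambda: self.events.append("sender stopped")
        )
        self.sender_thread.start()
```

Passing `EmptyLogger()` or `ThreadedLogger()` to `Worker` changes only what `logger.shutdown()` does; the shutdown sequence itself stays the same.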

The following are some issues in the code that broke the recovery process. I've fixed them in this PR.

  1. The return message for QuerySelfWorkloadMetrics has mutable state, and it is modified in the controller's handler. Because of the buffer design, all pre-committed log records are just pointers in memory, so if we modify the message content during processing, the log records are modified as well, which means the original input message is gone.
  2. previousCallFinished of MonitoringHandler is in the global scope. Neither the controller nor any worker is supposed to modify or use any global state, because doing so makes its behavior non-deterministic.

To make sure fault tolerance keeps working in the future, we must be aware of the above issues when committing code to the engine.
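
The first issue is an aliasing problem, and a small Python sketch can reproduce it. Everything here is hypothetical (`WorkloadMetrics`, `LogBuffer`, `handle_reply`, and the `snapshot` flag are illustrative names, not the engine's API). The buffer holds uncommitted records by reference, so a handler that mutates its input also rewrites the record that replay would later read.

```python
import copy
from dataclasses import dataclass


@dataclass
class WorkloadMetrics:
    """Hypothetical mutable reply payload."""
    unprocessed_count: int


class LogBuffer:
    """Holds pre-committed records in memory until they are flushed."""

    def __init__(self, snapshot):
        self.snapshot = snapshot
        self.pending = []

    def append(self, message):
        # Without a snapshot, the buffer stores only a reference to the message.
        stored = copy.deepcopy(message) if self.snapshot else message
        self.pending.append(stored)


def handle_reply(metrics):
    # A handler that mutates its input in place, as in the issue described above.
    metrics.unprocessed_count = 0


def run(snapshot):
    """Log a reply, process it, and return what the buffered record now says."""
    buffer = LogBuffer(snapshot)
    reply = WorkloadMetrics(unprocessed_count=7)
    buffer.append(reply)
    handle_reply(reply)
    return buffer.pending[0].unprocessed_count
```

Here `run(False)` returns the mutated value, while `run(True)` returns the original value of 7. Copying on append is only one way to avoid the problem and is not necessarily what this PR does; keeping messages immutable, or having handlers work on their own copy, achieves the same result.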

shengquan-ni · Sep 25 '22 22:09