Implement Checkpoint in Dolphin
Dolphin currently has little support for fault tolerance. To handle failed Tasks/Contexts, we need to 1) clarify which states must be maintained, and 2) create a checkpoint mechanism to store and load the information necessary for restoration (e.g., model parameters, iteration number).
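To make the store/load requirement concrete, here is a minimal sketch in Python. It is not Dolphin's API: the `Checkpoint` class, `save_checkpoint`, `load_checkpoint`, and the JSON file layout are all hypothetical names chosen for illustration. It assumes the state to restore is just an iteration number plus named parameter vectors.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class Checkpoint:
    """Hypothetical minimal state needed to resume a job."""
    iteration: int
    model_params: Dict[str, List[float]]


def save_checkpoint(ckpt: Checkpoint, path: str) -> None:
    """Write to a temp file, then rename it over the target.

    The rename is atomic, so a crash mid-write leaves the previous
    checkpoint intact instead of a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(ckpt), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Optional[Checkpoint]:
    """Return the stored checkpoint, or None if nothing was saved yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        data = json.load(f)
    return Checkpoint(iteration=data["iteration"],
                      model_params=data["model_params"])
```

On restart, a job would call `load_checkpoint` first and resume from the stored iteration if a checkpoint exists, or start from iteration 0 otherwise.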
Two strategies are possible for checkpointing:
- Stop-the-world
- Asynchronous
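The two differ in a trade-off between performance and correctness: stop-the-world pauses computation so that the saved state is consistent, at the cost of idle time, while asynchronous checkpointing keeps computation running at the risk of saving state that changes while the checkpoint is being taken.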
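The difference between the two strategies can be sketched as follows. This is an illustrative Python sketch, not Dolphin code: `Trainer`, `checkpoint_stop_the_world`, `checkpoint_async`, and the `write` callback are hypothetical names, and a single lock stands in for whatever coordination a real distributed job would need.

```python
import copy
import threading
from typing import Callable, Dict


class Trainer:
    """Hypothetical trainer holding the state a checkpoint would capture."""

    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.state: Dict[str, object] = {"iteration": 0, "params": [0.0, 0.0]}

    def step(self) -> None:
        # One training iteration: update parameters and the counter together.
        with self.lock:
            self.state["params"] = [p + 1.0 for p in self.state["params"]]
            self.state["iteration"] += 1


def checkpoint_stop_the_world(trainer: Trainer,
                              write: Callable[[dict], None]) -> None:
    # Training is blocked for the whole copy AND the (slow) write.
    with trainer.lock:
        write(copy.deepcopy(trainer.state))


def checkpoint_async(trainer: Trainer,
                     write: Callable[[dict], None]) -> threading.Thread:
    # Training is blocked only for the in-memory copy; the slow write
    # runs in a background thread while training continues.
    with trainer.lock:
        snapshot = copy.deepcopy(trainer.state)
    worker = threading.Thread(target=write, args=(snapshot,))
    worker.start()
    return worker
```

Note that this asynchronous variant still pauses briefly to copy the state, which keeps the snapshot consistent. Dropping that copy would remove the pause entirely, but the writer could then observe parameters from one iteration alongside the counter from another, which is the correctness side of the trade-off.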