Implement Checkpoint in Dolphin

Open yunseong opened this issue 8 years ago • 1 comments

Dolphin does not yet give much consideration to fault tolerance. To handle failed Tasks/Contexts, we need to 1) clarify which state must be maintained, and 2) create a checkpoint mechanism to store and load the information necessary for restoration (e.g., model parameters, iteration number).
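As a rough illustration of point 2, here is a minimal sketch of a store/load pair. It is not Dolphin's API: `save_checkpoint`, `load_checkpoint`, the JSON file format, and the local-file storage are all assumptions made for the example. The state is written to a temporary file and renamed into place, so a failure during the write does not leave a truncated checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, iteration, params):
    """Store the state needed for restoration (hypothetical helper).

    Writes to a temporary file in the same directory, then renames it
    over the target, so readers never see a half-written checkpoint.
    """
    state = {"iteration": iteration, "params": params}
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (iteration, params), or (0, None) if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        state = json.load(f)
    return state["iteration"], state["params"]
```

A restarted Task would call `load_checkpoint` first and resume from the returned iteration rather than from zero.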

yunseong · Jul 06 '16 06:07

Two checkpointing strategies are possible:

  1. Stop-the-world
  2. Asynchronous

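The two trade off performance against correctness: stop-the-world pauses computation to capture a consistent state, while asynchronous checkpointing keeps computation running but may capture state that is not consistent.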
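To make the difference between the two strategies concrete, below is a small single-threaded simulation. It is only a sketch under stated assumptions: `train_step`, `run_stop_the_world`, and `run_asynchronous` are hypothetical names, and the "asynchronous" copy is modeled as copying one parameter per iteration while training continues, which is a simplification of real concurrent behavior.

```python
def train_step(params, iteration):
    """Hypothetical update: set every parameter to the iteration number,
    so a snapshot reveals which iteration each value came from."""
    for i in range(len(params)):
        params[i] = iteration


def run_stop_the_world(num_params, checkpoint_at, total_iters):
    """Training waits while the whole state is copied at one iteration."""
    params = [0] * num_params
    snapshot = None
    for it in range(1, total_iters + 1):
        train_step(params, it)
        if it == checkpoint_at:
            snapshot = list(params)  # no updates happen during this copy
    return snapshot


def run_asynchronous(num_params, checkpoint_at, total_iters):
    """Training keeps going; the copy advances one parameter per iteration."""
    params = [0] * num_params
    snapshot = []
    for it in range(1, total_iters + 1):
        train_step(params, it)
        if it >= checkpoint_at and len(snapshot) < num_params:
            snapshot.append(params[len(snapshot)])
    return snapshot


print(run_stop_the_world(3, 2, 5))  # [2, 2, 2]
print(run_asynchronous(3, 2, 5))    # [2, 3, 4]
```

In this toy model the stop-the-world snapshot holds values from a single iteration, while the asynchronous one mixes values from iterations 2, 3, and 4. Whether such a mixed state is acceptable depends on the algorithm being checkpointed.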

yunseong · Oct 04 '16 09:10