Actors.jl
Implement basic checkpointing
Now, with basic error handling in place (see issue #16 and the description in the manual), there remains the issue of maintaining (saving and restoring) actor state at termination and restart.
For actor and task restart (by supervisors), `checkpoint` and `restore` are an important option, so that actor state can be restored at restart.
Actor initialization and termination with user-defined callback functions
- [x] develop `init!` functionality,
- [x] develop `term!` functionality,
- [x] implement the restart strategy described below.
User-defined checkpointing:
- [x] basic `checkpointing` actor,
- [x] `checkpoint` call,
- [x] `restore` call,
- [x] checkpointing hierarchy,
- [x] checkpointing interval for 2nd level.
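A minimal sketch of the intended semantics of this checklist, with the checkpointing state reduced to a plain struct (the names follow the checklist above; the signatures are assumptions, not the final API):

```julia
# Hypothetical sketch, not the implemented API: the behavior of a basic
# `checkpointing` actor keeping the latest checkpoints in a Dict.
struct Checkpointing
    data::Dict{Symbol,Any}
end
Checkpointing() = Checkpointing(Dict{Symbol,Any}())

# `checkpoint` stores a deep copy so that later mutation of the original
# state cannot corrupt the saved checkpoint.
checkpoint!(c::Checkpointing, key::Symbol, state) = (c.data[key] = deepcopy(state); nothing)

# `restore` delivers the last checkpoint for `key`, or `nothing`.
restore(c::Checkpointing, key::Symbol) = get(c.data, key, nothing)
```

With `Actors` this state would live inside a spawned actor and be addressed with messages; it is reduced to plain functions here only to show the intended semantics.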
Integration
- [ ] tests,
- [ ] documentation,
- [ ] examples.
There are multiple ways to realize checkpointing with `Actors`:

- explicit, user-defined checkpointing based on the functionality mentioned above,
- use checkpoints in `init!` and `term!` callback functions,
- saving intermediate results to a guarded variable, where the `:guard` actor does the checkpointing,
- checkpoints served by a `:genserver` actor,
- function memoization saved to a guarded/checkpointed `Dict`,
- ...
This leads to the impression that only the basic checkpointing functionality (items 1 and 2) should go into `Actors`, while the more sophisticated approaches should be realized in a framework library like Checkpoints.jl. Maybe people working with HPC like @oschulz could comment on this.
Init/terminate and restart strategy
Checkpointing must be integrated with the `init`, `term` and `restart` callbacks:
- A user can provide a termination callback taking a `checkpoint` of an actor's acquaintance variables upon abnormal actor exit.
- A user can provide an init and/or restart callback to look for a checkpoint and possibly `restore` it before actor start.
- Then a monitor can be specified to call an init or restart `action` on actor termination.
- Supervisors should do the following for actor restart:
    - if specified, call a `restart` function (must return a link to a started actor),
    - else, if specified, call an `init` function (must return a link to a started actor),
    - else `spawn` an actor with the last behavior.
In any case, to enable the 4th point, a failed actor's variables must be delivered to the supervisor on abnormal exit (already done).
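A hedged sketch of that restart precedence (the `Child` struct and its fields are assumptions for illustration; only `spawn` is actual Actors.jl vocabulary):

```julia
using Actors

# Sketch only: how a supervisor could select the restart path listed above.
struct Child
    bhv::Any                            # last behavior of the failed actor
    init::Union{Function,Nothing}       # optional init callback
    restart::Union{Function,Nothing}    # optional restart callback
end

function restart_child(c::Child)
    if c.restart !== nothing
        c.restart()        # must return a link to a started actor
    elseif c.init !== nothing
        c.init()           # must return a link to a started actor
    else
        spawn(c.bhv)       # else spawn an actor with the last behavior
    end
end
```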
Enable multilevel checkpointing
In-memory checkpointing has been demonstrated to be fast and scalable. In addition, multilevel checkpointing technologies ... are increasingly popular. Multilevel checkpointing involves combining several storage technologies (multiple levels) to store the checkpoint, each level offering specific overhead and reliability trade-offs. The checkpoint is first stored in the highest performance storage technology, which generally is the local memory or a local SSD. This level supports process failure but not node failure. The second level stores the checkpoint in remote memory or a remote SSD. This second level supports a single-node failure. The third level corresponds to encoding the checkpoints in blocks and distributing the blocks across multiple nodes. This approach supports multinode failures. Generally, the last level of checkpointing is still the parallel file system. This last level supports catastrophic failures such as a full system outage. (see: D6.1 Preliminary Report on State of the Art Software-Based Fault Tolerance Techniques)
The `checkpointing` actor should support multilevel checkpointing as a combination of user-level and application-level checkpointing. The concept is as follows:

- The user actor takes application-specific `checkpoint`s to a `checkpointing` actor running on the same node (under the same `supervisor`). This usually is an in-memory checkpoint.
- Then a 2nd-level `checkpoint` is taken at regular intervals by an aggregating `checkpointing` actor residing on another node. The aggregation is done from 1st-level or other 2nd-level actors.
- The third level is the organization of the checkpointing and restart/restore hierarchy.

The first two levels should be realized in `Actors`. The third level can then be realized in a separate framework library Checkpointing.jl.
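An illustrative sketch of the first two levels, with `Dict`s standing in for the checkpointing actors (all names here are assumptions, not the implemented API):

```julia
# Level 1: in-memory store on the user actor's node; level 2: aggregate
# on another node. Dicts stand in for the checkpointing actors.
level1 = Dict{Symbol,Any}()
level2 = Dict{Symbol,Any}()

# the user actor takes application-specific checkpoints to level 1
take_checkpoint!(key::Symbol, state) = (level1[key] = deepcopy(state))

# a 2nd-level checkpoint is taken at regular intervals by aggregating
# the current 1st-level checkpoints
function aggregate!(interval)
    @async while true
        sleep(interval)
        merge!(level2, deepcopy(level1))
    end
end
```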
Multilevel checkpointing mechanism
- 1st level checkpoints must be inexpensive (in-memory). Thus they can be taken frequently. ¹
- 2nd level checkpoints can consist of arbitrary hierarchy levels.
- They are usually triggered and/or saved at regular intervals (can be parametrized for each actor).
- When they are triggered, they collect the current checkpoints from all inferior checkpointing actors.
- 2nd level checkpoints and save routines can also be triggered from inferior actors. In that case such requests are propagated upwards in the checkpointing hierarchy until the required level is reached.
- Therefore the `checkpoint` function must have a `level` argument.
- The `restore` function also must have a `level` argument. The restoring then is done from the chosen level.
---

¹ 1st level checkpoints are triggered by a user actor. Automatic checkpointing at the 1st level must wait until Julia Atomics is ready.
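A sketch of how the `level` argument could behave, again with `Dict`s standing in for the actors of the hierarchy (hypothetical signatures, not the final API):

```julia
# Hypothetical signatures: `level` selects how far up the hierarchy a
# checkpoint is propagated and from which level a restore is read.
const levels = [Dict{Symbol,Any}() for _ in 1:3]

function checkpoint(key::Symbol, state; level::Int=1)
    for l in 1:level                      # the request propagates upwards
        levels[l][key] = deepcopy(state)  # until the required level is reached
    end
end

restore(key::Symbol; level::Int=1) = get(levels[level], key, nothing)
```

Usage would then look like `checkpoint(:acc, 42; level=2)` and `restore(:acc; level=2)`.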
Checkpointing intervals
`Update` and `Save` intervals must be configurable, and it must also be possible to turn automatic checkpointing on and off.
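This might be as simple as a small parameter struct per checkpointing actor (field names are assumptions for illustration):

```julia
# Sketch of per-actor checkpointing parameters; `on` switches automatic
# checkpointing on and off.
mutable struct CheckpointInterval
    update::Float64   # seconds between collecting inferior checkpoints
    save::Float64     # seconds between saving to the next level
    on::Bool          # automatic checkpointing enabled?
end

cfg = CheckpointInterval(10.0, 60.0, true)
cfg.on = false        # turn automatic checkpointing off
```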
Detect and handle node failures
Supervisors looking after actors on other workers must detect node failures and handle them appropriately.
- They start a task that
    - polls `isready` on the `RemoteChannel` of the supervised actor and
    - sends a `ProcessExitedException` as an `Exit` message to the supervisor.
- They treat those exceptions differently from actor `Exit`s since they don't contain actor state.
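A minimal sketch of such a polling task, assuming a stand-in `Exit` wrapper and a plain `Channel` in place of the supervisor's mailbox:

```julia
using Distributed

struct Exit                      # stand-in for Actors' Exit message
    reason::Exception
end

# Poll isready on the supervised actor's RemoteChannel; if the hosting
# process has exited, deliver the exception as an Exit to the supervisor.
function watch_remote(rc::RemoteChannel, mailbox::Channel)
    @async while true
        try
            isready(rc)          # throws ProcessExitedException on node failure
            sleep(1)
        catch exc
            exc isa ProcessExitedException || rethrow()
            put!(mailbox, Exit(exc))
            break
        end
    end
end
```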
Tasks on other nodes must have a supervisor actor on their node. That one can be supervised by a remote supervisor as described above.
> Tasks on other nodes must have a supervisor actor on their node. That one can be supervised by a remote supervisor as described above.
This is not a good strategy to work with! The consequence of that requirement would be to have two supervisor levels where we need only one. Therefore:
Introduce an actor for node supervision
If a supervisor gets a child on another node, it starts another helper child, responsible for regularly scanning the connections to foreign actors. If a connection gives a `ProcessExitedException`, the helper informs the supervisor about the failed `PID` so that it can act accordingly and restart the foreign actors on another node. The helper must send only one `Exit` message to the supervisor for a failed `PID`, even if that `PID` hosts multiple child actors.
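The "only one `Exit` per failed `PID`" rule could be a simple set-based deduplication in the helper (a sketch; `mailbox` again stands in for the supervisor's link):

```julia
using Distributed

const reported = Set{Int}()      # worker pids already reported as failed

# Inform the supervisor exactly once per failed pid, even if several
# supervised children were hosted on that worker.
function report_failure!(mailbox::Channel, pid::Int, exc::ProcessExitedException)
    pid in reported && return
    push!(reported, pid)
    put!(mailbox, exc)           # stand-in for sending an Exit message
end
```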
Possibly useful for this feature: the forthcoming Julia 1.7 includes the ability to migrate tasks between threads: https://github.com/JuliaLang/julia/pull/40715
I don't understand the code of the implementation there, but it seems to enable thread-local storage to be tracked with each Task? Regardless, this change may be of use for checkpointing or for actions taken by supervisors.