Actors.jl
Implement basic checkpointing
Now, with basic error handling in place (see issue #16 and the description in the manual), there remains the issue of maintaining (saving and restoring) actor state at termination and restart.
For actor and task restart (by supervisors), `checkpoint` and `restore` are an important option, so that actor state can be restored at restart.
Actor initialization and termination with user-defined callback functions
- [x] develop `init!` functionality,
- [x] develop `term!` functionality,
- [x] implement the restart strategy described below.
User-defined checkpointing:
- [x] basic `checkpointing` actor,
- [x] `checkpoint` call,
- [x] `restore` call,
- [x] checkpointing hierarchy,
- [x] checkpointing interval for 2nd level.
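A minimal sketch of the intended semantics of this checklist, with the checkpointing state reduced to a plain struct (the names follow the checklist above; the signatures are assumptions, not the final API):

```julia
# Hypothetical sketch, not the implemented API: the behavior of a basic
# `checkpointing` actor keeping the latest checkpoints in a Dict.
struct Checkpointing
    data::Dict{Symbol,Any}
end
Checkpointing() = Checkpointing(Dict{Symbol,Any}())

# `checkpoint` stores a deep copy so that later mutation of the original
# state cannot corrupt the saved checkpoint.
checkpoint!(c::Checkpointing, key::Symbol, state) = (c.data[key] = deepcopy(state); nothing)

# `restore` delivers the last checkpoint for `key`, or `nothing`.
restore(c::Checkpointing, key::Symbol) = get(c.data, key, nothing)
```

With `Actors` this state would live inside a spawned actor and be addressed with messages; it is reduced to plain functions here only to show the intended semantics.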
Integration
- [ ] tests,
- [ ] documentation,
- [ ] examples.
There are multiple ways to realize checkpointing with `Actors`:

- explicit, user-defined checkpointing based on the functionality mentioned above,
- use checkpoints in `init!` and `term!` callback functions,
- saving intermediate results to a guarded variable, where the `:guard` actor does the checkpointing,
- checkpoints served by a `:genserver` actor,
- function memoization saved to a guarded/checkpointed `Dict`,
- ...
This leads to the impression that only the basic checkpointing functionality (items 1 and 2) should go into `Actors`, while the more sophisticated approaches should be realized in a framework library like Checkpoints.jl. Maybe people working with HPC like @oschulz could comment on this.
Init/terminate and restart strategy
Checkpointing must be integrated with the `init`, `term` and `restart` callbacks:
- A user can provide a termination callback taking a `checkpoint` of an actor's acquaintance variables upon abnormal actor exit.
- A user can provide an init and/or restart callback to look for a checkpoint and possibly `restore` it before actor start.
- Then a monitor can be specified to call an init or restart `action` on actor termination.
- Supervisors should do the following for actor restart:
    - if specified, call a `restart` function (must return a link to a started actor),
    - else, if specified, call an `init` function (must return a link to a started actor),
    - else `spawn` an actor with the last behavior.
In any case, to enable the 4th point, a failed actor's variables must be delivered to the supervisor on abnormal exit (already done).
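A hedged sketch of that restart precedence (the `Child` struct and its fields are assumptions for illustration; only `spawn` is actual Actors.jl vocabulary):

```julia
using Actors

# Sketch only: how a supervisor could select the restart path listed above.
struct Child
    bhv::Any                            # last behavior of the failed actor
    init::Union{Function,Nothing}       # optional init callback
    restart::Union{Function,Nothing}    # optional restart callback
end

function restart_child(c::Child)
    if c.restart !== nothing
        c.restart()        # must return a link to a started actor
    elseif c.init !== nothing
        c.init()           # must return a link to a started actor
    else
        spawn(c.bhv)       # else spawn an actor with the last behavior
    end
end
```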
Enable multilevel checkpointing
In-memory checkpointing has been demonstrated to be fast and scalable. In addition, multilevel checkpointing technologies ... are increasingly popular. Multilevel checkpointing involves combining several storage technologies (multiple levels) to store the checkpoint, each level offering specific overhead and reliability trade-offs. The checkpoint is first stored in the highest performance storage technology, which generally is the local memory or a local SSD. This level supports process failure but not node failure. The second level stores the checkpoint in remote memory or a remote SSD. This second level supports a single-node failure. The third level corresponds to encoding the checkpoints in blocks and distributing the blocks across multiple nodes. This approach supports multinode failures. Generally, the last level of checkpointing is still the parallel file system. This last level supports catastrophic failures such as a full system outage. (see: D6.1 Preliminary Report on State of the Art Software-Based Fault Tolerance Techniques)
The `checkpointing` actor should support multilevel checkpointing as a combination of user-level and application-level checkpointing. The concept is as follows:

- The user actor takes application-specific `checkpoint`s to a `checkpointing` actor running on the same node (under the same `supervisor`). This usually is an in-memory checkpoint.
- Then a 2nd-level `checkpoint` is taken at regular intervals by an aggregating `checkpointing` actor residing on another node. The aggregation is done from 1st-level or other 2nd-level actors.
- The third level is the organization of the checkpointing and restart/restore hierarchy.

The first two levels should be realized in `Actors`. The third level can then be realized in a separate framework library Checkpointing.jl.
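An illustrative sketch of the first two levels, with `Dict`s standing in for the checkpointing actors (all names here are assumptions, not the implemented API):

```julia
# Level 1: in-memory store on the user actor's node; level 2: aggregate
# on another node. Dicts stand in for the checkpointing actors.
level1 = Dict{Symbol,Any}()
level2 = Dict{Symbol,Any}()

# the user actor takes application-specific checkpoints to level 1
take_checkpoint!(key::Symbol, state) = (level1[key] = deepcopy(state))

# a 2nd-level checkpoint is taken at regular intervals by aggregating
# the current 1st-level checkpoints
function aggregate!(interval)
    @async while true
        sleep(interval)
        merge!(level2, deepcopy(level1))
    end
end
```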
Multilevel checkpointing mechanism
- 1st level checkpoints must be inexpensive (in-memory). Thus they can be taken frequently. ¹
- 2nd level checkpoints can consist of arbitrary hierarchy levels.
- They are usually triggered and/or saved at regular intervals (can be parametrized for each actor).
- When they are triggered, they collect the current checkpoints from all inferior checkpointing actors.
- 2nd level checkpoints and save routines can also be triggered from inferior actors. In that case such requests are propagated upwards in the checkpointing hierarchy until the required level is reached.
- Therefore the `checkpoint` function must have a `level` argument.
- The `restore` function also must have a `level` argument. The restoring then is done from the chosen level.
---

¹ 1st level checkpoints are triggered by a user actor. Automatic checkpointing at the 1st level must wait until Julia Atomics is ready.
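A sketch of how the `level` argument could behave, again with `Dict`s standing in for the actors of the hierarchy (hypothetical signatures, not the final API):

```julia
# Hypothetical signatures: `level` selects how far up the hierarchy a
# checkpoint is propagated and from which level a restore is read.
const levels = [Dict{Symbol,Any}() for _ in 1:3]

function checkpoint(key::Symbol, state; level::Int=1)
    for l in 1:level                      # the request propagates upwards
        levels[l][key] = deepcopy(state)  # until the required level is reached
    end
end

restore(key::Symbol; level::Int=1) = get(levels[level], key, nothing)
```

Usage would then look like `checkpoint(:acc, 42; level=2)` and `restore(:acc; level=2)`.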
Checkpointing intervals
`Update` and `Save` intervals must be configurable, and it must also be possible to turn automatic checkpointing on and off.
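This might be as simple as a small parameter struct per checkpointing actor (field names are assumptions for illustration):

```julia
# Sketch of per-actor checkpointing parameters; `on` switches automatic
# checkpointing on and off.
mutable struct CheckpointInterval
    update::Float64   # seconds between collecting inferior checkpoints
    save::Float64     # seconds between saving to the next level
    on::Bool          # automatic checkpointing enabled?
end

cfg = CheckpointInterval(10.0, 60.0, true)
cfg.on = false        # turn automatic checkpointing off
```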
Detect and handle node failures
Supervisors looking after actors on other workers must detect node failures and handle them appropriately.
- They start a task that
    - polls `isready` on the `RemoteChannel` of the supervised actor and
    - sends a `ProcessExitedException` as an `Exit` message to the supervisor.
- They treat those exceptions differently from actor `Exit`s since they don't contain actor state.
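A minimal sketch of such a polling task, assuming a stand-in `Exit` wrapper and a plain `Channel` in place of the supervisor's mailbox:

```julia
using Distributed

struct Exit                      # stand-in for Actors' Exit message
    reason::Exception
end

# Poll isready on the supervised actor's RemoteChannel; if the hosting
# process has exited, deliver the exception as an Exit to the supervisor.
function watch_remote(rc::RemoteChannel, mailbox::Channel)
    @async while true
        try
            isready(rc)          # throws ProcessExitedException on node failure
            sleep(1)
        catch exc
            exc isa ProcessExitedException || rethrow()
            put!(mailbox, Exit(exc))
            break
        end
    end
end
```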
Tasks on other nodes must have a supervisor actor on their node. That one can be supervised by a remote supervisor as described above.
> Tasks on other nodes must have a supervisor actor on their node. That one can be supervised by a remote supervisor as described above.
This is not a good strategy to work with! The consequence of that requirement would be to have two supervisor levels where we need only one. Therefore:
Introduce an actor for node supervision
If a supervisor gets a child on another node, it starts another helper child, responsible for regularly scanning the connections to foreign actors. If a connection gives a `ProcessExitedException`, the helper informs the supervisor about the failed `PID` so that it can act accordingly and restart the foreign actors on another node. The helper must send only one `Exit` message to the supervisor for a failed `PID`, even if that `PID` hosts multiple child actors.
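The "only one `Exit` per failed `PID`" rule could be a simple set-based deduplication in the helper (a sketch; `mailbox` again stands in for the supervisor's link):

```julia
using Distributed

const reported = Set{Int}()      # worker pids already reported as failed

# Inform the supervisor exactly once per failed pid, even if several
# supervised children were hosted on that worker.
function report_failure!(mailbox::Channel, pid::Int, exc::ProcessExitedException)
    pid in reported && return
    push!(reported, pid)
    put!(mailbox, exc)           # stand-in for sending an Exit message
end
```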
Possibly useful for this feature: the forthcoming Julia 1.7 includes the ability to migrate tasks between threads: https://github.com/JuliaLang/julia/pull/40715
I don't understand the code of the implementation there, but it seems to enable thread-local storage to be tracked with each Task? Regardless, this change may be of use for checkpointing or for actions taken by supervisors.