transient-universe

Job control to continue, restart, cancel distributed jobs

Open agocorona opened this issue 5 years ago • 2 comments

A jobcontrol primitive that takes Cloud computations as parameters and lets the user decide what to do when an unhandled exception occurs: continue after the problem is fixed (using the log facility), restart from scratch, or cancel the job. The control works regardless of the node on which the computation is currently executing.

For example, a program might invoke a distributed facility that is not running or not installed. The user can retry and, if it keeps failing, check whether the facility is down, has failed, or is not installed. They can then stop the computation, install the facility, and resume execution without re-executing possibly heavy tasks already completed at that point.

Messages to the user will appear in the console of the node that initiated the cloud computation and will be managed with console primitives like option and input.
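The continue/restart/cancel behaviour described above can be sketched roughly as follows. This is only an illustration in plain `IO` using `base`, not the transient-universe API and not a distributed implementation: the names `Decision`, `Step`, `JobLog` and `jobControl` are hypothetical, the log is an in-memory `IORef` rather than the log facility, and the `ask` callback stands in for the console prompt on the initiating node.

```haskell
import Control.Exception (SomeException, try)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef, writeIORef)

-- | What the user chooses after an unhandled exception (hypothetical type).
data Decision = Continue | Restart | Cancel
  deriving (Eq, Show)

-- | A named step of a job; the name is the key used in the log.
type Step = (String, IO String)

-- | Log of finished steps: step name paired with its recorded result.
type JobLog = [(String, String)]

-- | Run the steps in order. Finished steps are recorded in the log and are
-- skipped when the job continues after a failure. Returns 'Nothing' if the
-- user cancels, otherwise the results in step order.
jobControl
  :: (String -> SomeException -> IO Decision)  -- ^ stand-in for the console prompt
  -> IORef JobLog
  -> [Step]
  -> IO (Maybe [String])
jobControl ask logRef steps = go steps
  where
    -- All steps handled: collect the logged results in step order.
    go [] = do
      done <- readIORef logRef
      pure (mapM (\(name, _) -> lookup name done) steps)
    go ((name, action) : rest) = do
      done <- readIORef logRef
      case lookup name done of
        Just _  -> go rest  -- already logged: do not re-execute
        Nothing -> do
          outcome <- try action :: IO (Either SomeException String)
          case outcome of
            Right result -> do
              modifyIORef' logRef ((name, result) :)
              go rest
            Left err -> do
              decision <- ask name err
              case decision of
                -- Retry the failed step; earlier results stay in the log.
                Continue -> go ((name, action) : rest)
                -- Discard the log and run every step again.
                Restart  -> writeIORef logRef [] >> go steps
                Cancel   -> pure Nothing
```

With this shape, a job whose second step fails because a service is down can be resumed after the service is installed, and the first (possibly heavy) step is read from the log instead of being run again. A real version would presumably keep the log on persistent storage so that it survives stopping the program.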

This comes from cloudshell

agocorona, Jan 28 '20 21:01

Also, add an option to view the execution log of each job online.

agocorona, Jan 29 '20 08:01

The motivation is that, although job control already exists through services (see the executor service), it is by design unable to execute a sequence of distributed computations and optionally restart or continue them on failure. For some heavy processes it is good to log/cache results and avoid re-executing what was already done.

agocorona, Jan 29 '20 14:01