maestrowf
Resume from failure point
I have noticed that the execution graph is pickled. Is there a plan for a way to resume a study execution from its last point of failure?
Can you clarify what you mean by last point of failure? I can think of two points of failure here:
- When steps fail -- do you want to be able to reload a study directory and attempt to relaunch failed jobs?
- If the conductor fails -- This one is a little more complicated, depending on how long a scheduler retains data on job state. The cases are listed below (I initially read the question this way, so this one is more fleshed out):
  - If a step hasn't begun, check whether it's ready to be executed. If it is, execute it.
  - If a step failed, there is nothing to be done. Make sure that its children have also been set to failed. (Alternatively, some simulations rely on random number generation, so a restart could yield a valid run.)
  - If a step is complete, check its dependent jobs.
  - If a step was running, it's complicated. Check whether the last known job ID is still running. This check is complicated because the history retained by the scheduler may be too short to show how the step completed.
    - If the step is still running, the check is easy enough and we can proceed to track it as normal.
    - If the scheduler no longer has a record of the job (schedulers do not keep job information for prolonged periods; at least that's the case for SLURM), we have two options: (1) assume it failed because we can't see it, or (2) attempt a restart, even though we don't know whether the previous run was a failure.
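To make the conductor-failure cases above concrete, here is a minimal Python sketch of the decision logic. It is only an illustration: `StepState`, `Action`, and `recovery_action` are hypothetical names, not part of Maestro's API, and `scheduler_status=None` is an assumed convention meaning "the scheduler no longer has a record of the job".

```python
from enum import Enum
from typing import Optional


class StepState(Enum):
    """Last state recorded for a step before the conductor died (hypothetical)."""
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"


class Action(Enum):
    """What a restarted conductor should do with the step (hypothetical)."""
    SUBMIT = "submit"
    WAIT = "wait"
    TRACK = "track"
    CHECK_CHILDREN = "check_children"
    FAIL_CHILDREN = "fail_children"
    MARK_FAILED = "mark_failed"
    RESUBMIT = "resubmit"


def recovery_action(
    state: StepState,
    parents_done: bool,
    scheduler_status: Optional[StepState] = None,
    restart_unknown: bool = False,
) -> Action:
    """Pick a recovery action for one step.

    scheduler_status is what the scheduler reports for the last known job ID,
    or None if the job has aged out of the scheduler's history (assumption).
    restart_unknown selects option 2 (restart) over option 1 (assume failure).
    """
    if state is StepState.PENDING:
        # Not started yet: run it only if its dependencies are satisfied.
        return Action.SUBMIT if parents_done else Action.WAIT
    if state is StepState.FAILED:
        # Nothing to redo; make sure the failure propagates to children.
        return Action.FAIL_CHILDREN
    if state is StepState.FINISHED:
        return Action.CHECK_CHILDREN

    # state is RUNNING: ask the scheduler about the last known job ID.
    if scheduler_status is None:
        # The scheduler forgot the job, so the outcome is unknown.
        return Action.RESUBMIT if restart_unknown else Action.MARK_FAILED
    if scheduler_status is StepState.FINISHED:
        return Action.CHECK_CHILDREN
    if scheduler_status is StepState.FAILED:
        return Action.FAIL_CHILDREN
    # Still queued or running: keep tracking it as normal.
    return Action.TRACK
```

For example, `recovery_action(StepState.RUNNING, True, scheduler_status=None)` returns `Action.MARK_FAILED` under option 1, and `Action.RESUBMIT` when `restart_unknown=True` is passed.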
@jwhite242 the Hermit team is interested in something like that.
@doutriaux1
What kinds of needs are there for the 'restart' behavior? There are a few cases here, I think:
- simply launch a restart script if one is provided in the spec already
- a facility for adding a restart script to the spec mid-study and generating one
- what about editing inputs/things in the workspace? i.e. the common case when a simulation fails and its input config needs slight tweaks that must be preserved (rerunning the original step script would likely blow these away)
- an option to completely start over instead of resuming from a checkpoint? (and if so, keep the old workspace for provenance, or blow it away)
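For the "start over but keep the old workspace for provenance" case, one possible approach is to rename the existing step workspace with a timestamp suffix and create a fresh, empty one. The sketch below is an assumption about how that could look, not Maestro behavior; `archive_workspace` and the suffix format are hypothetical.

```python
import time
from pathlib import Path


def archive_workspace(workspace: Path) -> Path:
    """Move an existing step workspace aside and recreate it empty.

    Returns the path of the archived copy. The naming scheme
    (<name>.<timestamp>[.<n>]) is an illustrative choice.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archived = workspace.with_name(f"{workspace.name}.{stamp}")
    counter = 1
    # Avoid clobbering an earlier archive made within the same second.
    while archived.exists():
        archived = workspace.with_name(f"{workspace.name}.{stamp}.{counter}")
        counter += 1
    workspace.rename(archived)  # old inputs and outputs are preserved here
    workspace.mkdir()           # fresh workspace for the new attempt
    return archived
```

Any hand-edited input files stay in the archived directory, so they can be compared against or copied into the fresh workspace before the step is rerun.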