ganga
Investigate if jobs can enter monitoring while in submitting stage
When a master job is submitting, it can take a very long time to submit the subjobs for certain remote backends (e.g. several hours for around 3000 subjobs). At the moment, the subjobs are not monitored during this period, so if some have already finished, we effectively have dead time in the system. A further benefit is that if a job submission is terminated because the Ganga process is killed, at least the already submitted subjobs will be recoverable. The current policy of reverting the job to the `new` status after a failed submission should probably be changed to make this work.
@abhijeetsharma200 See further information here
At the moment the behaviour around submission and monitoring is the following:
- On submission, a job is split into subjobs. Then, if `keep_going` is `True`, ganga will attempt to submit all the subjobs, even if there are some failures along the way. The failed submissions will be left in the `submitting` state.
- The overall state of a job is determined from the status of all subjobs. If a single subjob is in `submitting`, the complete job will be declared as `submitting` (see full status calculation).
- Master jobs in `submitting` status are not monitored. The consequence is that monitoring will not start until all subjobs are submitted (which can take well above 1 hour), and if a single subjob submission fails, the job will never be monitored.
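The interaction of these three points can be illustrated with a small toy model. This is only a sketch, not Ganga's actual status calculation: the function names `master_status` and `is_monitored`, the priority order after `submitting`, and the set of monitored states are all assumptions made for illustration.

```python
from typing import Iterable

# Assumed priority order for the roll-up; only the fact that 'submitting'
# wins over everything else is taken from the description above.
PRIORITY = ("submitting", "running", "submitted", "failed", "completed")

# Assumed set of master-job states that the monitoring loop would pick up.
MONITORED = {"submitted", "running"}


def master_status(subjob_statuses: Iterable[str]) -> str:
    """Toy roll-up of subjob statuses into one master-job status."""
    statuses = set(subjob_statuses)
    for state in PRIORITY:
        if state in statuses:
            return state
    return "new"


def is_monitored(status: str) -> bool:
    """Toy check of whether a master job in this status gets monitored."""
    return status in MONITORED


# 2999 subjobs are already running, one submission failed and was left
# in 'submitting': the master job stays 'submitting' and is never monitored.
stuck = master_status(["running"] * 2999 + ["submitting"])
print(stuck, is_monitored(stuck))
```

In this model a single subjob stuck in `submitting` is enough to keep the whole job out of monitoring, which is the problem described above.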
I think we want a few changes in behaviour.
- Subjobs that fail to submit should be put into the `failed` state rather than left in `submitting`.
- We should change it such that subjobs start to be monitored even while other subjobs are not yet submitted. This code seems to indicate that it is already the case, but I do not think it is. Some careful debugging might be required to understand.
- The `submitting` status is a transient status. So if the `ganga` process has been killed, then on startup, all subjobs in the `submitting` status should be changed to `failed`.
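The first and third changes could look roughly like the sketch below. It does not use Ganga's real classes: `SubJob`, `submit_all`, `recover_on_startup` and the `submit_one` callback are hypothetical names used only to show the intended state transitions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubJob:
    """Hypothetical stand-in for a subjob; not Ganga's Job class."""
    id: int
    status: str = "new"


def submit_all(subjobs: List[SubJob],
               submit_one: Callable[[SubJob], None],
               keep_going: bool = True) -> None:
    """Submit each subjob, marking failed submissions as 'failed'."""
    for sj in subjobs:
        sj.status = "submitting"
        try:
            submit_one(sj)
        except Exception:
            # Proposed change: do not leave the subjob in 'submitting'.
            sj.status = "failed"
            if not keep_going:
                raise
        else:
            sj.status = "submitted"


def recover_on_startup(subjobs: List[SubJob]) -> List[int]:
    """Turn leftover transient 'submitting' states into 'failed'.

    Returns the ids of the subjobs that were changed.
    """
    changed = []
    for sj in subjobs:
        if sj.status == "submitting":
            sj.status = "failed"
            changed.append(sj.id)
    return changed


def flaky_submit(sj: SubJob) -> None:
    """Example callback that fails on command for subjob 1."""
    if sj.id == 1:
        raise RuntimeError("backend rejected the submission")


jobs = [SubJob(i) for i in range(3)]
submit_all(jobs, flaky_submit)
print([sj.status for sj in jobs])
```

With these two rules in place, no subjob can remain in `submitting` once a submission loop has ended or a new session has started.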
I think the first step will be to make a set of tests where you can get subjobs to fail on command and can get subjobs to submit very slowly, as a way of testing whether monitoring starts while submission is still in progress. The TestSubmitter is a dummy backend that can be used for this.
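As a rough idea of what such a test could check, here is a self-contained sketch using plain threads. It does not use the TestSubmitter or any Ganga API: `FakeBackend`, `run_scenario`, the `delay` and `fail_ids` parameters and the status strings are all hypothetical stand-ins.

```python
import threading
import time


class FakeBackend:
    """Hypothetical slow backend that can fail on command."""

    def __init__(self, delay=0.02, fail_ids=()):
        self.delay = delay
        self.fail_ids = set(fail_ids)

    def submit(self, subjob_id):
        time.sleep(self.delay)  # simulate a slow remote submission
        if subjob_id in self.fail_ids:
            raise RuntimeError(f"subjob {subjob_id} failed on command")


def run_scenario(n_subjobs=5, delay=0.02, fail_ids=(2,)):
    backend = FakeBackend(delay, fail_ids)
    status = {i: "new" for i in range(n_subjobs)}
    lock = threading.Lock()
    submission_done = threading.Event()
    # Subjobs picked up by the monitor before it saw submission finish.
    monitored_early = []

    def submitter():
        for i in range(n_subjobs):
            with lock:
                status[i] = "submitting"
            try:
                backend.submit(i)
                new_state = "submitted"
            except RuntimeError:
                new_state = "failed"
            with lock:
                status[i] = new_state
        submission_done.set()

    def monitor():
        while True:
            finished = submission_done.is_set()
            with lock:
                for i, state in status.items():
                    if state == "submitted":
                        status[i] = "completed"  # pretend the job ran
                        if not finished:
                            monitored_early.append(i)
            if finished:
                break
            time.sleep(delay / 10)

    threads = [threading.Thread(target=submitter),
               threading.Thread(target=monitor)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status, monitored_early


final_status, early = run_scenario()
print(final_status, early)
```

A real test would replace the fake monitor with Ganga's monitoring loop and assert that some subjobs change state before the last submission returns, and that the subjob made to fail ends up in `failed`.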