Investigate if jobs can enter monitoring while in submitting stage

Open egede opened this issue 2 years ago • 2 comments

When a master job is submitting, it can take a very long time to submit the subjobs for certain remote backends (e.g. several hours if there are around 3000 subjobs). At the moment, the subjobs are not monitored during this period, so if some have already finished, we effectively have dead time in the system. Another benefit is that if a job submission is terminated because the Ganga process gets killed, at least the already submitted subjobs will be recoverable. The current policy of reverting the job to the new status when a submission fails should probably be changed to make this work.

egede, Feb 27 '23 10:02

@abhijeetsharma200 See further information here

At the moment, the behaviour around submission and monitoring is the following:

  • On submission, a job is split into subjobs. Then, if keep_going is True, Ganga will attempt to submit all the subjobs, even if there are some failures along the way. The failed submissions will be left in the submitting state.
  • The overall state of a job is determined from the status of all subjobs. If a single subjob is in submitting, the complete job will be declared as submitting (see full status calculation).
  • Master jobs in submitting status are not monitored. The consequence is that monitoring will not start until all subjobs are submitted (can take well above 1 hour) and if a single subjob submission fails, the job will never be monitored.
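The effect described above can be sketched in a few lines. This is a simplified model, not Ganga's actual status calculation: the names `PRECEDENCE`, `master_status` and `is_monitored` are hypothetical, and the ordering of the statuses after `submitting` is an assumption. The only behaviour taken from the description is that a single subjob in `submitting` makes the whole job `submitting`, and that such a job is not monitored.

```python
# Hypothetical precedence list: only "submitting wins" is taken from the
# description above; the order of the remaining statuses is an assumption.
PRECEDENCE = ["submitting", "submitted", "running", "completing",
              "failed", "killed", "completed"]


def master_status(subjob_statuses):
    """Return the first status in PRECEDENCE that any subjob currently has."""
    for status in PRECEDENCE:
        if status in subjob_statuses:
            return status
    return "new"


def is_monitored(status):
    """Model of the current policy: a job in 'submitting' is not monitored."""
    return status in {"submitted", "running", "completing"}


# 2999 finished subjobs and one stuck submission: the master job still
# reports 'submitting' and therefore never enters monitoring.
statuses = ["completed"] * 2999 + ["submitting"]
print(master_status(statuses), is_monitored(master_status(statuses)))
```

In this model, one subjob whose submission failed and was left in `submitting` is enough to keep the whole job out of monitoring indefinitely.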

I think we want a few changes in behaviour.

  • Subjobs that fail to submit should be put into the failed state rather than left in submitting.
  • We should change it such that subjobs start to be monitored even while other subjobs are not yet submitted. This code seems to indicate that it is already the case, but I do not think it is. Some careful debugging might be required to understand.
  • The submitting status is a transient status. So if the Ganga process has been killed, then on startup, all subjobs still in the submitting status should be changed to failed.
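As a rough illustration of the first and third points, the sketch below shows a submission loop that marks failed submissions as `failed`, plus a startup pass that treats any leftover `submitting` status as stale. `SubJob`, `submit_all` and `recover_on_startup` are hypothetical names for this sketch and are not part of Ganga's API.

```python
from dataclasses import dataclass


@dataclass
class SubJob:
    """Minimal stand-in for a subjob; not Ganga's Job class."""
    id: int
    status: str = "new"


def submit_all(subjobs, submit_one, keep_going=True):
    """Submit each subjob, marking failures as 'failed' instead of leaving
    them in 'submitting'."""
    for sj in subjobs:
        sj.status = "submitting"
        try:
            submit_one(sj)
            sj.status = "submitted"
        except Exception:
            sj.status = "failed"
            if not keep_going:
                break  # remaining subjobs stay in 'new'


def recover_on_startup(subjobs):
    """'submitting' is transient: anything still in it after a restart was
    interrupted, so mark it 'failed'. Returns the ids that were changed."""
    changed = []
    for sj in subjobs:
        if sj.status == "submitting":
            sj.status = "failed"
            changed.append(sj.id)
    return changed
```

With this shape, the already submitted subjobs keep their `submitted` status after a crash and only the interrupted ones need to be resubmitted.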

egede, May 01 '23 03:05

I think the first step will be to make a set of tests where you can get subjobs to fail on command and can get subjobs to submit very slowly as a way of testing if monitoring is starting at the same time. The TestSubmitter is a dummy backend that can be used for this.
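A test along these lines could look roughly like the sketch below. It does not use Ganga or the real TestSubmitter: `SlowFlakyBackend`, `run_submission`, `monitor` and `run_scenario` are hypothetical stand-ins that only show the shape of the check, namely a backend that is slow and fails on command, and a monitoring loop running in parallel that records what it sees.

```python
import threading
import time


class SlowFlakyBackend:
    """Hypothetical dummy backend: slow submission, failures on command."""

    def __init__(self, delay=0.05, fail_ids=()):
        self.delay = delay
        self.fail_ids = set(fail_ids)

    def submit(self, subjob_id):
        time.sleep(self.delay)  # simulate a slow remote submission
        if subjob_id in self.fail_ids:
            raise RuntimeError(f"forced failure for subjob {subjob_id}")


def run_submission(statuses, backend, lock):
    """Submit subjobs one by one, updating the shared status table."""
    for sj_id in list(statuses):
        with lock:
            statuses[sj_id] = "submitting"
        try:
            backend.submit(sj_id)
            new_status = "submitted"
        except RuntimeError:
            new_status = "failed"
        with lock:
            statuses[sj_id] = new_status


def monitor(statuses, lock, stop, snapshots, poll=0.005):
    """Record a snapshot of all subjob statuses until asked to stop."""
    while not stop.is_set():
        with lock:
            snapshots.append(dict(statuses))
        time.sleep(poll)


def run_scenario(n=4, fail_ids=(2,)):
    statuses = {i: "new" for i in range(n)}
    lock, stop, snapshots = threading.Lock(), threading.Event(), []
    watcher = threading.Thread(target=monitor,
                               args=(statuses, lock, stop, snapshots))
    watcher.start()
    run_submission(statuses, SlowFlakyBackend(fail_ids=fail_ids), lock)
    stop.set()
    watcher.join()
    return statuses, snapshots
```

The assertion of interest is that at least one snapshot contains a `submitted` subjob while another is still `new` or `submitting`, which is the evidence that monitoring overlapped with submission.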

egede, May 01 '23 03:05