
Feature request in batch system api: status [Pending, Submitted, Error, Completed, etc] of running jobs from a given workflow.

Open kannon92 opened this issue 4 years ago • 1 comments

We are interested in adopting toil as our middleware for scheduling workflows on HPC systems because it lets us use multiple HPC systems with a single cli.

Currently, toil can launch jobs on the head node and they are scheduled for execution. We are trying to build a REST API that monitors individual jobs that are sent to these clusters. We can't use toil to get the status of these jobs. We have to resort to using the status commands for the batch systems.

It would be ideal if we had something like toil status workflow.cwl that would tell you which jobs are running, pending, or completed for that workflow.

I'm open to suggestions on what should be returned for each job, but my initial thought for HPC systems is:

jobId (In HPC system), jobName, state
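
To make the proposed fields concrete, here is a minimal sketch of what one returned record could look like. The `JobState` and `JobStatus` names and the JSON layout are hypothetical and not part of toil; the state names are taken from the issue title, plus a running state.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import List


class JobState(Enum):
    """Hypothetical states, mirroring the ones named in this request."""
    PENDING = "Pending"
    SUBMITTED = "Submitted"
    RUNNING = "Running"
    ERROR = "Error"
    COMPLETED = "Completed"


@dataclass(frozen=True)
class JobStatus:
    """One record per job: the three fields proposed above."""
    job_id: str       # ID assigned by the HPC system
    job_name: str
    state: JobState


def to_json(statuses: List[JobStatus]) -> str:
    """Serialize records in the shape a REST API could return."""
    return json.dumps(
        [
            {"jobId": s.job_id, "jobName": s.job_name, "state": s.state.value}
            for s in statuses
        ]
    )
```

For example, `to_json([JobStatus("12345", "align", JobState.RUNNING)])` yields a one-element JSON list with `jobId`, `jobName`, and `state` keys.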

┆Issue is synchronized with this Jira Story ┆friendlyId: TOIL-1073

kannon92 avatar Nov 08 '21 15:11 kannon92

We think we might be able to use Toil's MessageBus to make this work, since information about the backing HPC jobs doesn't get stored in the JobStore.

We would need to:

  • [ ] Get the information about what HPC job corresponds to what Toil job into the MessageBus (maybe through JobAnnotationMessage like the AWS Batch batch system uses for attaching its backing IDs).
  • [ ] Generate some JobRunningMessages on the MessageBus when the Leader calls batchSystem.getRunningBatchJobIDs() and sees a new one.
  • [ ] Get the information out of the MessageBus and into something like toil status, alongside or instead of what can be read from the job store. This might involve duplicating or refactoring out some of the message bus replay code from the WES server.
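
The three steps above can be sketched roughly as follows. This is a self-contained toy, not Toil's MessageBus: the `Annotation` and `Running` classes, the `Leader.poll` method, and the `replay` function are all hypothetical stand-ins, and the bus is just a Python list.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Union


@dataclass(frozen=True)
class Annotation:
    """Stand-in for a message linking a Toil job to its backing HPC job."""
    toil_job: str
    hpc_job_id: str


@dataclass(frozen=True)
class Running:
    """Stand-in for a message saying a job was seen running."""
    toil_job: str


Message = Union[Annotation, Running]


class Leader:
    """Toy leader that reports newly running jobs on the bus."""

    def __init__(self, bus: List[Message]) -> None:
        self.bus = bus
        self.seen_running: Set[str] = set()

    def poll(self, running_now: Set[str]) -> None:
        # Step 2: emit one Running message the first time a job shows up.
        for job in sorted(running_now - self.seen_running):
            self.bus.append(Running(job))
        self.seen_running |= running_now


def replay(bus: List[Message]) -> Dict[str, Dict[str, str]]:
    """Step 3: rebuild a per-job status view by replaying the messages."""
    status: Dict[str, Dict[str, str]] = {}
    for msg in bus:
        entry = status.setdefault(
            msg.toil_job, {"hpc_job_id": "", "state": "issued"}
        )
        if isinstance(msg, Annotation):
            entry["hpc_job_id"] = msg.hpc_job_id  # Step 1's information
        elif isinstance(msg, Running):
            entry["state"] = "running"
    return status
```

Because the status view is derived only from replayed messages, a status command could build it without asking the batch system directly.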

adamnovak avatar Oct 04 '22 22:10 adamnovak

➤ Adam Novak commented:

Idea: keep workflow message bus logs in files, and ID them by the workflow run ID.
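
One way this idea could look, as a minimal sketch: one append-only file per workflow run, named after the run ID. The directory layout, the `.jsonl` extension, and the function names are assumptions for illustration, not how Toil stores message bus logs.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def bus_log_path(log_dir: Path, run_id: str) -> Path:
    """Hypothetical layout: one JSON-lines file per workflow run ID."""
    return log_dir / f"{run_id}.jsonl"


def append_message(log_dir: Path, run_id: str, message: Dict[str, str]) -> None:
    """Append one message to the log for the given run."""
    log_dir.mkdir(parents=True, exist_ok=True)
    with bus_log_path(log_dir, run_id).open("a", encoding="utf-8") as f:
        f.write(json.dumps(message) + "\n")


def read_messages(log_dir: Path, run_id: str) -> Iterator[Dict[str, str]]:
    """Yield the messages for a run in the order they were written."""
    path = bus_log_path(log_dir, run_id)
    if not path.exists():
        return
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Keying the file by run ID means a status tool only needs the run ID to find and replay the right log.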

unito-bot avatar Oct 14 '22 16:10 unito-bot

Getting this to work for SlurmBatchSystem and AbstractGridEngineBatchSystem is a little tough, because we don't have access to the job_store_id for the job in the places where we have access to the Slurm- or other HPC-assigned ID.

@Hexotical thinks he can pull the job_store_id through along with the other job attributes. But the work is complicated by the fact that a lot of the variable names and comments in the AbstractGridEngineBatchSystem are lies, or are at least different from what we call things in the rest of the code. It calls the BatchSystem-assigned ID a "Toil" jobID, it calls the HPC-scheduler-assigned ID a batchSystemID, and it doesn't actually work with the job_store_id. It also lacks good type information, and seems to think that things that really are integers might be strings.

So there might need to be some cleanup that happens first.
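
As an illustration of the kind of cleanup meant here, the three IDs could be given distinct types so a type checker catches mix-ups. This is only a sketch: the type names and the `IDRegistry` class are hypothetical, and which IDs are integers versus strings is an assumption made for the example.

```python
from typing import Dict, NewType, Optional

# Assumed representations, for illustration only.
JobStoreID = NewType("JobStoreID", str)          # the job_store_id
BatchJobID = NewType("BatchJobID", int)          # assigned by the BatchSystem
SchedulerJobID = NewType("SchedulerJobID", int)  # assigned by Slurm or similar


class IDRegistry:
    """Hypothetical helper that carries the job_store_id alongside the other IDs."""

    def __init__(self) -> None:
        self._job_store: Dict[BatchJobID, JobStoreID] = {}
        self._batch: Dict[SchedulerJobID, BatchJobID] = {}

    def issue(self, batch_id: BatchJobID, job_store_id: JobStoreID) -> None:
        # Recorded when the batch system accepts the job.
        self._job_store[batch_id] = job_store_id

    def submitted(self, batch_id: BatchJobID, scheduler_id: SchedulerJobID) -> None:
        # Recorded when the HPC scheduler hands back its own ID.
        self._batch[scheduler_id] = batch_id

    def job_store_id_for(self, scheduler_id: SchedulerJobID) -> Optional[JobStoreID]:
        """Map an HPC-assigned ID back to the job_store_id, if known."""
        batch_id = self._batch.get(scheduler_id)
        if batch_id is None:
            return None
        return self._job_store.get(batch_id)
```

With separate types, passing a scheduler ID where a BatchSystem ID is expected is flagged by a checker such as mypy instead of failing silently at runtime.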

adamnovak avatar Nov 08 '22 23:11 adamnovak