More graceful job cancellation

Open AlecThomson opened this issue 1 year ago • 4 comments

Hey all,

This is just a thought for the SLURMCluster for now (since that's what I'm familiar with) but similar options may be available in other clusters too. Currently, the cancel_command in the SLURMJob class is a bare "scancel".

https://github.com/dask/dask-jobqueue/blob/8713202488c664452bf0883bcd4f776536644676/dask_jobqueue/slurm.py#L15

This means that, even when workers are shut down completely gracefully, the Slurm job is marked as CANCELLED. If the command were instead scancel --signal=SIGTERM, the job would be marked as COMPLETED. It's possible there are cases where we would want a job to be cancelled, which complicates this somewhat.

In the simple case, however, I think this could be implemented with a simple change of cancel_command to:

class SLURMJob(Job):
    # Override class variables
    submit_command = "sbatch"
    cancel_command = "scancel --signal=SIGTERM"
    config_name = "slurm"
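
To keep a hard cancel available for the cases where a CANCELLED state is actually wanted, the command could be built from a flag rather than hard-coded. The following is only a minimal sketch of that idea: build_cancel_command, graceful and signal_name are hypothetical names for illustration and are not part of dask-jobqueue. The only Slurm behaviour it relies on is the existing --signal option of scancel.

```python
from typing import List, Sequence


def build_cancel_command(
    job_ids: Sequence[str],
    graceful: bool = True,
    signal_name: str = "SIGTERM",
) -> List[str]:
    """Build the argument list used to stop one or more Slurm jobs.

    Hypothetical helper, not part of dask-jobqueue.

    graceful=True  -> send a signal with `scancel --signal=...`, leaving
                      the job to exit on its own.
    graceful=False -> bare `scancel`, which cancels the job outright.
    """
    command = ["scancel"]
    if graceful:
        command.append(f"--signal={signal_name}")
    # Job IDs go last, after any options.
    command.extend(str(job_id) for job_id in job_ids)
    return command
```

With the default, build_cancel_command(["1234"]) gives the graceful form proposed above, and passing graceful=False falls back to the current bare scancel.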

It'd be great to get some more thoughts on the implications for this.

AlecThomson avatar May 22 '24 12:05 AlecThomson

This sounds like a great improvement. Do you have any interest in making a PR to add this option?

jacobtomlinson avatar May 24 '24 08:05 jacobtomlinson

Happy to! Just wanted to check in to make sure there wouldn't be any more hidden gotchas

AlecThomson avatar May 24 '24 09:05 AlecThomson

Hi! This also sounds perfectly acceptable to me. I don't think there is any case in which we would really want a CANCELLED status! Thanks for proposing this. I think it might be possible with other schedulers too!

guillaumeeb avatar May 24 '24 14:05 guillaumeeb

I see something like this was added for the HTCondor class in #411 and #514. I'll attempt to generalise it.

AlecThomson avatar May 25 '24 01:05 AlecThomson