More graceful job cancellation
Hey all,
This is just a thought for the SLURMCluster for now (since that's what I'm familiar with) but similar options may be available in other clusters too. Currently, the cancel_command in the SLURMJob class is a bare "scancel".
https://github.com/dask/dask-jobqueue/blob/8713202488c664452bf0883bcd4f776536644676/dask_jobqueue/slurm.py#L15
This means that, even when workers are shut down completely gracefully, the Slurm job is marked as CANCELLED. If the command were scancel --signal=SIGTERM instead, the job would be marked as COMPLETED. It's possible there are cases where we would want a job to be cancelled, which complicates this somewhat.
In the simple case, however, I think this could be implemented by changing cancel_command to:
class SLURMJob(Job):
    # Override class variables
    submit_command = "sbatch"
    cancel_command = "scancel --signal=SIGTERM"
    config_name = "slurm"
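For the case where a real cancellation is still wanted, a rough sketch of how the signal could be made optional is below. It is standalone and does not touch dask-jobqueue; the helper name build_cancel_command and the cancel_signal parameter are hypothetical, chosen only for illustration, and the only Slurm feature assumed is the existing --signal option of scancel.

```python
def build_cancel_command(job_id, cancel_signal="SIGTERM"):
    """Build the scancel invocation for one job (hypothetical helper).

    If cancel_signal is None or empty, fall back to a bare "scancel",
    which is the current behaviour described above.
    """
    command = ["scancel"]
    if cancel_signal:
        # Ask Slurm to deliver this signal rather than cancel outright.
        command.append(f"--signal={cancel_signal}")
    command.append(str(job_id))
    return command


if __name__ == "__main__":
    # Graceful variant proposed in this issue.
    print(" ".join(build_cancel_command(12345)))
    # Plain cancellation, for cases where CANCELLED is actually wanted.
    print(" ".join(build_cancel_command(12345, cancel_signal=None)))
```

Returning a list rather than a string keeps the job ID as a separate argument, so the result could be passed to something like subprocess.run without shell quoting concerns.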
It'd be great to get some more thoughts on the implications for this.
This sounds like a great improvement. Do you have any interest in making a PR to add this option?
Happy to! Just wanted to check in to make sure there wouldn't be any more hidden gotchas.
Hi! This also sounds perfectly acceptable to me. I don't think there is any case in which we would really want a CANCELLED status! Thanks for proposing this; I think it might be possible with other schedulers too!
I see something like this was added for the HTCondor class in #411 and #514. I'll attempt to generalise.