submitit
short jobs timeout immediately
There are some magic numbers in the process of determining whether a job timed out:
https://github.com/facebookincubator/submitit/blob/b9830478a6f3f8e5626a12245a3a309c0c64fb02/submitit/core/job_environment.py#L161-L168
This seems to cause short jobs to time out immediately:
[2021-10-21 11:50:15,215][submitit][INFO] - Job has timed out. Ran 0 minutes out of requested 2 minutes.
[2021-10-21 11:50:15,216][submitit][WARNING] - Caught signal SIGUSR1 on rmc-gpu05: this job is timed-out.
[2021-10-21 11:50:15,216][submitit][INFO] - Calling checkpoint method.
[2021-10-21 11:50:15,256][submitit][INFO] - Job not requeued because: timed-out too many times.
So you have a job that lasts less than 10 minutes and still gets preempted? That sounds weird. The idea here is that it is hard to know whether a job was preempted or timed out. If the job receives the signal close to the end of the job, we assume it's a timeout. Normally short jobs aren't preempted, simply because it's more work for the cluster to preempt and reschedule them than to let them finish. Is there a GraceTime on your cluster?
I guess we should at least check that max_walltime is indeed bigger than 10 minutes.
> I guess we should at least check that max_walltime is indeed bigger than 10 minutes.

Yes, I think so. All jobs with a time limit (= max_walltime) of less than 10 minutes are preempted.
We did not override the GraceTime setting on our cluster, and the default is zero seconds: https://slurm.schedmd.com/slurm.conf.html#OPT_GraceTime
I was about to implement that, but I'm having second thoughts. The idea is that (at least in our infra) it's not recommended to submit 10-minute jobs. This puts a lot of pressure on the scheduler for very little work. Slurm works best for long-running jobs.
The thing is that SLURM will send the timeout signal before the requested timeout. We also want to treat some requeue signals as timeouts if they happen very close to the timeout. That's why we use this 80% cutoff.
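
To make the failure mode concrete, here is a minimal sketch of the cutoff arithmetic. It assumes the check takes the smaller of 80% of the time limit and the time limit minus 10 minutes. That is a reading of this thread, not a verbatim copy of the linked lines. The function names and the guarded variant are hypothetical and are not part of submitit.

```python
def guaranteed_walltime(timeout_min: float) -> float:
    """Seconds of runtime after which a signal is treated as a timeout.

    Assumed formula: the smaller of 80% of the limit and
    the limit minus a fixed 10-minute margin.
    """
    max_walltime = timeout_min * 60
    return min(max_walltime * 0.8, max_walltime - 10 * 60)


def guarded_walltime(timeout_min: float) -> float:
    """Hypothetical variant: skip the 10-minute margin when it does not fit."""
    max_walltime = timeout_min * 60
    if max_walltime <= 10 * 60:
        # The fixed margin would make the threshold zero or negative,
        # so fall back to the 80% cutoff alone.
        return max_walltime * 0.8
    return min(max_walltime * 0.8, max_walltime - 10 * 60)


if __name__ == "__main__":
    for minutes in (2, 20, 60):
        print(minutes, guaranteed_walltime(minutes), guarded_walltime(minutes))
```

Under this assumed formula, a 2-minute limit gives a negative threshold, so any signal counts as a timeout from the first second, which would match the "Ran 0 minutes out of requested 2 minutes" log above. Whether short jobs should be supported at all is a separate question.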