
short jobs timeout immediately

Open MartinSmeyer opened this issue 3 years ago • 3 comments

There are some magic numbers in the process of determining whether a job timed out:

https://github.com/facebookincubator/submitit/blob/b9830478a6f3f8e5626a12245a3a309c0c64fb02/submitit/core/job_environment.py#L161-L168

This seems to cause short jobs to time out immediately:

[2021-10-21 11:50:15,215][submitit][INFO] - Job has timed out. Ran 0 minutes out of requested 2 minutes.
[2021-10-21 11:50:15,216][submitit][WARNING] - Caught signal SIGUSR1 on rmc-gpu05: this job is timed-out.
[2021-10-21 11:50:15,216][submitit][INFO] - Calling checkpoint method.
[2021-10-21 11:50:15,256][submitit][INFO] - Job not requeued because: timed-out too many times.
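
One way to see how a 2-minute job could be classified as timed out right away is the sketch below. It assumes the linked lines compute the cutoff as the smaller of 80% of the requested time and the requested time minus 10 minutes. The function names are illustrative only and are not submitit's API.

```python
def guaranteed_walltime(max_walltime_s: float) -> float:
    """Assumed cutoff: min(80% of the limit, limit minus 10 minutes)."""
    return min(max_walltime_s * 0.8, max_walltime_s - 10 * 60)


def looks_timed_out(walltime_s: float, max_walltime_s: float) -> bool:
    """A signal received at or after the cutoff is classified as a timeout."""
    return walltime_s >= guaranteed_walltime(max_walltime_s)


two_minutes = 2 * 60
# For limits under 10 minutes the second term is negative, so the cutoff is too.
print(guaranteed_walltime(two_minutes))
# Any elapsed time, even zero, is then at or past the cutoff.
print(looks_timed_out(0.0, two_minutes))
```

Under that assumption, every signal received by a job with a limit below 10 minutes would be treated as a timeout, regardless of how long the job actually ran.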

MartinSmeyer avatar Oct 21 '21 11:10 MartinSmeyer

So you have a job that lasts less than 10 minutes and that still gets preempted? That sounds weird. The idea here is that it is hard to know whether a job was preempted or timed out. If the job receives the signal close to the end of its allotted time, we assume it's a timeout. Normally short jobs aren't preempted, simply because it's more work for the cluster to preempt and reschedule them than to let them finish. Is there a GraceTime on your cluster?

I guess we should at least check that max_walltime is indeed bigger than 10 minutes.
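
A minimal sketch of such a check, assuming the cutoff is currently the smaller of 80% of the limit and the limit minus 10 minutes. The name `guarded_cutoff` is hypothetical, and falling back to the 80% rule alone for short limits is one possible choice, not something this thread settles on.

```python
TEN_MINUTES_S = 10 * 60


def guarded_cutoff(max_walltime_s: float) -> float:
    """Cutoff (in seconds) after which a signal is treated as a timeout."""
    cutoff = max_walltime_s * 0.8
    # Only subtract the 10-minute margin when the limit is longer than that,
    # so the cutoff can never become negative.
    if max_walltime_s > TEN_MINUTES_S:
        cutoff = min(cutoff, max_walltime_s - TEN_MINUTES_S)
    return cutoff


for limit_min in (2, 20, 60):
    print(limit_min, guarded_cutoff(limit_min * 60))
```

With this guard, a 2-minute job would only be classified as timed out once it has used about 80% of its limit, while longer jobs keep the existing behavior.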

gwenzek avatar Nov 09 '21 13:11 gwenzek

> I guess we should at least check that max_walltime is indeed bigger than 10 minutes.

Yes, I think so. All jobs with a time limit (= max_walltime) of less than 10 minutes are preempted.

We did not override the GraceTime setting on our cluster, and the default is zero seconds: https://slurm.schedmd.com/slurm.conf.html#OPT_GraceTime

MartinSmeyer avatar Nov 16 '21 15:11 MartinSmeyer

I was about to implement that, but I have some second thoughts. The idea is that (at least in our infra) it's not recommended to submit 10-minute jobs. This adds a lot of pressure on the scheduler for very little work. Slurm works best for long-running jobs.

The thing is that Slurm will send the timeout signal before the requested timeout. We also want to handle some requeue signals as timeouts if they arrive very close to the timeout. That's why we use this 80% cutoff.

gwenzek avatar Apr 05 '22 12:04 gwenzek