Too many failed allocations

Open louisponet opened this issue 3 years ago • 1 comments

Hi,

I've been running into the error in the attached screenshot when running an alloc with Slurm that used to work (before recent maintenance on the cluster). Is this because the sbatch command has failed too many times, i.e. an issue with the Slurm installation? Is there a way to allow more failures, or some other hacky workaround in case Slurm is overloaded on this cluster?

EDIT: this does not happen immediately but after a variable amount of time. It's not clear exactly how long, or whether it just depends on the load on the Slurm daemon, if that's the culprit.

Cheers, Louis

Jul 27 '22 14:07 louisponet

Hi, the autoallocator gives up after several failed job submissions, to avoid spamming the job manager forever when it can't manage to schedule anything successfully.

We'd need more information to see why it fails, though. You should be able to find logs from the submissions in the root HyperQueue directory ($HOME/.hq-server or something like that). You can also try displaying the allocation history to see if there are any errors.
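If it helps with digging through that directory, here is a rough sketch of a small Python helper that lists the most recently modified files under a directory and prints the last few lines of each. It is not part of HyperQueue. The default path `~/.hq-server` is only the guess mentioned above, and the helper names `recent_files` and `tail` are made up for this example. It knows nothing about how the files inside are laid out, so adjust the path to wherever your server directory actually is.

```python
import os
from pathlib import Path


def recent_files(root, limit=10):
    """Return up to `limit` regular files under `root`, newest first."""
    root = Path(os.path.expanduser(str(root)))
    if not root.is_dir():
        # Nothing to scan: the directory is missing or is not a directory.
        return []
    files = [p for p in root.rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


def tail(path, lines=20):
    """Return the last `lines` lines of a text file as a list of strings."""
    with open(path, "r", errors="replace") as fh:
        return fh.read().splitlines()[-lines:]


if __name__ == "__main__":
    # Assumption: the server directory is at ~/.hq-server; change as needed.
    server_dir = "~/.hq-server"
    found = recent_files(server_dir, limit=5)
    if not found:
        print(f"No files found under {server_dir}")
    for path in found:
        print(f"==> {path}")
        for line in tail(path, lines=20):
            print(line)
```

Sorting by modification time puts the files written around the last failed submission first, which is usually where the sbatch error message ends up.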

Jul 28 '22 18:07 Kobzol