
Fix/slurm out of memory warning exit code 0:125

Open scholtalbers opened this issue 5 years ago • 5 comments

We receive these warnings (state OUT_OF_MEMORY, but with exit code 0:125) a lot on our cluster, and the affected jobs should be considered successful. See https://bugs.schedmd.com/show_bug.cgi?id=3820#c62

scholtalbers avatar Oct 08 '18 13:10 scholtalbers

@scholtalbers If I read https://bugs.schedmd.com/show_bug.cgi?id=3820#c62 correctly, as of SLURM 17.11.3 this change will no longer do the right thing, since the OUT_OF_MEMORY state with ExitCode 0:125 will identify the jobs that really were OOM-killed.

nsoranzo avatar Oct 08 '18 14:10 nsoranzo

We have 17.11.6 and it does not correctly identify OOM-killed jobs. I don't know the exact details, but as far as I understand, if an event like this is detected on the node (for whatever reason), it becomes the job state, even though the job was not affected.

scholtalbers avatar Oct 08 '18 15:10 scholtalbers

I'm unsure about this. It looks like code 0:125 should only happen if processes were really killed. Are you sure nothing has been killed when you see this?

natefoo avatar Oct 08 '18 17:10 natefoo

Positive. However, if in doubt, we can also make this a configurable option?

scholtalbers avatar Oct 08 '18 17:10 scholtalbers

Yeah, that'd be ideal I think.

natefoo avatar Oct 08 '18 17:10 natefoo
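
For illustration, the configurable option discussed above could look roughly like the sketch below. This is not Galaxy's actual implementation: `SlurmJobStatus`, `parse_status`, `is_memory_failure`, and the `treat_oom_0_125_as_success` flag are all hypothetical names, and the sketch assumes the state and the `ExitCode` field arrive as strings in the `exit:signal` form that Slurm's accounting tools report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlurmJobStatus:
    """Hypothetical container for the two fields discussed in this thread."""
    state: str      # e.g. "COMPLETED" or "OUT_OF_MEMORY"
    exit_code: int  # part before the colon in the ExitCode field
    signal: int     # part after the colon in the ExitCode field


def parse_status(state: str, exit_code_field: str) -> SlurmJobStatus:
    """Split an ExitCode string such as "0:125" into its two integers."""
    code, _, sig = exit_code_field.partition(":")
    return SlurmJobStatus(state.strip(), int(code), int(sig or 0))


def is_memory_failure(status: SlurmJobStatus,
                      treat_oom_0_125_as_success: bool = False) -> bool:
    """Decide whether a job should be reported as failed for memory reasons.

    By default every OUT_OF_MEMORY state counts as a failure. When the
    (hypothetical) option is enabled, OUT_OF_MEMORY combined with
    ExitCode 0:125 is treated as a warning only, as requested above.
    """
    if status.state != "OUT_OF_MEMORY":
        return False
    if (treat_oom_0_125_as_success
            and status.exit_code == 0
            and status.signal == 125):
        return False
    return True


if __name__ == "__main__":
    status = parse_status("OUT_OF_MEMORY", "0:125")
    print(is_memory_failure(status))
    print(is_memory_failure(status, treat_oom_0_125_as_success=True))
```

Keeping the default at `False` preserves the stricter behaviour for sites where 0:125 does mean a process was killed, while clusters that see spurious OUT_OF_MEMORY states can opt in.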