galaxy
Fix/slurm out of memory warning exit code 0:125
We see these warnings (state OUT_OF_MEMORY, but with exit code 0:125) a lot on our cluster, and the affected jobs should be considered successful. See https://bugs.schedmd.com/show_bug.cgi?id=3820#c62
@scholtalbers If I read https://bugs.schedmd.com/show_bug.cgi?id=3820#c62 correctly, with SLURM 17.11.3 this will not do the right thing any more, since OUT_OF_MEMORY state with ExitCode 0:125 will identify the real OOM-killed jobs.
We have 17.11.6 and it does not correctly identify OOM-killed jobs. I don't know the exact details, but as far as I understand, if an event like this is detected on the node (for whatever reason), it becomes the job state, even though the job was not affected.
I'm unsure about this. It looks like code 0:125 should only happen if processes were really killed. Are you sure nothing's been killed when you see this?
Positive. However, if in doubt, we can also make this a configurable option?
Yeah, that'd be ideal I think.
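For illustration, a minimal sketch of what such an option could look like. This is not the code in this PR or in Galaxy: `SlurmJobInfo`, `is_memory_failure`, and the `treat_oom_warning_as_ok` flag are hypothetical names, and the default (treat OUT_OF_MEMORY as a real failure unless the admin opts out) is an assumption based on the SchedMD comment linked above.

```python
from dataclasses import dataclass

# Exit code reported alongside OUT_OF_MEMORY in the cases discussed above.
OOM_WARNING_EXIT_CODE = "0:125"


@dataclass
class SlurmJobInfo:
    """Hypothetical container for the two fields of interest."""
    state: str      # e.g. "COMPLETED", "OUT_OF_MEMORY"
    exit_code: str  # "<exit>:<signal>" as reported by Slurm, e.g. "0:125"


def is_memory_failure(job: SlurmJobInfo, treat_oom_warning_as_ok: bool = False) -> bool:
    """Decide whether a job should be failed for exceeding memory.

    With treat_oom_warning_as_ok=True (hypothetical admin option), a job in
    state OUT_OF_MEMORY with exit code 0:125 is considered successful.
    """
    if job.state != "OUT_OF_MEMORY":
        return False
    if treat_oom_warning_as_ok and job.exit_code == OOM_WARNING_EXIT_CODE:
        return False
    return True
```

Sites that trust the OUT_OF_MEMORY state would leave the flag off; sites that see it on unaffected jobs, as described above, could switch it on.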