
FAQ item missing: Out of memory

Open jkwmoore opened this issue 11 months ago • 2 comments

I think we're missing a good FAQ item for Slurm jobs that run out of memory.

  • Must mention that you can check the job state with a command that reports OOM (for example, sacct reports the state OUT_OF_MEMORY)
  • Must mention that you can check how much memory the job reached with 'seff'
  • Must mention that the memory metrics shown by Slurm may not be accurate, because Slurm polls memory usage at intervals while the CGroup limit is enforced immediately, so a job can be killed between two polls.
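
As a starting point for the FAQ item, the first check could be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper names (parse_sacct, oom_steps, query_job) are hypothetical, and the sample output is made up for illustration (job ID, memory values and the exact ReqMem format are assumptions and vary between Slurm versions). It relies only on sacct's --parsable2 and --format options and the OUT_OF_MEMORY job state.

```python
import subprocess

# Fields requested from sacct; all are standard sacct format fields.
SACCT_FIELDS = ["JobID", "State", "MaxRSS", "ReqMem"]

# Made-up sample of `sacct --parsable2` output, for illustration only.
SAMPLE_OUTPUT = """JobID|State|MaxRSS|ReqMem
1234567|OUT_OF_MEMORY||4G
1234567.batch|OUT_OF_MEMORY|3512340K|
1234567.extern|COMPLETED|0|
"""


def query_job(job_id):
    """Run sacct for one job and return its raw '|'-separated output.

    Only works on a machine where the Slurm client tools are installed.
    """
    cmd = ["sacct", "-j", str(job_id), "--parsable2",
           "--format=" + ",".join(SACCT_FIELDS)]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_sacct(output):
    """Turn header-plus-rows '|'-separated text into a list of dicts."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


def oom_steps(rows):
    """Return the JobID of every job or job step that ended in OUT_OF_MEMORY."""
    return [row["JobID"] for row in rows
            if row.get("State", "").startswith("OUT_OF_MEMORY")]


rows = parse_sacct(SAMPLE_OUTPUT)
print("Steps killed for exceeding memory:", oom_steps(rows))
```

On a cluster, parse_sacct(query_job(job_id)) would replace the sample text. Running 'seff' with the job ID then gives a summary of the memory used against the memory requested, subject to the accuracy caveat in the last bullet above.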

See also: https://groups.google.com/g/slurm-users/c/KQ_NOPAN5xA (quoted below):

One should keep in mind that sacct results for memory usage are not accurate for Out Of Memory (OoM) jobs. This is due to the fact that the job is typically terminated prior to next sacct polling period, and also terminated prior to it reaching full memory allocation. Thus I wouldn't trust any of the results with regards to memory usage if the job is terminated by OoM. sacct just can't pick up a sudden memory spike like that and even if it did it would not correctly record the peak memory because the job was terminated prior to that point.
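
The effect described in the quote can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of how Slurm gathers accounting data: memory is assumed to grow linearly, the 30 second polling interval, 4000 MB limit and 100 MB/s growth rate are illustrative values, and the function name is hypothetical.

```python
def simulate_recorded_peak(growth_mb_per_s, limit_mb, poll_interval_s):
    """Toy model: memory grows linearly until the limit kills the job.

    Returns (recorded_peak_mb, kill_time_s), where recorded_peak_mb is the
    largest value seen by a poller that samples every poll_interval_s seconds
    and never gets to sample after the job has been killed.
    """
    kill_time_s = limit_mb / growth_mb_per_s
    recorded_peak_mb = 0.0
    t = 0.0
    while t < kill_time_s:
        recorded_peak_mb = max(recorded_peak_mb, growth_mb_per_s * t)
        t += poll_interval_s
    return recorded_peak_mb, kill_time_s


# Assumed values: 100 MB/s growth, 4000 MB limit, one poll every 30 s.
recorded, killed_at = simulate_recorded_peak(100.0, 4000.0, 30.0)
print(f"Job killed at {killed_at:.0f} s, recorded peak {recorded:.0f} MB of 4000 MB")
```

In this model the last poll happens at 30 s, the job is killed at 40 s, and the recorded peak stays well below the limit that was actually hit, which is why the reported figure for an OOM job should be read as a lower bound at best.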

jkwmoore, Mar 21 '24 16:03