
out-of-memory error

jeanzzhao opened this issue:

One of my genome-grist runs on 24 human host-associated samples failed due to an out-of-memory error. less /home/zyzhao/2022-assemblyloss/trials/3_ERR_Mgnify/grist/jobs/grist.j53330088.err shows:

...
Write-protecting output file outputs.24samples/abundtrim/ERR505092.abundtrim.fq.gz.
[Tue Aug 16 13:29:44 2022]
Finished job 187.
527 of 2729 steps (19%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/zyzhao/2022-assemblyloss/trials/3_ERR_Mgnify/grist/.snakemake/log/2022-08-15T200943.042863.snakemake.log
Error in snakemake invocation: Command '['snakemake', '-s', '/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile', '-j', '1', '--use-conda', 'summarize_gather', 'summarize_mapping', '--rerun-incomplete', '--cores', '11', '--configfile', '/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/defaults.conf', '/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/system.conf', 'conf_ERR.yml']' returned non-zero exit status 1.
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=53330088.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

less /home/zyzhao/2022-assemblyloss/trials/3_ERR_Mgnify/grist/jobs/grist.j53330088.out shows:

SLURM_JOB_ID = 53330088
SLURM_NODELIST = c6-86
==========================================
Name                : grist
User                : zyzhao
Account             : ctbrowngrp
Partition           : med2
Nodes               : c6-86
Cores               : 33
GPUs                : 0
State               : OUT_OF_MEMORY
ExitCode            : 0:125
Submit              : 2022-08-15T20:09:36
Start               : 2022-08-15T20:09:38
End                 : 2022-08-16T13:29:50
Waited              :   00:00:02
Reserved walltime   : 5-00:00:00
Used walltime       :   17:20:12
Used CPU time       : 1-12:52:19
% User (Computation): 88.27%
% System (I/O)      : 11.73%
Mem reserved        : 100G
Max Mem used        : 226.96G (c6-86)
Max Disk Write      : 51.20K (c6-86)
Max Disk Read       : 17.91M (c6-86)

The sbatch submission script (below) reserved 100G of memory, but ~227G was used.

#!/bin/bash -login
#SBATCH -p med2                # use 'med2' partition for medium priority
#SBATCH -J grist               # name for job
#SBATCH -c 11                  # 11 cores
#SBATCH -t 5-00:00:00             # ask for 5 days
#SBATCH --mem=100G             # total memory for the job
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH -e jobs/grist.j%j.err
#SBATCH -o jobs/grist.j%j.out

# initialize conda
. ~/mambaforge/etc/profile.d/conda.sh

# activate your desired conda environment
conda activate grist

# fail on weird errors
set -e
set -x

### YOUR COMMANDS GO HERE ###
genome-grist run conf_ERR.yml summarize_gather summarize_mapping --unlock
genome-grist run conf_ERR.yml summarize_gather summarize_mapping --rerun-incomplete --cores 11

# Print out values of the current jobs SLURM environment variables
env | grep SLURM

# Print out final statistics about resource use before job exits
scontrol show job ${SLURM_JOB_ID}
sstat --format 'JobID,MaxRSS,AveCPU' -P ${SLURM_JOB_ID}.batch

I suggest increasing to --mem=200G and rerunning; see the sketch of the revised header below.
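For reference, here is a minimal sketch of what that change would look like in the submission script header. Only the --mem directive changes; note that the observed peak above was ~227G, so 200G may still be tight.

#SBATCH -p med2                # use 'med2' partition for medium priority
#SBATCH -J grist               # name for job
#SBATCH -c 11                  # 11 cores
#SBATCH -t 5-00:00:00          # ask for 5 days
#SBATCH --mem=200G             # raised from 100G; observed peak was ~227G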

jeanzzhao (Aug 28 '22)

Thanks for posting, jean! There's a way to get genome-grist to pay more attention to memory, and I wanted to take the chance to write it down here, both to help you out and for future inclusion in the docs!

In brief, there are only two or three memory-intensive steps in genome-grist: (1) the trim-low-abund step (https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile#L596), (2) the sourmash prefetch step, and (3) the sourmash gather step. Inconveniently, these are also the longest-running steps.
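For orientation, those steps boil down to commands roughly of the following shape. This is an illustrative sketch only: the sample and database names are placeholders, and the exact flags genome-grist uses may differ (see the Snakefile linked above for the real rules).

# (1) k-mer abundance trimming with khmer; the -M flag caps the counting-table memory
trim-low-abund.py -C 3 -Z 18 -V -M 20e9 ERR505092.trim.fq.gz -o ERR505092.abundtrim.fq.gz --gzip

# (2) sourmash prefetch of the sample signature against the search database
sourmash prefetch ERR505092.abundtrim.sig gtdb-rs207.genomic.k31.zip -o ERR505092.prefetch.csv

# (3) sourmash gather to pick the minimum set of matching genomes
sourmash gather ERR505092.abundtrim.sig gtdb-rs207.genomic.k31.zip -o ERR505092.gather.csv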

If you run with -j 11, genome-grist assumes it can run all of those steps in parallel, which is why you got an out-of-memory error: it ran several of the highest-memory steps at the same time, and their memory usage added up!

So the solution here is to tell genome-grist how much total memory it has to work with, so it can limit memory use appropriately. You can do this by passing --resources mem_mb=100000 to genome-grist, which limits it to 100 GB; see the example below.
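For example, the second genome-grist command in the batch script above would become something like this (100000 MB is roughly the 100G reserved via --mem=100G):

genome-grist run conf_ERR.yml summarize_gather summarize_mapping \
    --rerun-incomplete --cores 11 --resources mem_mb=100000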

Another option that we should explore is having genome-grist automatically read in the Slurm memory and CPU limitations and pass them on to snakemake underneath, unless they are overridden on the command line. This should be pretty easy, actually...
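In the meantime, one way to approximate this by hand is to read the environment variables Slurm exports for the job and forward them. This is a sketch under the assumption that --mem and -c were used in the submission script, since SLURM_MEM_PER_NODE (in MB) and SLURM_CPUS_PER_TASK are only set in that case:

# fall back to conservative defaults if Slurm doesn't export these variables
MEM_MB=${SLURM_MEM_PER_NODE:-100000}    # memory reserved via --mem, in MB
CORES=${SLURM_CPUS_PER_TASK:-11}        # cores reserved via -c / --cpus-per-task

genome-grist run conf_ERR.yml summarize_gather summarize_mapping \
    --rerun-incomplete --cores "${CORES}" --resources mem_mb="${MEM_MB}"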

ctb (Aug 29 '22)

I think this should be added to the documentation.

ctb (Sep 27 '22)