genome-grist
out-of-memory error
One of my genome-grist runs on 24 human host-associated samples failed due to an out-of-memory error.
less /home/zyzhao/2022-assemblyloss/trials/3_ERR_Mgnify/grist/jobs/grist.j53330088.err
...
Write-protecting output file outputs.24samples/abundtrim/ERR505092.abundtrim.fq.gz.
[Tue Aug 16 13:29:44 2022]
Finished job 187.
527 of 2729 steps (19%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/zyzhao/2022-assemblyloss/trials/3_ERR_Mgnify/grist/.snakemake/log/2022-08-15T200943.042863.snakemake.log
Error in snakemake invocation: Command '['snakemake', '-s', '/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile', '-j', '1', '--use-conda', 'summarize_gather', 'summarize_mapping', '--rerun-incomplete', '--cores', '11', '--configfile', '/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/defaults.conf', '/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/system.conf', 'conf_ERR.yml']' returned non-zero exit status 1.
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=53330088.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
less /home/zyzhao/2022-assemblyloss/trials/3_ERR_Mgnify/grist/jobs/grist.j53330088.out
shows:
SLURM_JOB_ID = 53330088
SLURM_NODELIST = c6-86
==========================================
Name : grist
User : zyzhao
Account : ctbrowngrp
Partition : med2
Nodes : c6-86
Cores : 33
GPUs : 0
State : OUT_OF_MEMORY
ExitCode : 0:125
Submit : 2022-08-15T20:09:36
Start : 2022-08-15T20:09:38
End : 2022-08-16T13:29:50
Waited : 00:00:02
Reserved walltime : 5-00:00:00
Used walltime : 17:20:12
Used CPU time : 1-12:52:19
% User (Computation): 88.27%
% System (I/O) : 11.73%
Mem reserved : 100G
Max Mem used : 226.96G (c6-86)
Max Disk Write : 51.20K (c6-86)
Max Disk Read : 17.91M (c6-86)
The batch script (below) reserved 100G of memory, but ~227G was used:
#!/bin/bash -login
#SBATCH -p med2 # use 'med2' partition for medium priority
#SBATCH -J grist # name for job
#SBATCH -c 11 # 11 cores
#SBATCH -t 5-00:00:00 # ask for 5 days
#SBATCH --mem=100G # 100 GB of memory
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH -e jobs/grist.j%j.err
#SBATCH -o jobs/grist.j%j.out
# initialize conda
. ~/mambaforge/etc/profile.d/conda.sh
# activate your desired conda environment
conda activate grist
# fail on weird errors
set -e
set -x
### YOUR COMMANDS GO HERE ###
genome-grist run conf_ERR.yml summarize_gather summarize_mapping --unlock
genome-grist run conf_ERR.yml summarize_gather summarize_mapping --rerun-incomplete --cores 11
# Print out values of the current jobs SLURM environment variables
env | grep SLURM
# Print out final statistics about resource use before job exits
scontrol show job ${SLURM_JOB_ID}
sstat --format 'JobID,MaxRSS,AveCPU' -P ${SLURM_JOB_ID}.batch
I suggest increasing to --mem=200G and rerunning.
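For example, only the memory line of the batch script above would need to change (a sketch; the observed peak was ~227G, so the exact value may need further tuning):

#SBATCH --mem=200G # raised from 100G; peak usage in the failed run was ~227G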
Thanks for posting, jean! There's a way to get genome-grist to pay more attention to memory, and I wanted to take the chance to write it down for future inclusion in the docs, as well as to help you out!
In brief, there are only two or three memory-intensive steps in genome-grist: (1) the trim-low-abund step (https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile#L596), (2) the sourmash prefetch step, and (3) the sourmash gather step. Inconveniently, these are also the longest-running steps.
If you run with -j 11, genome-grist will assume it can run all of those steps in parallel, which is why you got an out-of-memory error: it ran several of the biggest-memory steps at the same time, and their memory usage added up!
So the solution here is to tell genome-grist how much total memory it has to work with, so that it can limit memory use appropriately. You can do this by passing --resources mem_mb=100000 to genome-grist, which limits it to 100 GB.
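Concretely, the second genome-grist command in the batch script above would become (a sketch reusing the file names from that script):

genome-grist run conf_ERR.yml summarize_gather summarize_mapping --rerun-incomplete --cores 11 --resources mem_mb=100000

Snakemake, which genome-grist drives underneath, will then schedule jobs so that their declared mem_mb requirements stay within that limit.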
Another option we should explore is having genome-grist automatically read the Slurm memory and CPU limits and pass them on to the underlying snakemake invocation, unless they are overridden on the command line. This should be pretty easy, actually...
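Until that exists, a rough manual version of the same idea (a sketch, not a built-in feature; it assumes the job was submitted with --mem and -c, which make Slurm export SLURM_MEM_PER_NODE in MB and SLURM_CPUS_PER_TASK) is to forward those values from the batch script yourself:

# sketch: forward Slurm's memory/CPU limits to genome-grist by hand
MEM_MB=${SLURM_MEM_PER_NODE:-100000}   # MB; exported by Slurm when --mem is set
CPUS=${SLURM_CPUS_PER_TASK:-11}        # exported by Slurm when -c is set
genome-grist run conf_ERR.yml summarize_gather summarize_mapping \
    --rerun-incomplete --cores ${CPUS} --resources mem_mb=${MEM_MB}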
I think this should be added to the documentation.