
refine_metabat2 running out of memory

CJREID opened this issue 8 months ago · 8 comments

Hi Rhys,

I'm running aviary recover via Slurm, submitting the pipeline as a single job which then submits each rule as its own job. The rule refine_metabat2 is running out of memory and being killed by Slurm.

JobID           JobName  Partition      User  AllocCPUS   NNodes    Elapsed   TotalCPU      State  MaxVMSize     MaxRSS     ReqMem        NodeList 
------------ ---------- ---------- --------- ---------- -------- ---------- ---------- ---------- ---------- ---------- ---------- --------------- 
39665747     snakejob.+       defq    rei219         64        1   07:44:52 1-00:52:56 OUT_OF_ME+                             480G            c320 
39665747.ba+      batch                              64        1   07:44:52 1-00:52:56 OUT_OF_ME+ 113578094+ 500873864K                       c320 
39665747.ex+     extern                              64        1   07:44:52  00:00.001  COMPLETED    124080K      1092K                       c320 

The first time it failed, I edited the rule so that it takes 480G from my config instead of the default 128G (mem_mb = lambda wildcards, attempt: max(int(config["max_memory"])*1024, 128*1024*attempt)), but it has since failed twice with the same errors. Notably, for both the 128G and 480G runs the MaxRSS was only slightly above the requested memory. I no longer have the 128G log file, though, as it was overwritten by the subsequent run.
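For clarity, here is a minimal standalone sketch of what that edited resource function ends up requesting per Snakemake attempt (plain Python, not the actual Snakefile; it just assumes config["max_memory"] is "480", matching the -m 480 I pass to aviary):

# Sketch of the edited mem_mb resource function, evaluated outside Snakemake.
# Assumption: config["max_memory"] arrives as GB ("480" here), so *1024 converts it to MB.
config = {"max_memory": "480"}

def mem_mb(wildcards, attempt):
    # The 480G from the config wins over the 128G-per-attempt default
    # until attempt 4 (4 * 128G = 512G > 480G).
    return max(int(config["max_memory"]) * 1024, 128 * 1024 * attempt)

for attempt in (1, 2, 3):
    print(attempt, mem_mb(None, attempt))  # 491520 MB every time, i.e. the mem_mb=491520 Snakemake reports

So regardless of retries (and I'm running with --retries 0 anyway), the rule should be asking Slurm for the full 480G, which matches the "Provided resources" lines in the logs below.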

Looking at the refine_metabat2.log file, Rosella takes ~6 hours to fail on iteration 0; CheckM and the remaining two refinement iterations then complete successfully in about 20 minutes. I've cut the CheckM output for brevity.

refine_metabat2.log

INFO: 08:56:20 - Refining iteration 0

INFO: 08:56:26 - Rosella refine iteration 0
[2024-06-17T22:56:26Z INFO  rosella] rosella version 0.5.3
[2024-06-17T22:56:27Z INFO  rosella::refine::refinery] Reading TNF table.
[2024-06-17T22:56:27Z INFO  rosella::refine::refinery] Beginning refinement of 68 MAGs
[2024-06-17T23:10:59Z ERROR rosella::refine::refinery] Last message from flight: Refining bins...
[2024-06-17T23:10:59Z ERROR rosella::refine::refinery] Run in debug mode to get full output.
[2024-06-18T06:18:04Z ERROR rosella] Refine Failed with error: Flight failed with exit code: signal: 9 (SIGKILL)

INFO: 16:18:04 - CheckM iteration 0
Running CheckM on 119 bins
[2024-06-18 16:18:06] INFO: CheckM v1.1.3
[2024-06-18 16:18:06] INFO: checkm lineage_wf -t 64 --pplacer_threads 48 -x fna --tab_table -f data/metabat2_refined//rosella_refined_1/refined_bins//checkm.out data/metabat2_refined//rosella_refined_1/refined_bins/ data/metabat2_refined//rosella_refined_1/refined_bins//checkm
[2024-06-18 16:18:06] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2024-06-18 16:18:07] INFO: Identifying marker genes in 119 bins with 64 threads:

...

[2024-06-18 16:27:33] INFO: QA information written to: data/metabat2_refined//rosella_refined_1/refined_bins//checkm.out
[2024-06-18 16:27:33] INFO: { Current stage: 0:00:07.470 || Total: 0:09:27.247 }
INFO: 16:27:34 - Ending iteration: 0

INFO: 16:27:34 - Rosella refine iteration 1
[2024-06-18T06:27:34Z INFO  rosella] rosella version 0.5.3
[2024-06-18T06:27:35Z INFO  rosella::refine::refinery] Reading TNF table.
[2024-06-18T06:27:36Z INFO  rosella::refine::refinery] Beginning refinement of 17 MAGs

INFO: 16:35:18 - CheckM iteration 1
Running CheckM on 5 bins
[2024-06-18 16:35:19] INFO: CheckM v1.1.3
[2024-06-18 16:35:19] INFO: checkm lineage_wf -t 64 --pplacer_threads 48 -x fna --tab_table -f data/metabat2_refined//rosella_refined_2/refined_bins//checkm.out data/metabat2_refined//rosella_refined_2/refined_bins/ data/metabat2_refined//rosella_refined_2/refined_bins//checkm
[2024-06-18 16:35:19] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2024-06-18 16:35:19] INFO: Identifying marker genes in 5 bins with 64 threads:

...

[2024-06-18 16:37:15] INFO: QA information written to: data/metabat2_refined//rosella_refined_2/refined_bins//checkm.out
[2024-06-18 16:37:15] INFO: { Current stage: 0:00:00.372 || Total: 0:01:55.817 }
INFO: 16:37:15 - Ending iteration: 1

INFO: 16:37:15 - Refining iteration 2

INFO: 16:37:15 - Rosella refine iteration 2
[2024-06-18T06:37:15Z INFO  rosella] rosella version 0.5.3
[2024-06-18T06:37:16Z INFO  rosella::refine::refinery] Reading TNF table.
[2024-06-18T06:37:17Z INFO  rosella::refine::refinery] Beginning refinement of 2 MAGs

INFO: 16:41:02 - CheckM iteration 2
No bins to refine
Skipping refinement
Refinery finished.

Three seconds after the log suggests the rule has finished successfully, the .err file shows the job being killed by the out-of-memory handler.

refine_metabat2.err
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=491520, mem_mib=468750, disk_mb=4840, disk_mib=4616
Select jobs to execute...

[Tue Jun 18 08:56:16 2024]
rule refine_metabat2:
    input: data/metabat_bins_2/checkm.out, data/metabat_bins_2/done, data/coverm.cov, /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta
    output: data/metabat2_refined/done
    log: logs/refine_metabat2.log
    jobid: 0
    benchmark: benchmarks/refine_metabat2.benchmark.txt
    reason: Missing output files: data/metabat2_refined/done
    threads: 64
    resources: mem_mb=491520, mem_mib=468750, disk_mb=4840, disk_mib=4616, tmpdir=/tmp/rei219.39665747, runtime=4320

python -c "from __future__ import print_function; import sys, json; print(json.dumps([sys.version_info.major, sys.version_info.minor]))"
Activating conda environment: ../../../../../../.conda/f66cf05325a05bc6eeb3592400331d65_
python /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/scripts/tmpuokzy9yx.rosella_refine.py
Activating conda environment: ../../../../../../.conda/f66cf05325a05bc6eeb3592400331d65_
Not cleaning up /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/scripts/tmpuokzy9yx.rosella_refine.py
[Tue Jun 18 16:41:05 2024]
Finished job 0.
1 of 1 steps (100%) done
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=39665747.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

The data/metabat2_refined/final_bins folder has 115 refined bins from iteration 1 and 4 from iteration 2, so the refinement seems to have more or less worked. However, because the job is considered failed after being killed (see the cluster error below, where Snakemake deletes some outputs of refine_metabat2), rerunning the pipeline starts the rule from scratch, and it fails again after 6-7 hours.

cluster.err
 Prior to running version 0.9.1, you will need to source: 
 /apps/aviary-genome/0.9.1/aviary_env.sh 
  
 Note: This application uses snakemake and you will need to invoke a profile to 
 run the pipelines on our batch system 
 
 See https://rhysnewell.github.io/aviary/examples and HPC submission 


Loading aviary-genome/0.9.1
  Loading requirement: python/3.9.4 amazon-corretto/21.0.0.35.1 bbmap/39.01
    parallel/20210322 miniconda3/4.9.2 pigz/2.7.0 fastani/1.33 fasttree/2.1.11
    hmmer/3.4.0 gsl/2.6 capnproto-c++/0.9.1 mash/2.3 pplacer/1.1.alpha19
    prodigal/2.6.3 gtdbtk/2.3.2
Warning: '/apps/aviary-genome/0.9.1/conda_pkgs' already in 'pkgs_dirs' list, moving to the top
Warning: '/apps/aviary-genome/0.9.1/conda_envs' already in 'envs_dirs' list, moving to the top
06/18/2024 08:56:09 AM INFO: Time - 08:56:09 18-06-2024
06/18/2024 08:56:09 AM INFO: Command - /apps/aviary-genome/0.9.1/bin/aviary recover --assembly /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta -1 /scratch3/rei219/projects/SBM/data/processed/fastp/SBM_001_005.filt.R1.fastq.gz -2 /scratch3/rei219/projects/SBM/data/processed/fastp/SBM_001_005.filt.R2.fastq.gz -l /scratch3/rei219/projects/SBM/data/raw/seq/DNA/pacbio/S_hifi.fastq.gz --gtdb-path /scratch3/projects/datasets_bioref/aviary/gtdb --eggnog-db-path /scratch3/projects/datasets_bioref/aviary/eggnog --checkm2-db-path /scratch3/projects/datasets_bioref/aviary/checkm2db --singlem-metapackage-path /scratch3/projects/datasets_bioref/aviary/singlem/S4.3.0.GTDB_r220.metapackage_20240523.smpkg.zb --tmpdir /scratch3/rei219/aviary_tmp --snakemake-profile /scratch3/rei219/.config/snakemake/slurm --conda-prefix /scratch3/rei219/.conda --semibin-model soil --skip-abundances --skip-binner vamb -z hifi -n 64 -t 64 -m 480 -o /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg
06/18/2024 08:56:09 AM INFO: Version - 0.9.1
06/18/2024 08:56:09 AM INFO: Configuration file written to /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/config.yaml
06/18/2024 08:56:09 AM INFO: Executing: snakemake --snakefile /apps/aviary-genome/0.9.1/lib/python3.9/site-packages/aviary/modules/Snakefile --directory /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg --cores 64 --rerun-incomplete --keep-going  --rerun-triggers mtime --configfile /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/config.yaml --nolock --profile /scratch3/rei219/.config/snakemake/slurm --retries 0 --conda-frontend mamba --resources mem_mb=491520   --use-conda --conda-prefix /scratch3/rei219/.conda   recover_mags
Using profile /scratch3/rei219/.config/snakemake/slurm for setting default command line arguments.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cluster nodes: 100
Provided resources: mem_mb=491520
Job stats:
job                 count
----------------  -------
checkm2                 1
checkm_das_tool         1
das_tool                1
finalise_stats          1
gtdbtk                  1
recover_mags            1
refine_dastool          1
refine_metabat2         1
singlem_appraise        1
total                   9

Select jobs to execute...

[Tue Jun 18 08:56:13 2024]
rule refine_metabat2:
    input: data/metabat_bins_2/checkm.out, data/metabat_bins_2/done, data/coverm.cov, /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta
    output: data/metabat2_refined/done
    log: logs/refine_metabat2.log
    jobid: 5
    benchmark: benchmarks/refine_metabat2.benchmark.txt
    reason: Missing output files: data/metabat2_refined/done
    threads: 64
    resources: mem_mb=491520, mem_mib=468750, disk_mb=4840, disk_mib=4616, tmpdir=<TBD>, runtime=4320

Submitted job 5 with external jobid '39665747'.
[Tue Jun 18 16:41:10 2024]
Error in rule refine_metabat2:
    jobid: 5
    input: data/metabat_bins_2/checkm.out, data/metabat_bins_2/done, data/coverm.cov, /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta
    output: data/metabat2_refined/done
    log: logs/refine_metabat2.log (check log file(s) for error details)
    conda-env: /scratch3/rei219/.conda/f66cf05325a05bc6eeb3592400331d65_
    cluster_jobid: 39665747

Error executing rule refine_metabat2 on cluster (jobid: 5, external: 39665747, jobscript: /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/tmp.uv_g70nb/snakejob.refine_metabat2.5.sh). For error details see the cluster log and the log files of the involved rule(s).
Removing output files of failed job refine_metabat2 since they might be corrupted:
data/metabat2_refined/done
Exiting because a job execution failed. Look above for error message
chmod g+w /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/log/2024-06-18T085610.315172.snakemake.log
Complete log: .snakemake/log/2024-06-18T085610.315172.snakemake.log
06/18/2024 04:41:21 PM CRITICAL: Command '['snakemake', '--snakefile', '/apps/aviary-genome/0.9.1/lib/python3.9/site-packages/aviary/modules/Snakefile', '--directory', '/scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg', '--cores', '64', '--rerun-incomplete', '--keep-going', '--rerun-triggers', 'mtime', '--configfile', '/scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/config.yaml', '--nolock', '--profile', '/scratch3/rei219/.config/snakemake/slurm', '--retries', '0', '--conda-frontend', 'mamba', '--resources', 'mem_mb=491520', '--use-conda', '--conda-prefix', '/scratch3/rei219/.conda', 'recover_mags']' returned non-zero exit status 1.

Is it likely that the failure of the first Rosella iteration and the excess memory usage at the very end are linked?

Thanks,

Cam

CJREID · Jun 19 '24 01:06