Memory leak in GenotypeGVCFs with `-all-sites`

Open brisk022 opened this issue 1 year ago • 8 comments

Bug Report

Affected tool(s) or class(es)

GenotypeGVCFs with -all-sites

Affected version(s)

  • 4.2 through 4.6

Description

We ran GenotypeGVCFs from GATK 4.5 with -all-sites on a dataset of 120 samples, using GRCh37 as the reference. Each run was limited to a single chromosome, and every run failed after consuming 3 TB of memory. I then tried a smaller subset of 8 samples with memory limited to 32 GB; all of those runs also failed, after processing 3-10 Mbp depending on the chromosome.

Finally, I picked chromosome 9 at random and ran GATK versions 4.1 through 4.6; only 4.1 did not show the problem. It finished the whole chromosome (141 Mbp) with a maximum memory usage of around 8 GB. All the others failed after 3-6 Mbp. (Sorry, I used different memory settings for 4.5, so I did not include it.)

[Image: memory_usage]

Time is in seconds, memory is in MB.

If I run the same command without -all-sites, the maximum memory usage is around 1.6 GB.

Steps to reproduce

The GenomicsDB workspace was created with the corresponding GATK version as follows:

gatk --java-options "-Xmx12000m" GenomicsDBImport --genomicsdb-workspace-path tmp/genomicsdb44/9 \
    --genomicsdb-shared-posixfs-optimizations --batch-size 120 --verbosity DEBUG \
    -L 9 -V data/gatk/gvcf/9/1.g.vcf.gz -V data/gatk/gvcf/9/2.vcf.gz -V data/gatk/gvcf/9/3.g.vcf.gz \
    -V data/gatk/gvcf/9/4.g.vcf.gz -V data/gatk/gvcf/9/5.g.vcf.gz -V data/gatk/gvcf/9/6.g.vcf.gz \
    -V data/gatk/gvcf/9/7.g.vcf.gz -V data/gatk/gvcf/9/8.g.vcf.gz

GenotypeGVCFs was run as:

gatk --java-options "-Xmx12g" GenotypeGVCFs -R data/ref/hs37d5.fa.gz \
    -V gendb://tmp/genomicsdb44/9 -O data/gatk/variants/9/raw44.vcf.gz -L 9 \
    --tmp-dir ./tmp/tmp -all-sites

All runs were performed under resource_monitor, which was instructed to kill the process if it consumed more than 14000 MB of memory. Thus, about 2 GB beyond the 12 GB Java heap was available for reading GenomicsDB. The size of the GenomicsDB workspace on disk is around 3.1 GB for versions >=4.2 and 3.0 GB for version 4.1.
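
For anyone who wants to reproduce the memory cap without resource_monitor, the following is a minimal sketch of the same idea: poll a process's resident memory and kill it once it passes a limit. It is not resource_monitor, whose exact invocation is not shown in this report. The function names `rss_mb` and `watch_rss` are hypothetical, and the sketch assumes a `ps` that supports `-o rss=` (reported in KB), as on Linux and macOS.

```shell
#!/bin/sh
# Hypothetical stand-in for the memory cap described above.
# This is NOT resource_monitor; it only illustrates the same idea.

# rss_mb PID: print the resident set size of PID in MB (integer).
# Assumes `ps -o rss=` reports kilobytes (true on Linux and macOS).
rss_mb() {
    rss_kb=$(ps -o rss= -p "$1" 2>/dev/null | tr -d ' ')
    [ -n "$rss_kb" ] || return 1      # process is gone or ps failed
    echo $((rss_kb / 1024))
}

# watch_rss PID LIMIT_MB [INTERVAL_S]
# Poll PID until it exits, printing "<epoch seconds> <RSS in MB>" per sample.
# If RSS exceeds LIMIT_MB, kill the process and return 1; otherwise return 0.
watch_rss() {
    pid=$1
    limit=$2
    interval=${3:-5}
    while kill -0 "$pid" 2>/dev/null; do
        mb=$(rss_mb "$pid") || break
        echo "$(date +%s) $mb"
        if [ "$mb" -gt "$limit" ]; then
            kill -9 "$pid"
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Example (not executed here): cap a process at 14000 MB, sampling every 5 s.
#   watch_rss "$JAVA_PID" 14000 5 > memory_usage.txt
```

The sketch only measures the process whose PID it is given, not its children. If the gatk launcher starts the JVM as a child process, pass the JVM's PID rather than the launcher's.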

brisk022 • Oct 01 '24 14:10