Memory leak in GenotypeGVCFs with `-all-sites`
Bug Report
Affected tool(s) or class(es)
GenotypeGVCFs with -all-sites
Affected version(s)
- 4.2 through 4.6
Description
We ran GenotypeGVCFs from GATK 4.5 with -all-sites on a dataset of 120 samples, using GRCh37 as the reference. Each run was limited to a single chromosome, and every run failed after consuming 3 TB of memory. I then tried a smaller subset of 8 samples with memory limited to 32 GB; all of those runs also failed, after processing 3-10 Mbp depending on the chromosome.
Finally, I picked chromosome 9 at random and ran GATK versions 4.1 through 4.6. Only 4.1 did not show the problem: it finished the whole chromosome (141 Mbp) with a peak memory usage of around 8 GB. All the other versions failed after 3-6 Mbp. (Sorry, I used different memory settings for 4.5, so I did not include it.)
Time is in seconds, memory is in MB.
If I run the same command without -all-sites, the maximum memory usage is around 1.6 GB.
Steps to reproduce
The GenomicsDB workspace was created with the corresponding GATK version as follows:
gatk --java-options "-Xmx12000m" GenomicsDBImport --genomicsdb-workspace-path tmp/genomicsdb44/9 \
--genomicsdb-shared-posixfs-optimizations --batch-size 120 --verbosity DEBUG \
-L 9 -V data/gatk/gvcf/9/1.g.vcf.gz -V data/gatk/gvcf/9/2.vcf.gz -V data/gatk/gvcf/9/3.g.vcf.gz \
-V data/gatk/gvcf/9/4.g.vcf.gz -V data/gatk/gvcf/9/5.g.vcf.gz -V data/gatk/gvcf/9/6.g.vcf.gz \
-V data/gatk/gvcf/9/7.g.vcf.gz -V data/gatk/gvcf/9/8.g.vcf.gz
GenotypeGVCFs was run as:
gatk --java-options "-Xmx12g" GenotypeGVCFs -R data/ref/hs37d5.fa.gz \
-V gendb://tmp/genomicsdb44/9 -O data/gatk/variants/9/raw44.vcf.gz -L 9 \
--tmp-dir ./tmp/tmp -all-sites
All runs were performed under resource_monitor, which was instructed to kill the process if it consumed more than 14000 MB of memory. Thus, at least 2 GB was left outside the Java heap for reading GenomicsDB. The GenomicsDB workspace on disk is around 3.1 GB for versions >= 4.2 and 3.0 GB for version 4.1.
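For reference, the off-heap headroom mentioned above can be checked with simple arithmetic. This is only a sketch: `headroom_mb` is a hypothetical helper, not part of GATK or resource_monitor, and it assumes the monitor limit and the heap size are expressed in the same MB unit. Note that `-Xmx12g` corresponds to 12 * 1024 = 12288 MiB, so the margin for the GenotypeGVCFs command may be somewhat below 2 GB, depending on how the monitor counts MB.

```shell
# Hypothetical helper: memory left for native (off-heap) allocations,
# e.g. the GenomicsDB reader, given a process limit and a JVM heap cap.
# Both arguments are in MB and are assumed to use the same unit.
headroom_mb() {
    limit_mb=$1
    heap_mb=$2
    echo $(( limit_mb - heap_mb ))
}

# -Xmx12000m, as used in the GenomicsDBImport command
headroom_mb 14000 12000

# -Xmx12g is 12 * 1024 MiB, as used in the GenotypeGVCFs command
headroom_mb 14000 $(( 12 * 1024 ))
```

Either way, the headroom is on the order of 2 GB, which is small compared with the growth observed when -all-sites is enabled.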