Investigate potential memory leak in GKL + HaplotypeCaller when running long intervals
I recently noticed a series of apparent memory failures when running HaplotypeCaller on some standard WGS test data, using the exact task from the warp pipeline here: https://github.com/broadinstitute/warp/blob/develop/pipelines/broad/dna_seq/germline/variant_calling/VariantCalling.wdl. Running that WDL with default inputs except for haplotype_scatter_count set to 10 (so each node does approximately 5x as much work as with the default of 50), I get repeated HaplotypeCaller job failures after a few hours that fit the pattern of memory failures. Typically HaplotypeCaller ends abruptly without any error message or exception at all, which could indicate that the VM is dying:
03:22:15.993 INFO ProgressMeter - chr13:18173014 378.6 1419490 3749.0
03:22:26.338 INFO ProgressMeter - chr13:18177988 378.8 1419530 3747.4
03:22:36.801 INFO ProgressMeter - chr13:18203610 379.0 1419700 3746.1
(END)
Alternatively, the run ends without the usual end-of-run messages being output:
23:05:30.662 INFO ProgressMeter - chr2:47207099 428.8 1372310 3200.4
23:05:40.859 INFO ProgressMeter - chr2:47323745 429.0 1372960 3200.7
23:05:50.896 INFO ProgressMeter - chr2:47476709 429.1 1373720 3201.2
Using GATK jar /gatk/gatk-package-4.2.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx6933m -Xms6933m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -jar /gatk/gatk-package-4.2.2.0-local.jar HaplotypeCaller [INPUTS]
2022/02/10 23:06:52 Starting delocalization.
2022/02/10 23:06:53 Delocalization script execution started...
These failures appear to be reproducible and happen at about the same point in every run. Increasing the memory or decreasing the interval per shard seems to remove the issue, which makes me suspect that HaplotypeCaller uses more memory as shard length grows. Because the runs are not throwing Java garbage collection exceptions, I suspect this is related to the non-Java GKL code that is being run.
To add another wrinkle: the exact same set of 10-way GATK jobs succeeded when I doubled the off-heap (non-Java) memory from 1GB to 2GB while keeping the machine memory constant at 8GB. This makes me very strongly suspect that the issue is related to the GKL.
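One rough way to test the off-heap hypothesis would be to sample the resident set size (RSS) of the HaplotypeCaller process over a run and compare it with the `-Xmx` value from the command line. The sketch below is only an illustration, not something I have run against these jobs: the helper names are hypothetical, and it assumes a Linux VM where `/proc/<pid>/status` is readable. Since the resident part of the Java heap can never exceed `-Xmx`, `RSS - Xmx` is a lower bound on memory held outside the heap; if that number keeps climbing with the interval length, it would point at native allocations.

```python
import re


def parse_xmx_mb(java_command: str) -> int:
    """Return the -Xmx value in MiB from a java command line (m or g suffix)."""
    match = re.search(r"-Xmx(\d+)([mMgG])", java_command)
    if match is None:
        raise ValueError("no -Xmx flag with an m or g suffix found")
    value, unit = int(match.group(1)), match.group(2).lower()
    return value * 1024 if unit == "g" else value


def parse_vm_rss_mb(status_text: str) -> float:
    """Extract VmRSS in MiB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1)) / 1024


def non_heap_lower_bound_mb(rss_mb: float, xmx_mb: int) -> float:
    """Lower bound on resident memory outside the Java heap.

    The resident heap is at most xmx_mb, so anything above that in RSS
    must come from non-heap memory (JVM internals or native libraries).
    """
    return max(0.0, rss_mb - xmx_mb)


def sample_non_heap_mb(pid: int, xmx_mb: int) -> float:
    """Read /proc/<pid>/status once and return the non-heap lower bound."""
    with open(f"/proc/{pid}/status") as handle:
        return non_heap_lower_bound_mb(parse_vm_rss_mb(handle.read()), xmx_mb)
```

Calling `sample_non_heap_mb(pid, parse_xmx_mb(command))` every minute or so and logging the result next to the ProgressMeter output would show whether non-heap memory grows steadily before the job disappears.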