Investigate potential memory leak in GKL + HaplotypeCaller when running long intervals

Open jamesemery opened this issue 2 years ago • 1 comments

I recently noticed a series of apparent memory failures when running HaplotypeCaller on some standard WGS test data, using the exact task from the warp pipeline here: https://github.com/broadinstitute/warp/blob/develop/pipelines/broad/dna_seq/germline/variant_calling/VariantCalling.wdl. Running that WDL with default inputs, except for haplotype_scatter_count set to 10 (so each node does approximately 5x as much work as with the default of 50), I get repeated HaplotypeCaller job failures after a few hours that have the pattern of memory failures. Typically HaplotypeCaller ends abruptly without any error message or exception at all (which could indicate that the VM is dying):

03:22:15.993 INFO  ProgressMeter -       chr13:18173014            378.6               1419490           3749.0
03:22:26.338 INFO  ProgressMeter -       chr13:18177988            378.8               1419530           3747.4
03:22:36.801 INFO  ProgressMeter -       chr13:18203610            379.0               1419700           3746.1
(END)

Alternatively, the run seems to end without the end-of-run messages being output:

23:05:30.662 INFO  ProgressMeter -        chr2:47207099            428.8               1372310           3200.4
23:05:40.859 INFO  ProgressMeter -        chr2:47323745            429.0               1372960           3200.7
23:05:50.896 INFO  ProgressMeter -        chr2:47476709            429.1               1373720           3201.2
Using GATK jar /gatk/gatk-package-4.2.2.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx6933m -Xms6933m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -jar /gatk/gatk-package-4.2.2.0-local.jar HaplotypeCaller [INPUTS]
2022/02/10 23:06:52 Starting delocalization.
2022/02/10 23:06:53 Delocalization script execution started...

These failures appear to be reproducible and happen at about the same point in every run. Because increasing the memory or decreasing the interval per shard seems to remove the issue, I suspect that HaplotypeCaller may be using more memory over longer shards. The fact that these runs are not throwing Java garbage collection exceptions makes me suspect that this might be related to the non-Java GKL code that's being run.
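To check whether a given shard really dies at about the same point each time, the last ProgressMeter line of each failed log can be compared across runs. The following is a minimal sketch, not part of GATK: the function name `last_progress` and the field names are hypothetical, and the regular expression assumes the column layout of the ProgressMeter lines pasted above (clock time, locus, elapsed minutes, regions processed, regions per minute).

```python
import re

# Assumed layout, taken from the log excerpts in this issue:
# HH:MM:SS.mmm INFO  ProgressMeter -  <contig:pos>  <elapsed min>  <regions>  <regions/min>
PROGRESS_RE = re.compile(
    r"^(?P<clock>\d{2}:\d{2}:\d{2}\.\d{3})\s+INFO\s+ProgressMeter\s+-\s+"
    r"(?P<locus>[^\s:]+:\d+)\s+"
    r"(?P<elapsed_min>\d+(?:\.\d+)?)\s+"
    r"(?P<processed>\d+)\s+"
    r"(?P<rate>\d+(?:\.\d+)?)\s*$"
)


def last_progress(log_lines):
    """Return the last ProgressMeter record in a log, or None if there is none."""
    last = None
    for line in log_lines:
        match = PROGRESS_RE.match(line.strip())
        if match:
            last = {
                "clock": match.group("clock"),
                "locus": match.group("locus"),
                "elapsed_min": float(match.group("elapsed_min")),
                "processed": int(match.group("processed")),
            }
    return last


if __name__ == "__main__":
    # Tail of the first failed log quoted above.
    sample = [
        "03:22:26.338 INFO  ProgressMeter -       chr13:18177988            378.8               1419530           3747.4",
        "03:22:36.801 INFO  ProgressMeter -       chr13:18203610            379.0               1419700           3746.1",
        "(END)",
    ]
    print(last_progress(sample))
```

Running this over the logs of repeated attempts at the same shard and comparing the reported locus and elapsed minutes would show how tightly the failures cluster.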

jamesemery, Feb 23 '22 19:02

To add another wrinkle to this issue: I found that the exact same set of 10-way GATK jobs succeeded when I doubled the off-Java memory from 1GB to 2GB (while keeping the machine memory constant at 8GB). This makes me very strongly suspect the issue is related to the GKL.
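The reasoning behind that suspicion can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that the Java heap is whatever remains of the machine memory after the off-Java reserve is subtracted, and that 8GB means 8192 MB; the actual task may derive `-Xmx` differently (the log above shows `-Xmx6933m`). The helper name `java_heap_flags` is hypothetical.

```python
def java_heap_flags(machine_mb: int, off_heap_mb: int) -> str:
    """Build -Xmx/-Xms flags, assuming heap = machine memory - off-Java reserve."""
    if off_heap_mb >= machine_mb:
        raise ValueError("off-Java reserve must be smaller than machine memory")
    heap_mb = machine_mb - off_heap_mb
    return f"-Xmx{heap_mb}m -Xms{heap_mb}m"


if __name__ == "__main__":
    MACHINE_MB = 8 * 1024  # assumption: the 8GB machine expressed in MB
    for reserve_mb in (1024, 2048):  # the 1GB and 2GB reserves from the comment above
        print(f"reserve={reserve_mb} MB -> {java_heap_flags(MACHINE_MB, reserve_mb)}")
```

Under that assumption, doubling the reserve on a fixed-size machine leaves the JVM with roughly 1GB less heap. If the jobs only succeed with the smaller heap and larger reserve, the pressure is more likely coming from memory allocated outside the Java heap than from the heap itself.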

jamesemery, Feb 24 '22 16:02