
ValidateVariants memory usage is high when using a gvcf as the interval list

Open meganshand opened this issue 1 year ago • 1 comment

ValidateVariants requires a large amount of memory (>16 GB) to validate a GVCF when another GVCF is used as the interval list. This is not the case if a regular interval list is used instead. This comes up in the production ReblockGVCFs pipeline, since we validate the reblocked GVCF using the input (unreblocked) GVCF as the interval list to validate over (with -L). For now we can just run this tool on larger-memory machines, but it is confusing to me why using a ~4 GB GVCF as an interval list would cause such a large increase in memory requirements.

meganshand, Dec 08 '23 18:12

Upon further investigation (in discussion with @droazen and @lbergelson):

IntervalUtils.featureFileToIntervals holds the full interval list in memory before merging abutting intervals. In the GVCF-as-interval-list case that list becomes quite large, because the input consists of a great many very small intervals that would only later be merged into very large intervals (or entire contigs).

We can't use IntervalMergerIterator because in GATK we can't assume the input intervals are sorted, so the full interval list has to live in memory.

Perhaps we could use an on-disk sorting collection? Or do an optimistic merge as the intervals are read, even though they may not be sorted, and then sort and merge them again afterwards. This would help in the GVCF-as-interval-list case, but would not provide any benefit if the input isn't sorted.
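
To make the optimistic-merge idea concrete, here is a minimal sketch in Python. It is not GATK code: the function names, the (contig, start, end) tuple representation, and the contig_order argument are all hypothetical and chosen only for illustration. The first pass merges each interval into the previous one whenever they overlap or abut, so a sorted GVCF collapses to a handful of intervals before anything is held in a list; the sort and second pass then handle whatever the first pass could not merge.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical representation: (contig, start, end), 1-based inclusive.
Interval = Tuple[str, int, int]


def optimistic_merge(intervals: Iterable[Interval]) -> Iterator[Interval]:
    """Merge each interval into the previous one if it overlaps or abuts it.

    Streams through the input once and keeps only one pending interval,
    so memory use is constant regardless of input order.
    """
    current = None
    for contig, start, end in intervals:
        if (current is not None
                and contig == current[0]
                and current[1] <= start <= current[2] + 1):
            # Overlapping or abutting: extend the pending interval.
            current = (contig, current[1], max(current[2], end))
        else:
            if current is not None:
                yield current
            current = (contig, start, end)
    if current is not None:
        yield current


def merge_intervals(intervals: Iterable[Interval],
                    contig_order: List[str]) -> List[Interval]:
    """Optimistic pass, then sort, then a second merge pass.

    Only the output of the first pass is held in memory. For sorted input
    that is already the final answer; for unsorted input it may be as
    large as the original list.
    """
    first_pass = list(optimistic_merge(intervals))
    rank = {contig: i for i, contig in enumerate(contig_order)}
    first_pass.sort(key=lambda iv: (rank[iv[0]], iv[1], iv[2]))
    return list(optimistic_merge(first_pass))
```

With sorted, abutting reference blocks on one contig, the first pass yields a single interval, which is the case that matters for the ReblockGVCFs validation step. With shuffled input the first pass merges little, and memory use falls back to roughly what it is today.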

As a workaround for now, we'll add an argument to Picard's VcfToIntervalList to merge abutting intervals and add that to the command line in the ReblockGVCFs WDL.

@droazen @lbergelson Please add/clarify anything here I missed.

meganshand, Mar 25 '24 18:03