gatk
gatk copied to clipboard
CombineGVCFs produces incorrectly ordered output.
Hi GATK,
We appear to have run in to a very rare corner case in CombineGVCFs that causes it to produce incorrect output. Unfortunately our input data is clinical, so we are limited in what we can give you, but hopefully the following is enough to identify the code path. The reported behaviour is deterministic and the same on 4.2.6.1 and 4.4.0.0.
We have a regions.bed file with the following:
chr20 24024757 26348364
chr20 30456078 30744370
chr20 47896109 57063873
chr20 57064279 57357555
chr21 6789086 6934218
chr21 8395591 8396751
chr21 8522361 8595669
chr22 12868138 12904787
The invocation we have is:
CombineGVCFs \
-R /reference-data/hg38/v0/Homo_sapiens_assembly38.fasta \
--variant variants/person1.g.vcf.gz \
--variant variants/person2.g.vcf.gz \
--variant variants/person3.g.vcf.gz \
-L regions.bed \
-O variants/output.family.gvcf.gz
This deterministically fails with the following (clipped) output:
...
10:21:10.298 INFO ProgressMeter - chr20:48487060 7.0 17932000 2558394.5
10:21:20.301 INFO ProgressMeter - chr20:49481048 7.2 18414000 2566119.1
10:21:30.309 INFO ProgressMeter - chr20:50446168 7.3 18907000 2574967.6
10:21:40.319 INFO ProgressMeter - chr20:51538831 7.5 19404000 2583944.2
10:21:50.326 INFO ProgressMeter - chr20:52627754 7.7 19903000 2592819.4
10:22:00.327 INFO ProgressMeter - chr20:53836620 7.8 20405000 2601710.7
10:22:10.330 INFO ProgressMeter - chr20:55172242 8.0 20878000 2606611.2
10:22:20.334 INFO ProgressMeter - chr20:56528551 8.2 21358000 2612168.0
10:22:28.081 INFO CombineGVCFs - Shutting down engine
[20 May 2025 at 10:22:28 am AEST] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 8.90 minutes.
Runtime.totalMemory()=1157627904
org.broadinstitute.hellbender.exceptions.GATKException: Exception thrown at chr21:6789088 [VC /misc/neur2-archie_main/cpipe-wgs/batches/R0111_gene_steps/64812f4da53bce05ec5e0b9724a8ddb25c3acf88/variants/25W000035-FAM006067.merge.dragen.g.vcf.gz @ chr21:6789088-6789099 Q. of type=SYMBOLIC alleles=[T*, <NON_REF>] attr={END=6789099} GT=GT:DP:GQ:MIN_DP:PL ./.:1:3:1:0,3,45 filters=
at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$traverse$1(MultiVariantWalker.java:145)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.traverse(MultiVariantWalker.java:136)
at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.traverse(MultiVariantWalkerGroupedOnStart.java:165)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1098)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.lang.IllegalArgumentException: Features added out of order: previous (TabixFeature{referenceIndex=14, start=57357555, end=57357557, featureStartFilePosition=26677607118005, featureEndFilePosition=-1}) > next (TabixFeature{referenceIndex=14, start=6789086, end=6789086, featureStartFilePosition=26677607118451, featureEndFilePosition=-1})
at htsjdk.tribble.index.tabix.TabixIndexCreator.addFeature(TabixIndexCreator.java:89)
at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.add(IndexingVariantContextWriter.java:203)
at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:242)
at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.endPreviousStates(CombineGVCFs.java:420)
at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.mergeWithNewVCs(CombineGVCFs.java:338)
at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.apply(CombineGVCFs.java:177)
at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.apply(MultiVariantWalkerGroupedOnStart.java:133)
at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.apply(MultiVariantWalkerGroupedOnStart.java:108)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$traverse$1(MultiVariantWalker.java:139)
... 21 more
This is coming from the index creation code. If we add -OVI FALSE, the CombineGVCFs succeeds, but a subsequent IndexFeatureFile on the result fails with basically the same error.
We've validated the inputs with ValidateVariants and various invocations of bcftools to check that the source VCFs and their indexes are valid.
When we view the output VCF, when we run with -OVI FALSE, we can see the problem right away:
chr20 57357546 . T C,<NON_REF> . . BaseQRankSum=0.855;DP=110;ExcessHet=0;MQRankSum=-3.81;RAW_MQandDP=116800,33;ReadPosRankSum=0.643 GT:AD:DP:GP:GQ:MIN_DP:PG:PGT:PID:PL:PS:SB ./.:.:50:.:99:39:.:.:.:0,107,1574,107,1574,1574:.:. .|.:31,2,0:33:0,43.77,86.78,122,1374.77,165.01:44:.:0,34.77,37.78,29,63.77,32.01:0|1:57357518_A_AGG:0,9,49,93,1311,133:57357518:12,19,1,1 ./.:.:38:.:87:38:.:.:.:0,87,1305,87,1305,1305:.:.
chr20 57357547 . G A,<NON_REF> . . BaseQRankSum=0.796;DP=111;ExcessHet=0;MQRankSum=-3.81;RAW_MQandDP=116800,33;ReadPosRankSum=0.642 GT:AD:DP:GP:GQ:MIN_DP:PG:PGT:PID:PL:PS:SB ./.:.:50:.:99:39:.:.:.:0,107,1574,107,1574,1574:.:. .|.:31,2,0:33:0,43.77,86.78,122,1374.77,165.01:44:.:0,34.77,37.78,29,63.77,32.01:0|1:57357518_A_AGG:0,9,49,93,1311,133:57357518:12,19,1,1 ./.:.:39:.:90:39:.:.:.:0,90,1350,90,1350,1350:.:.
chr20 57357548 . G <NON_REF> . . END=57357549 GT:DP:GQ:MIN_DP:PL ./.:50:99:39:0,107,1574 ./.:33:84:33:0,84,1260 ./.:44:99:44:0,105,1575
chr20 57357550 . G A,<NON_REF> . . BaseQRankSum=0.796;DP=116;ExcessHet=0;MQRankSum=-3.81;RAW_MQandDP=116800,33;ReadPosRankSum=0.642 GT:AD:DP:GP:GQ:MIN_DP:PG:PGT:PID:PL:PS:SB ./.:.:50:.:99:39:.:.:.:0,107,1574,107,1574,1574:.:. .|.:31,2,0:33:0,43.77,86.78,123,1375.77,166.01:44:.:0,34.77,37.78,30,64.77,33.01:0|1:57357518_A_AGG:0,9,49,93,1311,133:57357518:12,19,1,1 ./.:.:44:.:99:44:.:.:.:0,105,1575,105,1575,1575:.:.
chr20 57357551 . T G,A,<NON_REF> . . BaseQRankSum=-0.358;DP=131;ExcessHet=0;MQRankSum=0;RAW_MQandDP=469600,131;ReadPosRankSum=0.902 GT:AD:DP:GP:GQ:PG:PL:SB ./.:24,23,0,0:47:40.23,0,43.01,867.26,961.03,910.27:38:0,34.77,37.78,39.03,73.8,42.04:75,0,40,863,922,903,863,922,903,903:12,12,12,11 ./.:13,16,2,0:31:40.23,0,42.01,470,435.77,478.01,545.23,513,1130,556.24:38:0,34.77,37.78,34.77,69.54,37.78,30,64.77,64.77,33.01:75,0,39,470,401,475,550,483,1100,558:7,6,6,12 ./.:16,21,0,0:37:40.23,0,42.01,749.26,602.03,644.27:38:0,34.77,37.78,39.03,73.8,42.04:75,0,39,745,563,637,745,563,637,637:9,7,8,13
chr20 57357552 . T <NON_REF> . . END=57357554 GT:DP:GQ:MIN_DP:PL ./.:45:99:45:0,119,1800 ./.:31:84:30:0,84,1260 ./.:37:99:37:0,108,1620
chr20 57357555 . GCC G,<NON_REF> . . BaseQRankSum=1.01;DP=113;DRAGstrInfo=1;DRAGstrParams=39;ExcessHet=0;MQRankSum=-4.313;RAW_MQandDP=106036,31;ReadPosRankSum=-0.301 GT:AD:DP:GP:GQ:MIN_DP:PG:PGT:PID:PL:PS:SB ./.:.:45:.:99:45:.:.:.:0,119,1800,119,1800,1800:.:. .|.:28,3,0:31:0,11,54.01,128,1217,171.01:11:.:0,39,42.01,30,69,33.01:0|1:57357518_A_AGG:28,0,40,126,1176,166:57357518:12,16,2,1 ./.:.:37:.:99:37:.:.:.:0,108,1620,108,1620,1620:.:.
chr20 6789086 . A *,<NON_REF> . . DP=31 GT:AD:DP:GP:GQ:PG:PGT:PID:PL:PS:SB ./.:.:.:.:.:.:.:.:.:.:. .|.:28,3,0:31:0,11,54.01,128,1217,171.01:11:0,39,42.01,30,69,33.01:0|1:57357518_A_AGG:28,0,40,126,1176,166:57357518:12,16,2,1 ./.:.:.:.:.:.:.:.:.:.:.
chr21 6789087 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0
chr21 6789088 . T <NON_REF> . . END=6789089 GT:DP:GQ:MIN_DP:PL ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:1:3:1:0,3,45
chr21 6789090 . T <NON_REF> . . END=6789099 GT:DP:GQ:MIN_DP:PL ./.:1:3:1:0,3,29 ./.:0:0:0:0,0,0 ./.:1:3:1:0,3,45
chr21 6789100 . G <NON_REF> . . END=6789150 GT:DP:GQ:MIN_DP:PL ./.:1:3:1:0,3,29 ./.:0:0:0:0,0,0 ./.:2:6:2:0,6,74
chr21 6789151 . C <NON_REF> . . END=6789156 GT:DP:GQ:MIN_DP:PL ./.:1:3:1:0,3,29 ./.:0:0:0:0,0,0 ./.:3:9:3:0,9,106
chr21 6789157 . G <NON_REF> . . END=6789225 GT:DP:GQ:MIN_DP:PL ./.:2:6:2:0,6,61 ./.:0:0:0:0,0,0 ./.:3:9:3:0,9,106
chr21 6789226 . T <NON_REF> . . END=6789228 GT:DP:GQ:MIN_DP:PL ./.:2:6:2:0,6,61 ./.:0:0:0:0,0,0 ./.:3:6:2:0,6,61
chr21 6789229 . T <NON_REF> . . END=6789238 GT:DP:GQ:MIN_DP:PL ./.:2:3:2:0,3,45 ./.:0:0:0:0,0,0 ./.:3:6:2:0,6,61
chr21 6789239 . T <NON_REF> . . END=6789240 GT:DP:GQ:MIN_DP:PL ./.:3:6:2:0,6,90 ./.:0:0:0:0,0,0 ./.:3:6:2:0,6,61
Notice that the last line for chr20 has the first position for the region from chr21. This yields out-of-order positions for chr20, which causes the HTSLib indexer to throw.
I note that the error line is the start of a new region on a new chromosome, and that it also has a * alt.
Happy to provide what more context I can, allowing for the fact that the samples themselves are clinical.
Tom.