RUFUS icon indicating copy to clipboard operation
RUFUS copied to clipboard

RUFUS creates incomplete VCF header

Open moldach opened this issue 3 years ago • 1 comments

I've noticed the genotype in the FORMAT field is often 0/0, 0/0, 0/0 so there is likely some filtering I need to do.

I was hoping to use gatk VariantFiltration to do this (unsure of other tools I could use); however, I get an error when supplying a VCF from RUFUS:

Key MATEID found in VariantContext field INFO at 1:121485143 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

$ bgzip -d proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf .gz
$ gatk --java-options "-Xmx1G" VariantFiltration -V proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf -O trio_VF.vcf --genotype-filter-expression "isHet == 1" --genotype-filter-name "isHetFilter"
Using GATK jar /gpfs/home/moldach/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx1G -jar /gpfs/home/moldach/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar VariantFiltration -V proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf -O trio_VF.vcf --genotype-filter-expression isHet == 1 --genotype-filter-name isHetFilter
13:37:26.736 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/home/moldach/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Oct 20, 2020 1:37:26 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
13:37:26.882 INFO  VariantFiltration - ------------------------------------------------------------
13:37:26.882 INFO  VariantFiltration - The Genome Analysis Toolkit (GATK) v4.1.9.0
13:37:26.882 INFO  VariantFiltration - For support and documentation go to https://software.broadinstitute.org/gatk/
13:37:26.882 INFO  VariantFiltration - Executing as [email protected] on Linux v3.10.0-514.26.2.el7.x86_64 amd64
13:37:26.883 INFO  VariantFiltration - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_252-b09
13:37:26.883 INFO  VariantFiltration - Start Date/Time: October 20, 2020 1:37:26 MDT PM
13:37:26.883 INFO  VariantFiltration - ------------------------------------------------------------
13:37:26.883 INFO  VariantFiltration - ------------------------------------------------------------
13:37:26.884 INFO  VariantFiltration - HTSJDK Version: 2.23.0
13:37:26.884 INFO  VariantFiltration - Picard Version: 2.23.3
13:37:26.884 INFO  VariantFiltration - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:37:26.884 INFO  VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:37:26.884 INFO  VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:37:26.884 INFO  VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:37:26.884 INFO  VariantFiltration - Deflater: IntelDeflater
13:37:26.884 INFO  VariantFiltration - Inflater: IntelInflater
13:37:26.884 INFO  VariantFiltration - GCS max retries/reopens: 20
13:37:26.884 INFO  VariantFiltration - Requester pays: disabled
13:37:26.884 INFO  VariantFiltration - Initializing engine
13:37:27.185 INFO  FeatureManager - Using codec VCFCodec to read file file:///gpfs/home/moldach/projects/saliva/ensembl/bwa-trio/RUFUS-TRIO/proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf
13:37:27.207 INFO  VariantFiltration - Done initializing engine
13:37:27.258 WARN  VariantFiltration - A variant index will not be created - a sequence dictionary is required to create an output index
13:37:27.281 INFO  ProgressMeter - Starting traversal
13:37:27.281 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
13:37:27.308 INFO  VariantFiltration - Shutting down engine
[October 20, 2020 1:37:27 MDT PM] org.broadinstitute.hellbender.tools.walkers.filters.VariantFiltration done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=1024458752
java.lang.IllegalStateException: Key MATEID found in VariantContext field INFO at 1:121485143 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
	at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:213)
	at htsjdk.variant.vcf.VCFEncoder.write(VCFEncoder.java:146)
	at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:250)
	at org.broadinstitute.hellbender.tools.walkers.filters.VariantFiltration.apply(VariantFiltration.java:354)
	at org.broadinstitute.hellbender.engine.VariantWalker.lambda$traverse$0(VariantWalker.java:104)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.Iterator.forEachRemaining(Iterator.java:116)
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1049)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)

moldach avatar Oct 20 '20 19:10 moldach

Actually, the genotyping in RUFUS is not very sophisticated yet and I wouldn't worry about removing those. Have you looked at any of these in IGV? If a call is made I would trust that the variant is there and ignore the genotyping. We are working on a more sophisticated method but currently RUFUS just uses allele percentage to make the genotypes. In well behaved regions where for example you have 15 reference kmers and 15 alt kmers it works great. But in highly repetitive regions you may get 15 alt kmers and 1000 reference kmers and RUFUS and the program sees a ref alt balance of 98.5% ref to 1.4% alt and assigns 0/0.

This missing MATEID header id definitely a bug and ill fix that. Thanks for reporting it.

Andrew Farrell PhD
Director of Research and Science
Department of Human Genetics
USTAR Center for Genetic Discovery
Eccles Institute of Human Genetics
University of Utah School of Medicine​ 15 North 2030 East, Room 7140
Salt Lake City, UT 84112-5330
Email: [email protected]
http://marthlab.org/ http://marthlab.org/=====================================

On Oct 20, 2020, at 1:44 PM, Matthew J. Oldach [email protected] wrote:

I've noticed the genotype in the FORMAT field is often 0/0, 0/0, 0/0 so there is likely some filtering I need to do.

I was hoping to use gatk VariantFiltration to do this (unsure of other tools I could use); however, I get an error when supplying a VCF from RUFUS:

Key MATEID found in VariantContext field INFO at 1:121485143 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

$ bgzip -d proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf .gz $ gatk --java-options "-Xmx1G" VariantFiltration -V proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf -O trio_VF.vcf --genotype-filter-expression "isHet == 1" --genotype-filter-name "isHetFilter" Using GATK jar /gpfs/home/moldach/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx1G -jar /gpfs/home/moldach/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar VariantFiltration -V proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf -O trio_VF.vcf --genotype-filter-expression isHet == 1 --genotype-filter-name isHetFilter 13:37:26.736 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/home/moldach/bin/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so Oct 20, 2020 1:37:26 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine INFO: Failed to detect whether we are running on Google Compute Engine. 13:37:26.882 INFO VariantFiltration - ------------------------------------------------------------ 13:37:26.882 INFO VariantFiltration - The Genome Analysis Toolkit (GATK) v4.1.9.0 13:37:26.882 INFO VariantFiltration - For support and documentation go to https://software.broadinstitute.org/gatk/ 13:37:26.882 INFO VariantFiltration - Executing as [email protected] on Linux v3.10.0-514.26.2.el7.x86_64 amd64 13:37:26.883 INFO VariantFiltration - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_252-b09 13:37:26.883 INFO VariantFiltration - Start Date/Time: October 20, 2020 1:37:26 MDT PM 13:37:26.883 INFO VariantFiltration - ------------------------------------------------------------ 13:37:26.883 INFO VariantFiltration - ------------------------------------------------------------ 13:37:26.884 INFO VariantFiltration - HTSJDK Version: 2.23.0 13:37:26.884 INFO VariantFiltration - Picard Version: 2.23.3 13:37:26.884 INFO VariantFiltration - HTSJDK Defaults.COMPRESSION_LEVEL : 2 13:37:26.884 INFO VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 13:37:26.884 INFO VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 13:37:26.884 INFO VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 13:37:26.884 INFO VariantFiltration - Deflater: IntelDeflater 13:37:26.884 INFO VariantFiltration - Inflater: IntelInflater 13:37:26.884 INFO VariantFiltration - GCS max retries/reopens: 20 13:37:26.884 INFO VariantFiltration - Requester pays: disabled 13:37:26.884 INFO VariantFiltration - Initializing engine 13:37:27.185 INFO FeatureManager - Using codec VCFCodec to read file file:///gpfs/home/moldach/projects/saliva/ensembl/bwa-trio/RUFUS-TRIO/proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf 13:37:27.207 INFO VariantFiltration - Done initializing engine 13:37:27.258 WARN VariantFiltration - A variant index will not be created - a sequence dictionary is required to create an output index 13:37:27.281 INFO ProgressMeter - Starting traversal 13:37:27.281 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute 13:37:27.308 INFO VariantFiltration - Shutting down engine [October 20, 2020 1:37:27 MDT PM] org.broadinstitute.hellbender.tools.walkers.filters.VariantFiltration done. Elapsed time: 0.01 minutes. Runtime.totalMemory()=1024458752 java.lang.IllegalStateException: Key MATEID found in VariantContext field INFO at 1:121485143 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default. at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:213) at htsjdk.variant.vcf.VCFEncoder.write(VCFEncoder.java:146) at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:250) at org.broadinstitute.hellbender.tools.walkers.filters.VariantFiltration.apply(VariantFiltration.java:354) at org.broadinstitute.hellbender.engine.VariantWalker.lambda$traverse$0(VariantWalker.java:104) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.Iterator.forEachRemaining(Iterator.java:116) at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1049) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/jandrewrfarrell/RUFUS/issues/17, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABSVWHSODQ3L2COH5EOV64TSLXSABANCNFSM4SYZOO6A.

jandrewrfarrell avatar Oct 28 '20 15:10 jandrewrfarrell