htsjdk
String fields in VCFs/BCFs
Subject of the issue
Question regarding String fields in VCFs/BCFs. Allele-specific annotation fields such as AS_RAW_ReadPosRankSum are encoded as Strings in VCFs. The header specifies the field as:
##INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
Data lines contain INFO entries similar to: AS_RAW_ReadPosRankSum=|-0.2,1|NaN
This works fine as long as VariantContext.fullyDecode isn't invoked. Tools such as BCF2Writer do invoke it, and at that point I see the following exception:
Exception in thread "main" htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Discordant field size detected for field AS_RAW_ReadPosRankSum at 20:10000117. Field had 2 values but the header says this should have 1 values based on header record INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
at htsjdk.variant.variantcontext.VariantContext.fullyDecodeAttributes(VariantContext.java:1571)
at htsjdk.variant.variantcontext.VariantContext.fullyDecodeInfo(VariantContext.java:1546)
at htsjdk.variant.variantcontext.VariantContext.fullyDecode(VariantContext.java:1530)
at htsjdk.variant.variantcontext.writer.BCF2Writer.add(BCF2Writer.java:197)
at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1366)
at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1416)
at TestGenomicsDBImporterWithMergedVCFHeader.main(TestGenomicsDBImporterWithMergedVCFHeader.java:215)
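The mismatch reported in the exception can be reproduced without htsjdk. The following is a minimal stand-alone sketch, not the htsjdk implementation: the class and method names are hypothetical, and it only assumes that the decoder splits the raw INFO value on "," and then compares the token count with the Number declared in the header.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this does not call or reproduce htsjdk code.
final class InfoTokenizeSketch {

    // Split a raw INFO value on ',' as a list-valued field would be tokenized.
    // The -1 limit keeps trailing empty tokens instead of dropping them.
    static List<String> tokenize(String rawValue) {
        return Arrays.asList(rawValue.split(",", -1));
    }

    // Compare the number of tokens with a fixed Number from the header line.
    static boolean matchesDeclaredCount(String rawValue, int declaredCount) {
        return tokenize(rawValue).size() == declaredCount;
    }

    public static void main(String[] args) {
        String raw = "|-0.2,1|NaN"; // value from the data line in this report
        System.out.println("tokens: " + tokenize(raw));
        System.out.println("matches Number=1: " + matchesDeclaredCount(raw, 1));
    }
}
```

With the value from this report, the comma inside the string produces two tokens, which is what conflicts with Number=1.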
Questions: Does the VCF standard support fields that are vector of Strings? I couldn't find anything in the spec related to this topic.
- If yes:
- Are there delimiters/escape characters to handle this case? Specifically, how do we distinguish between [ "A", "B", "C" ] and [ "A,B,C" ]?
- If no:
- Does this mean that the Number descriptor of String fields must always be 1 (or just irrelevant)?
- Should the String field value not be tokenized as is done currently in the htsjdk VCF codec using "," as the delimiter?
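To make the ambiguity in the questions above concrete, here is a small sketch. It assumes that list values are written by joining them with "," and that no escaping is applied; the class and method names are hypothetical and not part of htsjdk.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the serialization ambiguity; not htsjdk code.
final class InfoListAmbiguitySketch {

    // Join the values with ',' and apply no escaping.
    static String serialize(List<String> values) {
        return String.join(",", values);
    }

    public static void main(String[] args) {
        String threeValues = serialize(Arrays.asList("A", "B", "C"));
        String oneValue = serialize(Collections.singletonList("A,B,C"));
        // Both lists produce the same text, so a reader cannot tell them apart.
        System.out.println(threeValues.equals(oneValue));
    }
}
```

Without an escape mechanism, the two lists are indistinguishable once written, so a reader has to rely on the Number declared in the header.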
Your environment
- version of htsjdk: 2.13.2
- version of java: 1.8.0_151
- CentOS-7.4
Steps to reproduce
Given a VCF file with allele specific annotation fields that are listed as Strings in the header, convert the file to BCF. For example: https://github.com/broadinstitute/gatk/blob/master/src/test/resources/org/broadinstitute/hellbender/tools/haplotypecaller/expected.testGVCFMode.gatk4.alleleSpecific.g.vcf
Expected behaviour
The String value should not be tokenized
Actual behaviour
String value gets tokenized to List<String>
Looks like a problem with GATK, not htsjdk. See https://gatkforums.broadinstitute.org/gatk/discussion/11107/gatk4beta6-annotation-incompatibility-between-haplotypecaller-and-genomicsdbimport
The issue is in htsjdk (actually I'm not clear if it's an issue or something missing in the VCF spec - see my questions in the description). The "issue" appears in the GATK tool but the root cause is in htsjdk.
Does the VCF standard support fields that are vector of Strings? I couldn't find anything in the spec related to this topic.
Yes. See http://samtools.github.io/hts-specs/VCFv4.3.pdf section 1.6.1.8:
INFO - additional information: (String, no semi-colons or equals-signs permitted; commas are permitted only as delimiters for lists of values; characters with special meaning can be encoded using the percent encoding, see Section 1.2; space characters are allowed)
htsjdk supports VCFv4.2 (http://samtools.github.io/hts-specs/VCFv4.2.pdf). The VCFv4.2 spec (section 1.4.1.8) does not include percent-encoding support for commas, so in VCFv4.2 a comma cannot appear inside an INFO value at all; it can only separate list elements:
INFO - additional information: (String, no white-space, semi-colons, or equals-signs permitted; commas are permitted only as delimiters for lists of values)
Given a VCF file with allele specific annotation fields that are listed as Strings in the header, convert the file to BCF. For example: https://github.com/broadinstitute/gatk/blob/master/src/test/resources/org/broadinstitute/hellbender/tools/haplotypecaller/expected.testGVCFMode.gatk4.alleleSpecific.g.vcf
This looks like an issue with both htsjdk and GATK, @lindenb. htsjdk isn't disallowing commas (VCFv4.2) or percent-encoding them (VCFv4.3), and AS_RAW_BaseQRankSum is written as a single-value String INFO field whose value includes commas.
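As a rough sketch of what percent-encoding under VCFv4.3 could look like for a single-value String field: the class and method names below are hypothetical and are not htsjdk API. The sketch escapes only '%', ',', ';' and '=', and leaves out the tab, CR and LF codes that the spec also lists. The decoder accepts any two-digit hex escape, which is a simplification.

```java
// Hypothetical helper; a sketch of VCFv4.3-style percent encoding, not htsjdk code.
final class InfoPercentEncodingSketch {

    // Escape characters that would otherwise be read as INFO delimiters.
    static String encode(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '%': out.append("%25"); break; // escape the escape character itself
                case ',': out.append("%2C"); break;
                case ';': out.append("%3B"); break;
                case '=': out.append("%3D"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Reverse any %XX escape; a '%' not followed by two hex digits is kept as-is.
    static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) (hi * 16 + lo));
                    i += 3;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "|-0.2,1|NaN";
        String encoded = encode(raw);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(raw));
    }
}
```

An encoded value contains no literal comma, so splitting on "," would leave it as a single token, consistent with Number=1.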
Hello Karthik! Could you please post which tool you used to produce this case? If possible, may I use your VCF to test it? Thank you!
Thank you @d-cameron, you helped me understand why I was spec-adherent (to VCFv4.3) but still getting exceptions when parsing my VCFs with HTSJDK. I simply had spaces in my INFO values, which VCFv4.3 allows but VCFv4.2 does not, a backwards-incompatible change.