String fields in VCFs/BCFs

Open kgururaj opened this issue 7 years ago • 5 comments

Subject of the issue

Question regarding String fields in VCFs/BCFs. Allele-specific annotation fields such as AS_RAW_ReadPosRankSum are coded as Strings in VCFs. The header specifies the field as:

##INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">

Data lines contain INFO entries similar to: AS_RAW_ReadPosRankSum=|-0.2,1|NaN

This works fine as long as VariantContext.fullyDecode isn't invoked. Tools such as BCF2Writer do invoke it, and at that point I see the following exception:

Exception in thread "main" htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Discordant field size detected for field AS_RAW_ReadPosRankSum at 20:10000117.  Field had 2 values but the header says this should have 1 values based on header record INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
        at htsjdk.variant.variantcontext.VariantContext.fullyDecodeAttributes(VariantContext.java:1571)
        at htsjdk.variant.variantcontext.VariantContext.fullyDecodeInfo(VariantContext.java:1546)
        at htsjdk.variant.variantcontext.VariantContext.fullyDecode(VariantContext.java:1530)
        at htsjdk.variant.variantcontext.writer.BCF2Writer.add(BCF2Writer.java:197)
        at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1366)
        at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1416)
        at TestGenomicsDBImporterWithMergedVCFHeader.main(TestGenomicsDBImporterWithMergedVCFHeader.java:215)
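To make the mismatch concrete, here is a minimal sketch of the kind of check that appears to produce this error: split the raw INFO value on commas, then compare the token count with the header's Number. The class and method names are hypothetical and this is not htsjdk's actual code; it only mirrors the behaviour described by the exception message.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration only; NOT htsjdk's implementation.
final class InfoFieldSizeCheck {

    // Split a raw INFO value on commas, the list delimiter in VCF.
    // The -1 limit keeps trailing empty tokens.
    static List<String> tokenize(String rawValue) {
        return Arrays.asList(rawValue.split(",", -1));
    }

    // True when the token count equals the count declared by Number=<declaredCount>.
    static boolean matchesDeclaredCount(String rawValue, int declaredCount) {
        return tokenize(rawValue).size() == declaredCount;
    }

    public static void main(String[] args) {
        String raw = "|-0.2,1|NaN"; // value from the data line in this report
        // Two tokens, "|-0.2" and "1|NaN", although the header declares Number=1.
        System.out.println(tokenize(raw));
        System.out.println(matchesDeclaredCount(raw, 1)); // false
    }
}
```

Under this reading, the comma inside the allele-specific value is indistinguishable from a list delimiter, which is why a Number=1 String field is reported as having 2 values.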

Questions: Does the VCF standard support fields that are vectors of Strings? I couldn't find anything in the spec related to this topic.

  • If yes:
    • Are there delimiters/escape characters to handle this case? Specifically, how do we distinguish between [ "A", "B", "C" ] and [ "A,B,C" ]?
  • If no:
    • Does this mean that the Number descriptor of String fields must always be 1 (or just irrelevant)?
    • Should the String field value not be tokenized as is done currently in the htsjdk VCF codec using "," as the delimiter?

Your environment

  • version of htsjdk: 2.13.2
  • version of java: 1.8.0_151
  • CentOS-7.4

Steps to reproduce

Given a VCF file with allele specific annotation fields that are listed as Strings in the header, convert the file to BCF. For example: https://github.com/broadinstitute/gatk/blob/master/src/test/resources/org/broadinstitute/hellbender/tools/haplotypecaller/expected.testGVCFMode.gatk4.alleleSpecific.g.vcf

Expected behaviour

The String value should not be tokenized

Actual behaviour

String value gets tokenized to List<String>

kgururaj avatar Jan 05 '18 20:01 kgururaj

Looks like a problem with GATK, not htsjdk. See https://gatkforums.broadinstitute.org/gatk/discussion/11107/gatk4beta6-annotation-incompatibility-between-haplotypecaller-and-genomicsdbimport

lindenb avatar Jan 05 '18 20:01 lindenb

The issue is in htsjdk (actually I'm not clear if it's an issue or something missing in the VCF spec - see my questions in the description). The "issue" appears in the GATK tool but the root cause is in htsjdk.

kgururaj avatar Jan 05 '18 20:01 kgururaj

Does the VCF standard support fields that are vector of Strings? I couldn't find anything in the spec related to this topic.

Yes. See http://samtools.github.io/hts-specs/VCFv4.3.pdf section 1.6.1.8: INFO - additional information: (String, no semi-colons or equals-signs permitted; commas are permitted only as delimiters for lists of values; characters with special meaning can be encoded using the percent encoding, see Section 1.2; space characters are allowed)

htsjdk supports VCFv4.2 (http://samtools.github.io/hts-specs/VCFv4.2.pdf). The VCFv4.2 version (section 1.4.1.8) does not include percent encoding support for commas so in VCFv4.2 commas are not supported in INFO data at all: INFO - additional information: (String, no white-space, semi-colons, or equals-signs permitted; commas are permitted only as delimiters for lists of values)
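As a rough illustration of how VCFv4.3 percent encoding separates [ "A", "B", "C" ] from [ "A,B,C" ], the sketch below encodes each list element before joining with commas and decodes each token after splitting. The class and its methods are hypothetical helpers, not htsjdk API, and only a subset of the special characters listed in Section 1.2 of the v4.3 spec (percent, comma, semicolon, equals, colon) is handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of VCFv4.3-style percent encoding for String lists; NOT part of htsjdk.
final class VcfPercentEncoding {

    // Encode characters with special meaning. '%' goes first so that the
    // escapes produced by the later replacements are not encoded again.
    static String encode(String s) {
        return s.replace("%", "%25")
                .replace(",", "%2C")
                .replace(";", "%3B")
                .replace("=", "%3D")
                .replace(":", "%3A");
    }

    // Reverse of encode. "%25" is decoded last so a literal '%' cannot
    // combine with following characters into a spurious escape.
    static String decode(String s) {
        return s.replace("%3A", ":")
                .replace("%3D", "=")
                .replace("%3B", ";")
                .replace("%2C", ",")
                .replace("%25", "%");
    }

    // Encode each element, then join with the list delimiter.
    static String joinList(List<String> values) {
        List<String> encoded = new ArrayList<>();
        for (String v : values) {
            encoded.add(encode(v));
        }
        return String.join(",", encoded);
    }

    // Split on the list delimiter, then decode each token.
    static List<String> splitList(String raw) {
        List<String> decoded = new ArrayList<>();
        for (String token : raw.split(",", -1)) {
            decoded.add(decode(token));
        }
        return decoded;
    }

    public static void main(String[] args) {
        System.out.println(joinList(Arrays.asList("A", "B", "C")));       // A,B,C
        System.out.println(joinList(Collections.singletonList("A,B,C"))); // A%2CB%2CC
    }
}
```

With this scheme the value from the report would be written as |-0.2%2C1|NaN to remain a single String under Number=1. In VCFv4.2 there is no such escape, so no equivalent encoding is available.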

Given a VCF file with allele specific annotation fields that are listed as Strings in the header, convert the file to BCF. For example: https://github.com/broadinstitute/gatk/blob/master/src/test/resources/org/broadinstitute/hellbender/tools/haplotypecaller/expected.testGVCFMode.gatk4.alleleSpecific.g.vcf

This looks like an issue with both htsjdk and GATK @lindenb. htsjdk isn't disallowing commas (VCFv4.2) or percent-encoding them (VCFv4.3), and AS_RAW_BaseQRankSum is being written as a single-value String INFO field whose value includes commas.

d-cameron avatar Jan 17 '18 02:01 d-cameron

Hello Karthik! Could you please post which tool you used to produce this case? If possible, can I use your VCF to test it? Thank you!

merceneryinbox avatar Mar 27 '18 09:03 merceneryinbox

Thank you @d-cameron, you helped me understand why I was spec-adherent (to VCFv4.3) but still getting exceptions parsing my VCFs with HTSJDK. I simply had spaces in my INFO values, which VCFv4.3 allows but VCFv4.2 does not (a backwards-incompatible change).
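For anyone hitting the same thing, a small sketch of the difference, based only on the INFO wording quoted earlier in this thread. The class name is hypothetical, the checks are a simplification rather than a full validator, and rejecting non-space whitespace (such as tabs) under v4.3 is an assumption made here because tabs delimit VCF columns.

```java
// Simplified character checks for a single INFO value; NOT htsjdk code.
final class InfoValueCharCheck {

    // VCFv4.2 wording: no white-space, semicolons, or equals-signs permitted.
    static boolean allowedInV42(String value) {
        for (char c : value.toCharArray()) {
            if (Character.isWhitespace(c) || c == ';' || c == '=') {
                return false;
            }
        }
        return true;
    }

    // VCFv4.3 wording: no semicolons or equals-signs; space characters are allowed.
    // Assumption: other whitespace (e.g. tab) is still rejected.
    static boolean allowedInV43(String value) {
        for (char c : value.toCharArray()) {
            if (c == ';' || c == '=') {
                return false;
            }
            if (c != ' ' && Character.isWhitespace(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String value = "some annotation"; // contains a space
        System.out.println(allowedInV42(value)); // false
        System.out.println(allowedInV43(value)); // true
    }
}
```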

clintval avatar Jul 29 '21 00:07 clintval