
VCF files always contain N instead of any other IUPAC code

Open Toromtomtom opened this issue 5 months ago • 0 comments

Variant 15163 has ALT 'YT', which is correctly encoded in the XML file (note the last output line shown):

$ zgrep -A 12 Acc=\"VCV000015163\" ClinVarFullRelease_2024-0107.xml.gz
    <MeasureSet Type="Variant" ID="15163" Acc="VCV000015163" Version="2">
      <Measure Type="Indel" ID="30202">
        <Name>
          <ElementValue Type="Preferred">NM_000518.5(HBB):c.152_153delinsAR (p.Thr51Lys)</ElementValue>
        </Name>
        <Name>
          <ElementValue Type="Alternate">T50K</ElementValue>
          <XRef Type="Allelic variant" ID="141900.0073" DB="OMIM"/>
        </Name>
        <Name>
          <ElementValue Type="Alternate">Hb Edmonton</ElementValue>
        </Name>
        <CanonicalSPDI>NC_000011.10:5226738:AG:YT</CanonicalSPDI>

However, the corresponding VCF file contains the variant with ALT 'NT':

$ zgrep -P "\t15163\t" clinvar_20240107.vcf.gz
11      5247969 15163   AG      NT      .       .       ALLELEID=30202;CLNDISDB=.;CLNDN=HEMOGLOBIN_EDMONTON;CLNHGVS=NC_000011.9:g.5247969_5247970delinsYT;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=other;CLNVC=Indel;CLNVCSO=SO:1000032;CLNVI=HBVAR:331|LOVD_3:HBB_004062|OMIM:141900.0073;GENEINFO=HBB:3043|LOC106099062:106099062|LOC107133510:107133510;MC=SO:0001583|missense_variant;ORIGIN=1

This also occurs for multiple other variants in the VCF files. In fact, I have not observed any IUPAC ambiguity code other than 'N' in the VCF files. Is this intentional, in order to follow the VCF specification?
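
For anyone who wants to repeat the check, here is a rough sketch of how the ALT column could be scanned for IUPAC ambiguity codes other than 'N'. The function name `iupac_alts` is made up for illustration; it assumes a tab-separated VCF on stdin with ALT in column 5, and it skips symbolic alleles such as `<DEL>`, which would otherwise match on their letters.

```shell
# Print VCF records whose ALT column (field 5) contains an IUPAC
# ambiguity code other than N (R, Y, S, W, K, M, B, D, H, V).
# Header lines (starting with '#') and symbolic alleles (starting
# with '<') are skipped.
iupac_alts() {
  awk -F'\t' '!/^#/ && $5 !~ /^</ && $5 ~ /[RYSWKMBDHV]/'
}

# Example usage (file name taken from the command above):
#   zcat clinvar_20240107.vcf.gz | iupac_alts | head
```

An empty result from this scan would be consistent with the observation that only 'N' appears in the VCF.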

Toromtomtom, Jan 09 '24 12:01