clinvar
VCF files always contain N instead of any other IUPAC code
Variant 15163 has the ALT allele 'YT', which is correctly encoded in the XML file (note the last output line shown):
$ zgrep -A 12 Acc=\"VCV000015163\" ClinVarFullRelease_2024-0107.xml.gz
<MeasureSet Type="Variant" ID="15163" Acc="VCV000015163" Version="2">
<Measure Type="Indel" ID="30202">
<Name>
<ElementValue Type="Preferred">NM_000518.5(HBB):c.152_153delinsAR (p.Thr51Lys)</ElementValue>
</Name>
<Name>
<ElementValue Type="Alternate">T50K</ElementValue>
<XRef Type="Allelic variant" ID="141900.0073" DB="OMIM"/>
</Name>
<Name>
<ElementValue Type="Alternate">Hb Edmonton</ElementValue>
</Name>
<CanonicalSPDI>NC_000011.10:5226738:AG:YT</CanonicalSPDI>
However, the corresponding VCF file contains the variant with ALT 'NT':
$ zgrep -P "\t15163\t" clinvar_20240107.vcf.gz
11 5247969 15163 AG NT . . ALLELEID=30202;CLNDISDB=.;CLNDN=HEMOGLOBIN_EDMONTON;CLNHGVS=NC_000011.9:g.5247969_5247970delinsYT;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=other;CLNVC=Indel;CLNVCSO=SO:1000032;CLNVI=HBVAR:331|LOVD_3:HBB_004062|OMIM:141900.0073;GENEINFO=HBB:3043|LOC106099062:106099062|LOC107133510:107133510;MC=SO:0001583|missense_variant;ORIGIN=1
The same happens for multiple other variants in the VCF files. In fact, I have not observed any IUPAC ambiguity code other than 'N' in the VCF files. Is this a bug, or is it intentional in order to follow the VCF specification?
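To check this observation more systematically, here is a minimal sketch that tallies IUPAC ambiguity codes in the ALT column of a VCF file. The function names `count_ambiguity_codes` and `count_in_vcf` are my own, not part of any ClinVar tooling, and the sketch assumes a tab-separated VCF with ALT in the fifth column; symbolic alleles and breakends are skipped.

```python
import gzip
from collections import Counter

# IUPAC nucleotide ambiguity codes (everything except plain A, C, G, T).
IUPAC_AMBIGUITY = set("RYSWKMBDHVN")


def count_ambiguity_codes(lines):
    """Count IUPAC ambiguity codes seen in the ALT column of VCF lines.

    Hypothetical helper: `lines` is any iterable of VCF text lines.
    """
    counts = Counter()
    for line in lines:
        # Skip meta-information lines, the header line, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        # ALT may hold several comma-separated alleles.
        for allele in fields[4].split(","):
            # Ignore symbolic alleles (<DEL>) and breakend notation.
            if allele.startswith("<") or "[" in allele or "]" in allele:
                continue
            for base in allele.upper():
                if base in IUPAC_AMBIGUITY:
                    counts[base] += 1
    return counts


def count_in_vcf(path):
    """Run the tally over a gzip-compressed VCF file."""
    with gzip.open(path, "rt") as handle:
        return count_ambiguity_codes(handle)
```

Calling `count_in_vcf("clinvar_20240107.vcf.gz")` returns a `Counter` keyed by ambiguity code. If 'N' is the only key, that is consistent with the observation above that all other codes have been replaced.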