VCF specs - alternative to duplicate <CNV> ALT alleles?
While reviewing the current version of the VCF 4.5 specification, I noticed that the first example given under "5.6 Representing copy number variation" is:
chr1 100 . T <CNV>,<CNV> . . SVLEN=30,30;CN=1,2 GT:CN 1/2:3
where the ALT alleles cannot be distinguished without looking at the (optional!) INFO/CN entry.
This can be a major headache for data management. Can we define an alternate representation of this with unique allele codes?
What is the data management issue that this introduces? Is it the assumption that ALT is unique for each record? Multiple identical ALT alleles already occur for other symbolic alleles (e.g. two <DEL>s that start at the same position and end at different positions), so this isn't new for 4.5.
Can we define an alternate representation of this with unique allele codes?
Implementations are free to use subtypes as they see fit. For example, some groups redundantly encode the actual ASCN in the ALT field:
chr1 100 . T <CNV:CN1>,<CNV:CN2> . . SVLEN=30,30;CN=1,2 GT:CN 1/2:3
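The subtype encoding above can be derived mechanically from the original record. A minimal Python sketch, assuming tab-delimited VCF data lines and a per-allele INFO/CN value as in the example; the function names `parse_info` and `disambiguate_cnv_alts` are hypothetical and not part of any VCF library:

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries map to True."""
    out = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def disambiguate_cnv_alts(line):
    """Rewrite <CNV> ALT alleles as <CNV:CNn> using the per-allele INFO/CN values.

    Assumes a tab-delimited data line. The record is returned unchanged when
    INFO/CN is absent or does not carry one value per ALT allele.
    """
    line = line.rstrip("\n")
    fields = line.split("\t")
    alts = fields[4].split(",")
    cn = parse_info(fields[7]).get("CN")
    if not isinstance(cn, str):
        return line  # no INFO/CN value to disambiguate with
    cns = cn.split(",")
    if len(cns) != len(alts):
        return line  # CN is not one-per-ALT, leave the record alone
    fields[4] = ",".join(
        f"<CNV:CN{c}>" if a == "<CNV>" and c != "." else a
        for a, c in zip(alts, cns)
    )
    return "\t".join(fields)


record = "chr1\t100\t.\tT\t<CNV>,<CNV>\t.\t.\tSVLEN=30,30;CN=1,2\tGT:CN\t1/2:3"
print(disambiguate_cnv_alts(record))
```

Records without INFO/CN pass through untouched, so the rewrite only applies where the copy number is actually known.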
What is the data management issue that this introduces? Is it the assumption that ALT is unique for each record? Multiple identical ALT alleles already occur for other symbolic alleles (e.g. two <DEL>s that start at the same position and end at different positions), so this isn't new for 4.5.
Two operations that are affected by this are (i) allele frequency import and (ii) dataset merge. Yes, this applies to multiple-<DEL> as well.
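To illustrate why both operations are affected: a key built from CHROM, POS, REF and ALT alone collides for the two <CNV> alleles. A rough Python sketch, assuming tab-delimited data lines and per-allele INFO/SVLEN and INFO/CN values; the helper names `duplicate_alts` and `allele_keys` are made up for this example:

```python
from collections import Counter


def duplicate_alts(line):
    """Return the ALT alleles that occur more than once in a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    counts = Counter(fields[4].split(","))
    return sorted(alt for alt, n in counts.items() if n > 1)


def allele_keys(line):
    """Build one (CHROM, POS, REF, ALT, SVLEN, CN) key per ALT allele.

    SVLEN and CN are taken from INFO when they carry one value per ALT
    allele; otherwise "." is used as a placeholder.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], fields[1], fields[3]
    alts = fields[4].split(",")
    info = dict(
        item.split("=", 1) for item in fields[7].split(";") if "=" in item
    )

    def per_allele(tag):
        if tag not in info:
            return ["."] * len(alts)
        values = info[tag].split(",")
        return values if len(values) == len(alts) else ["."] * len(alts)

    svlens, cns = per_allele("SVLEN"), per_allele("CN")
    return [
        (chrom, pos, ref, alt, svlen, cn)
        for alt, svlen, cn in zip(alts, svlens, cns)
    ]


record = "chr1\t100\t.\tT\t<CNV>,<CNV>\t.\t.\tSVLEN=30,30;CN=1,2\tGT:CN\t1/2:3"
print(duplicate_alts(record))
for key in allele_keys(record):
    print(key)
```

The extended keys are only unique when the optional INFO fields are present, which is exactly the dependency on INFO/CN described in the original question.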
Implementations are free to use subtypes as they see fit. For example, some groups redundantly encode the actual ASCN in the ALT field:
chr1 100 . T <CNV:CN1>,<CNV:CN2> . . SVLEN=30,30;CN=1,2 GT:CN 1/2:3
Ok, I will recommend this approach to others when the problem comes up.
At least one other way I've seen this encoded "in the wild" is the 1000 Genomes SV VCF, which adds "RD_CN" to the genotype fields:
https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20210124.SV_Illumina_Integration/1KGP_3202.Illumina_ensemble_callset.freeze_V1.vcf.gz
(It would be nice if there were something of a standard, since downstream tools could then improve their plotting capabilities across a larger set of VCF files :))