
VCF specs - alternative to duplicate <CNV> ALT alleles?

Open chrchang opened this issue 9 months ago • 4 comments

When reviewing the current version of the VCF 4.5 specification, I noticed that the first example given under "5.6 Representing copy number variation" was

chr1 100 . T <CNV>,<CNV> . . SVLEN=30,30;CN=1,2 GT:CN 1/2:3

where the ALT alleles cannot be distinguished without looking at the (optional!) INFO/CN entry.

This can be a major headache for data management. Can we define an alternate representation of this with unique allele codes?
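To make the ambiguity concrete, here is a minimal Python sketch (not part of the spec; the helper name `parse_alts_and_info` is made up for illustration). It splits the example record on whitespace as quoted above, whereas a real VCF line is tab-delimited. It assumes, as in the example, that INFO/CN carries one value per ALT allele.

```python
from collections import Counter

# Example record from this issue, whitespace-separated as quoted here.
record = "chr1 100 . T <CNV>,<CNV> . . SVLEN=30,30;CN=1,2 GT:CN 1/2:3"

def parse_alts_and_info(line):
    """Return (alts, info) for one data line; INFO values stay as strings."""
    fields = line.split()
    alts = fields[4].split(",")
    info = {}
    for entry in fields[7].split(";"):
        key, _, value = entry.partition("=")
        info[key] = value
    return alts, info

alts, info = parse_alts_and_info(record)

# ALT strings that occur more than once cannot be told apart by name alone.
duplicated = [alt for alt, count in Counter(alts).items() if count > 1]

# The only thing separating the two alleles is the optional INFO/CN entry,
# matched to ALT by position.
cn_per_alt = info["CN"].split(",") if "CN" in info else None

print(duplicated, cn_per_alt)
```

If INFO/CN is absent, `cn_per_alt` is `None` and nothing in the record distinguishes the two `<CNV>` alleles.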

chrchang avatar Mar 23 '25 23:03 chrchang

> This can be a major headache for data management. Can we define an alternate representation of this with unique allele codes?

What is the data management issue that this introduces? Is it the assumption that ALT is unique for each record? Multiple identical ALT alleles already happen for other symbolic alleles (e.g. two <DEL>s that start at the same position and end at different positions), so this isn't new for 4.5.

> Can we define an alternate representation of this with unique allele codes?

Implementations are free to use subtypes as they see fit. For example, some groups redundantly encode the actual ASCN (allele-specific copy number) in the ALT field:

chr1 100 . T <CNV:CN1>,<CNV:CN2> . . SVLEN=30,30;CN=1,2 GT:CN 1/2:3
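A hedged Python sketch of that rewrite follows. The function `add_cn_subtypes` is hypothetical, not an existing tool or library call. It assumes INFO/CN has one value per ALT allele and leaves the record unchanged when it does not. The `<CNV:CNn>` naming follows the example above and is a convention, not something the spec mandates.

```python
def add_cn_subtypes(line, sep=" "):
    """Rewrite bare <CNV> ALT alleles as <CNV:CNn> using INFO/CN.

    `sep` is the field delimiter: a space for the example quoted in this
    thread, "\t" for a real VCF line.
    """
    fields = line.split(sep)
    alts = fields[4].split(",")
    # Build a key -> value map from INFO; flag entries get an empty value.
    info = dict(entry.partition("=")[::2] for entry in fields[7].split(";"))

    if "CN" not in info:
        return line  # nothing to disambiguate with
    cns = info["CN"].split(",")
    if len(cns) != len(alts):
        return line  # CN does not line up with ALT; leave the record alone

    fields[4] = ",".join(
        f"<CNV:CN{cn}>" if alt == "<CNV>" and cn != "." else alt
        for alt, cn in zip(alts, cns)
    )
    return sep.join(fields)

record = "chr1 100 . T <CNV>,<CNV> . . SVLEN=30,30;CN=1,2 GT:CN 1/2:3"
print(add_cn_subtypes(record))
```

Only the ALT column changes; INFO/CN is kept, so the copy number stays available to tools that ignore the subtype.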

d-cameron avatar Mar 24 '25 02:03 d-cameron

> > This can be a major headache for data management. Can we define an alternate representation of this with unique allele codes?

> What is the data management issue that this introduces? Is it the assumption that ALT is unique for each record? Multiple identical ALT alleles already happen for other symbolic alleles (e.g. two <DEL>s that start at the same position and end at different positions), so this isn't new for 4.5.

Two operations that are affected by this are (i) allele frequency import and (ii) dataset merge. Yes, this applies to multiple-<DEL> as well.
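For the allele frequency import case, a toy Python sketch shows the failure mode. It assumes a table keyed on (CHROM, POS, REF, ALT), which is a common design but not something the spec requires. The function `import_allele_freqs` and the frequency values are invented for illustration.

```python
def import_allele_freqs(rows):
    """Load (chrom, pos, ref, alt, af) rows into a dict keyed by allele.

    Returns the table and the list of keys that were seen more than once.
    """
    table = {}
    collisions = []
    for chrom, pos, ref, alt, af in rows:
        key = (chrom, pos, ref, alt)
        if key in table:
            collisions.append(key)  # a later row overwrites an earlier one
        table[key] = af
    return table, collisions

# Two alleles with identical ALT strings: the second overwrites the first.
ambiguous = [
    ("chr1", 100, "T", "<CNV>", 0.10),  # illustrative frequency
    ("chr1", 100, "T", "<CNV>", 0.25),  # illustrative frequency
]
table, collisions = import_allele_freqs(ambiguous)
print(len(table), collisions)

# With unique allele codes, both frequencies survive the import.
unique = [
    ("chr1", 100, "T", "<CNV:CN1>", 0.10),
    ("chr1", 100, "T", "<CNV:CN2>", 0.25),
]
table_unique, collisions_unique = import_allele_freqs(unique)
print(len(table_unique), collisions_unique)
```

A dataset merge that matches alleles across files by the same key hits the same collision.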

> > Can we define an alternate representation of this with unique allele codes?

> Implementations are free to use subtypes as they see fit. For example, some groups redundantly encode the actual ASCN (allele-specific copy number) in the ALT field:

> chr1 100 . T <CNV:CN1>,<CNV:CN2> . . SVLEN=30,30;CN=1,2 GT:CN 1/2:3

Ok, I will recommend this approach to others when the problem comes up.

chrchang avatar Mar 24 '25 03:03 chrchang

At least one other way I've seen this encoded "in the wild" is the 1000 Genomes SV VCF, which adds "RD_CN" to the per-sample genotype fields:

https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20210124.SV_Illumina_Integration/1KGP_3202.Illumina_ensemble_callset.freeze_V1.vcf.gz

cmdcolin avatar May 06 '25 16:05 cmdcolin

(it would be nice if there were somewhat of a standard as downstream tools can improve plotting capabilities on a larger set of VCF files that way :))

cmdcolin avatar May 06 '25 16:05 cmdcolin