hts-specs
Definitions of structural and “non-structural” variants in VCF
Originally raised by @julesjacobsen as part of discussions in #465
Has there been any further discussion about this? I'd like to propose a slight shift in the way things are defined, which would mean that the current 'small' and 'structural' split can be removed or unified, leading to a simpler and more robust API.
Definitions
If we define 'variation' as 'a change relative to the reference sequence', then in this context the term 'structural' can be further defined as 'a change in length or arrangement relative to the reference sequence'. Under this definition SNV and MNV are not structural. However, precisely defined, non-symbolic insertions and deletions are 'structural' regardless of their length. To fit with the existing spec there must be a distinction between 'implicit' and 'explicit' structural variation. Here 'implicit' would be a non-symbolic indel, precisely defined in the REF and ALT fields, and 'explicit' would include the extra fields used to define symbolic structural variants, such as SVTYPE, LEN/END and SVLEN.
For example:
#CHROM POS     ID        REF              ALT          QUAL FILTER INFO
1   1   .   A   .   .   PASS    # non-variation
1   1   .   A   T   .   PASS    # an SNV: this isn't a structural variation, but could have an implicit SVTYPE=SNV
1   1   .   A   TC  .   PASS    # an implicit `INS`; here `SVTYPE`, `LEN` and `SVLEN` can all be calculated from the REF and ALT alleles
1   1   .   AC  T   .   PASS    # an implicit `DEL`; here `SVTYPE`, `LEN` and `SVLEN` can all be calculated from the REF and ALT alleles
1   2827694 rs2376870   CGTGGATGCGGGGAC C   .   PASS    SVTYPE=DEL;LEN=15;SVLEN=-14 # an explicit, non-symbolic, precise deletion
2   321682  .   T   <DEL>   .   PASS    SVTYPE=DEL;LEN=206;SVLEN=-205;CIPOS=-56,20;CIEND=-10,62 # explicit, symbolic, imprecise deletion
So the only real difference here is that of the implicit and explicit declaration of the SVTYPE, LEN and SVLEN. Explicit types must be used with symbolic alleles and can be used with non-symbolic alleles.
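As a rough illustration of the implicit case, here is a minimal Python sketch that derives a type and SVLEN from REF and ALT alone. The names `ImplicitType` and `implicit_type` are hypothetical and not part of any library, and the classification is purely length-based, as in the examples above. LEN is left out because its exact definition is not spelled out here.

```python
from typing import NamedTuple, Optional


class ImplicitType(NamedTuple):
    svtype: str  # "NO_VARIATION", "SNV", "MNV", "INS" or "DEL"
    svlen: int   # len(ALT) - len(REF); negative for deletions


def implicit_type(ref: str, alt: str) -> Optional[ImplicitType]:
    """Derive the implicit type of a non-symbolic allele from REF and ALT.

    Returns None for symbolic, breakend and '*' alleles, where the type
    cannot be computed from the allele strings and must be explicit.
    """
    if alt == "*" or alt.startswith("<") or "[" in alt or "]" in alt:
        return None
    if alt in (".", ref):
        return ImplicitType("NO_VARIATION", 0)
    svlen = len(alt) - len(ref)
    if svlen > 0:
        return ImplicitType("INS", svlen)
    if svlen < 0:
        return ImplicitType("DEL", svlen)
    # Same length but different bases: a substitution.
    return ImplicitType("SNV" if len(ref) == 1 else "MNV", 0)
```

Applied to the rs2376870 row above, `implicit_type("CGTGGATGCGGGGAC", "C")` gives `DEL` with an SVLEN of -14, matching the explicit annotation.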
All the other symbolic alleles follow the existing specifications. In order to maintain consistency between symbolic and non-symbolic types using the SVTYPE key, the values SNV and MNV would need adding to the reserved set. Strictly, SVTYPE ought to be renamed TYPE to be able to cover all the types of variation so applications can treat variation in a uniform manner, but it's workable without doing this.
The upshot of this is a clean and consistent API in things like the HTSJDK, where a client currently needs to call VariantContext.getType() and getStructuralType() and then branch on the return values. For example, getType() returns INDEL or MIXED when the variant is not an SNV/MNV, and getStructuralType() returns null when the ALT allele isn't symbolic. Furthermore, it is more in line with the way variation is reported in HGVS, at least in the way they categorise some variation (e.g. Substitution, Deletion, Insertion, Inversion), which marginally helps applications ingesting VCF and reporting findings in HGVS.
Does this sound workable or sheer lunacy?
I think we should be clear in the specifications how symbolic structural variants should be interpreted but not go as far as defining these fields for non-symbolic variants (even implicitly).
While an indel is technically a structural variant and thus would have an implicit type, we shouldn't define this in the specifications. Parsers and libraries are free to create a unified API, and that would be useful to end users, but it is outside the scope of the specifications.
The key problem with defining these in the specifications is that the list of symbolic structural alleles in the specification is not exhaustive. Take for example REF=ATATT ALT=AGG. This deletion-with-insertion event does not have a clean mapping to a symbolic structural representation in VCF. If we say that all ALT alleles that involve changes in overall length have a symbolic equivalent, we need to expand the specification to handle these edge cases, something that would significantly complicate the specifications and that I would very much rather not happen.
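To make the edge case concrete, here is a small Python sketch that trims the bases REF and ALT share and checks whether anything is left on both sides. The helper names `trim_shared` and `is_delins` are made up for illustration, and trimming the prefix before the suffix is one arbitrary choice among several possible normalisations.

```python
from typing import Tuple


def trim_shared(ref: str, alt: str) -> Tuple[str, str]:
    """Strip the bases REF and ALT share at the start, then at the end."""
    start = 0
    while start < min(len(ref), len(alt)) and ref[start] == alt[start]:
        start += 1
    ref, alt = ref[start:], alt[start:]
    end = 0
    while end < min(len(ref), len(alt)) and ref[-1 - end] == alt[-1 - end]:
        end += 1
    return ref[:len(ref) - end], alt[:len(alt) - end]


def is_delins(ref: str, alt: str) -> bool:
    """True when bases are both removed and added, with a net length change."""
    removed, added = trim_shared(ref, alt)
    return bool(removed) and bool(added) and len(removed) != len(added)
```

For REF=ATATT ALT=AGG the trimmed alleles are TATT and GG, so `is_delins` is true: a purely length-based classifier would call this a DEL, but no single reserved symbolic allele describes it. The same check also flags the A to TC row from the earlier example, since its first base differs.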
I see symbolic structural alleles as merely a convenience for analysis of the common, simple structural variants. As soon as you start analysing complex variants, you almost invariably end up using a breakpoint graph representation (either explicitly or implicitly) and everything gets converted to BND/CNV anyway.
> Strictly SVTYPE ought to be renamed TYPE to be able to cover all the types of variation
I'd rather just deprecate SVTYPE. It provides no additional information over ALT and just bloats the specifications.
Removing SVTYPE works for me - I've been relying on the ALT allele to determine the symbolic allele type. However, would you agree that there needs to be a reserved set of core SV symbols allowed for the ALT allele? Am I right in thinking that these are currently chosen by the application producing the VCF rather than being defined in the specification? They can be application-dependent, e.g. STR:56, <TRA>, which makes it hard for software ingesting a VCF file to know what it might find and how to handle it. Defining a core set would make this a lot easier.
The reserved symbolic structural variant alleles are already defined in S1.4.5.
We do need to remove BND as <BND> was never intended to be a valid symbolic allele.
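For software ingesting VCF, a check against the reserved set might look like the following Python sketch. The contents of `ASSUMED_RESERVED` are an assumption made for illustration and should be taken from the specification section cited above rather than from this snippet. The function names are hypothetical as well.

```python
from typing import Optional

# Assumed core set, for illustration only; the authoritative list is the
# one in the VCF specification, not this constant.
ASSUMED_RESERVED = frozenset({"DEL", "INS", "DUP", "INV", "CNV"})


def symbolic_top_level(alt: str) -> Optional[str]:
    """Return the top-level type of a symbolic allele such as <DEL:ME>.

    Returns None when ALT is not of the form <...>.
    """
    if len(alt) < 3 or not (alt.startswith("<") and alt.endswith(">")):
        return None
    # Subtypes follow a colon, e.g. <DUP:TANDEM>; keep only the first level.
    return alt[1:-1].split(":", 1)[0]


def is_reserved(alt: str) -> bool:
    """True when ALT is symbolic and its top-level type is in the assumed set."""
    top = symbolic_top_level(alt)
    return top is not None and top in ASSUMED_RESERVED
```

With this set, `<DEL>` and `<DEL:ME>` pass, while application-specific alleles such as `<TRA>` or `<STR:56>` are reported as non-reserved, so a reader can decide how to handle them instead of failing silently.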
> The reserved symbolic structural variant alleles are already defined in S1.4.5.
Oops. Sorry, should RTFM more carefully. This is all good then.