Clarify padding base for symbolic alleles

Open vcfer opened this issue 3 years ago • 3 comments

In section 2 of VCF 4.4 it is described that:

ALT haplotypes are constructed from the REF haplotype by taking the REF allele bases at the POS in the reference genotype and replacing them with the ALT bases. In essence, the VCF record specifies a-REF-t and the alternative haplotypes are a-ALT-t for each alternative allele.

So the general idea is that an ALT allele is specified for the entire REF locus (including the first position).
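As a minimal sketch of that substitution rule for non-symbolic alleles (the function name `alt_haplotype` and the toy reference `GGGG` are illustrative assumptions, not taken from the specification):

```python
def alt_haplotype(reference: str, pos: int, ref: str, alt: str) -> str:
    """Build an ALT haplotype by replacing the REF bases at 1-based POS with ALT."""
    start = pos - 1  # convert the 1-based VCF POS to a 0-based string index
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF does not match the reference sequence at POS")
    # a-REF-t becomes a-ALT-t: keep the flanks, swap the middle
    return reference[:start] + alt + reference[start + len(ref):]


# Hypothetical 4-base reference used only for illustration
reference = "GGGG"
print(alt_haplotype(reference, 1, "G", "GA"))  # GAGGG
print(alt_haplotype(reference, 1, "G", "CA"))  # CAGGG
```

Here the ALT string covers the whole REF locus, including the base at POS, which is the behaviour the question contrasts with padded symbolic alleles.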

This is also the case for the symbolic allele <*>, but that is stated to be an exception in section 1.6.1 REF:

If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String “<ID>”) then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. The exception to this is the <*> symbolic allele for which the reference call interval includes the POS base.

So is it now a general rule for symbolic alleles that they only concern sequence downstream of the first position?

For example, the symbolic allele <R> defined in section 1.4.5 IUPAC ambiguity codes, does it concern the position POS+1?

Instead of symbolic alleles having a distinct definition of which locus the ALT alleles describe, the structural variant symbolic alleles could be defined to include the padding base.

Considering the following alignment, how can the three alternative alleles be specified as symbolic <INS>:

-G-GGG reference
-GAGGG alt1
AG-GGG alt2
-CAGGG alt3

One way would be to allow the same approach as for breakends, so it would be possible to write:

POS REF ALT
1 G G<INS>,<INS>G,C<INS>

and if no base is specified next to a structural variant symbolic allele it can be interpreted as G<INS>.

vcfer, Mar 10 '23 12:03

G<INS>,<INS>G,C<INS>

These aren't valid ALT alleles. The wording of that particular sentence in 1.6.1.5 was indeed unclear, but it was updated in 4.4 to explicitly separate the "strings of bases" from symbolic alleles with a semicolon to prevent confusion.

d-cameron, Jun 20 '23 13:06

This is also the case for the symbolic allele <*>, but that is stated to be an exception in section 1.6.1 REF:

The way it is currently worded does indeed make IUPAC symbolic alleles unusable, since a padding base is required yet there's no way to include the padding base in the IUPAC ALT allele.

Currently symbolic alleles can be any of:

  • Structural symbolic alleles. (<DEL>, <DUP>, <INV>, <INS>, <CNV> and any subtype of these)
  • Symbolic star allele <*> (alias <NON_REF>)
  • IUPAC symbolic alleles
  • Breakpoint assemblies (1.4.6/5.4.1, e.g. <ctg1>)
  • Implementation-defined wildcards (e.g. <NON_REF> before it was specifications-defined)

Should a padding base be used?

Type                    Expected      Current behaviour
Structural              Padding       Padding (default)
Structural              No Padding    No padding (exception)
IUPAC                   No Padding    Padding - Broken
Assembly alias          ?             ?
Implementation-defined  ?             ?
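The current wording can be sketched as a small classifier. This is only an illustration: the names `classify` and `pos_is_padding` are hypothetical, and the set of single-letter codes is the standard IUPAC nucleotide ambiguity alphabet, which is assumed here rather than copied from section 1.4.5.

```python
import re

# Categories from the list above; subtype suffixes such as <DEL:ME> are handled below
STRUCTURAL = {"DEL", "DUP", "INV", "INS", "CNV"}
STAR = {"*", "NON_REF"}
# Assumption: standard IUPAC nucleotide ambiguity codes
IUPAC_AMBIGUITY = set("RYSWKMBDHVN")


def classify(alt: str) -> str:
    """Return a coarse category for an ALT allele string."""
    match = re.fullmatch(r"<([^<>]+)>", alt)
    if match is None:
        return "not symbolic"
    ident = match.group(1)
    if ident in STAR:
        return "star"
    if ident.split(":")[0] in STRUCTURAL:
        return "structural"
    if ident in IUPAC_AMBIGUITY:
        return "IUPAC"
    return "other"  # assembly contigs and implementation-defined alleles


def pos_is_padding(alt: str) -> bool:
    """As currently worded: POS is a padding base for every symbolic allele except <*>."""
    kind = classify(alt)
    if kind == "not symbolic":
        raise ValueError("padding rule only applies to symbolic alleles")
    return kind != "star"


for allele in ["<DEL>", "<DUP:TANDEM>", "<*>", "<R>", "<ctg1>"]:
    print(allele, classify(allele), pos_is_padding(allele))
```

Under this reading `<R>` reports a padding base, which is the "Broken" row: the rule applies, but the allele has no way to carry the padding base.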

The <ctg1> in the examples in 5.4.2 uses padding, but an argument can be made that assembly contig symbolic alleles are most useful if they can be used as literal aliases to the sequence. That is, one should be able to look into the ##assembly fasta file and replace <ctg1> with the sequence of >ctg1. The most general version of this approach is to use no padding, allow mixing of symbolic contig and non-symbolic bases, and allow subsetting, thus allowing an ALT of something like C<ctg1:1-100>T<ctg1:100-200>GG. I'm not sure if either implementers or users have the appetite for this though.

Proposed solution:

  • Change default to no padding
  • Make exception for structural symbolic alleles
  • (?) Make exception for ##assembly contig alleles(?)

While we're at it, we should also:

  • Explicitly state what symbolic alleles are reserved
  • Require an ALT header for all implementation-defined symbolic alleles
  • Require a ##assembly line if symbolic contigs are used anywhere

d-cameron, Jun 20 '23 14:06

Two comments for consideration:

1 - The chosen solution should allow structural symbolic alleles to cover the first position of a chromosome. To accomplish this, we could reserve a flag (e.g. EXACTSVPADDING) to specify that symbolic structural variants in that record must be interpreted as having the exact padding specified, i.e. <DUP> is interpreted as <DUP>, as opposed to when the flag is absent and <DUP> is interpreted as N<DUP>. The presence of the flag indicates for <INS> that the insertion is between POS-1 and POS.

2 - In VCF v4.3, an example in section 5.4.2 uses explicit base padding with an assembly contig symbolic allele in the ALT column:

Note: In the special case of the complete insertion of a sequence between two base pairs, it is recommended to use the shorthand notation described above:

#CHROM POS ID REF ALT QUAL FILTER INFO
13 321682 INS0 T C<ctg1> 6 PASS SVTYPE=INS
so, although quite hidden, mixing of symbolic contig and non-symbolic bases is to some extent allowed in versions earlier than v4.4.

vcfer, Jun 20 '23 22:06

Considering the following alignment, how can the three alternative alleles be specified as symbolic <INS>:

 1 234 ref POS
-G-GGG ref
-GAGGG alt1
AG-GGG alt2
-CAGGG alt3

These are encoded as:

ref 1 alt1 G <INS> SVLEN=1
ref 0 alt2 N <INS> SVLEN=1
ref 1 alt3 G <INS> SVLEN=1

Telomeric POS 0 is defined in 1.6.1.2: "Telomeres are indicated by using positions 0 or N+1".

If you want to encode the sequence (or the SNV in alt3), you need to use non-symbolic alleles.
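A rough sketch of how the padded symbolic <INS> fields above can be derived, assuming the insertion is described by the number of reference bases that precede it (the helper `symbolic_ins_record` and the reference `GGGG` are illustrative only):

```python
def symbolic_ins_record(reference: str, bases_before: int, svlen: int):
    """Return (POS, REF, ALT, INFO) for a padded symbolic insertion.

    bases_before is the count of reference bases to the left of the
    insertion point; 0 means the insertion precedes the first base.
    """
    if not 0 <= bases_before <= len(reference):
        raise ValueError("insertion point lies outside the reference")
    pos = bases_before  # POS is the padding base preceding the insertion
    # At the telomeric position 0 there is no reference base, so REF is N
    ref = reference[pos - 1] if pos >= 1 else "N"
    return pos, ref, "<INS>", f"SVLEN={svlen}"


reference = "GGGG"
print(symbolic_ins_record(reference, 1, 1))  # alt1: (1, 'G', '<INS>', 'SVLEN=1')
print(symbolic_ins_record(reference, 0, 1))  # alt2: (0, 'N', '<INS>', 'SVLEN=1')
```

The record for alt3 is the same as for alt1, because the symbolic allele cannot express the G>C substitution at the padding base.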

For example, the symbolic allele <R> defined in section 1.4.5 IUPAC ambiguity codes, does it concern the position POS+1?

As written, yes it does. <*> is the only exception for having POS included in the span of the symbolic allele definition.

There are currently no plans to allow mixing of symbolic alleles with sequence strings or to redefine the padding base out of symbolic alleles. It could have been done in 4.4 when <*> got defined to include it, but there wasn't any feedback like this during the public consultation period.

Public consultation period for VCF 4.5 starts tomorrow so, if you have the time, please review it for any issues so we can fix them before it's finalised.

d-cameron, Apr 20 '24 11:04