gatk-sv END fields

END fields

Open mwalker174 opened this issue 1 year ago • 1 comments

The END info field is currently misused throughout much of the pipeline. According to the VCF specification, the END tag is defined as: end position of variant described in this record (for use with symbolic alleles) This is somewhat vague, but it also defines this key property: For precise variants, END is POS + length of REF allele - 1 By convention, for SVs we use symbolic ALT alleles (e.g. '<DEL>') and simply define the REF allele as a single base. For intra-chromosomal spanning events such as DEL, DUP, and INV, we intentionally defy the second definition and use the first. The spec gives this example:

2 321682 . T <DEL> 6 PASS SVTYPE=DEL;END=321887;SVLEN=-205;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12

That is END=POS+|SVLEN|. Note also that SVLEN is defined as the difference between ALT and REF allele sizes which is negative for deletions, but in our pipeline we take the absolute value. This is all fine for our purposes and doesn't need to be changed; I am simply providing context here.

For INS, the spec states that END should always be equal to POS (POS+1-1) but we use the convention END=POS+1. This is also OK, since some tools need the non-zero length for overlap computations. Also note SVLEN should be the length of the insertion if known (we use the value -1 when it is unknown, which we should also fix but would be the issue of a separate ticket).

However, for other "point" variants including BND, CTX, and sometimes CPX, our use of end is inconsistent and has led to a number of bugs in the past. Currently, END is incorrectly used for the coordinate of the second breakend of the variant. For example, if there is a BND specifying a break from chr1:100 to chr1:200, our VCF will have CHR=chr1, POS=100, CHR2=chr1, and END=200. This convention is problematic, however, for inter-chromosomal events. For example, a BND from chr:100 to chr2:50 will then have END<POS, which causes some tools to crash or silently fix the END tag automatically (as in some versions of pysam).

Proposal: All BND and CTX records should have END=POS+1 and use CHR2/END2 to define the other end of the breakpoint (even if CHR=CHR2). The same convention should be enforced for CPX event types that use CHR2. Note that this will require changes to many parts of the code base, including not only changing the format but also computations performed with these SV types such as overlap checking. I would suggest to start at the end of the pipeline (right before FixEndsRescaleGq in CleanVcfChromosome which corrects END for the final "cleaned" vcf) and work backwards to ensure parity with the current behavior, modulo any bug fixes encountered along the way.

Oct 05 '22 16:10 mwalker174

gatk-sv gatk-sv copied to clipboard

END fields

gatk-sv
gatk-sv copied to clipboard