
VCF 4.4 END issues

Open d-cameron opened this issue 2 years ago • 3 comments

I'm writing up a VCFv4.4 parser and I've run into trouble with END.

Existing design:

END is a per-record field because of gVCF-style <*> alleles and compatibility with existing VCF indexing schemes that rely on a Number=1 END field. It is directly encoded in the BCF rlen field.

In 4.4, we've moved the "authoritative" SV field to SVLEN. There are still a number of unresolved issues:

  1. END/SVLEN mismatch
chrA	1	svlen_end_mismatch	A	<DEL>	0	.	SVLEN=5;END=10;

Proposed resolution:

  • Deprecate END for symbolic SV alleles: require END to be a computed value for symbolic SV alleles, and have parsers use SVLEN when there is a mismatch.
  • END should still be written for backward compatibility.
  • Disallow more than one ALT allele for <*> records (prevents loss of information in ALT).
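
A minimal sketch of the "SVLEN wins on mismatch" rule, applied to the record above. The function names are hypothetical, and the convention END = POS + SVLEN is an assumption made for illustration (some example records in this thread use a different offset), so adjust it to whatever the spec text finally settles on.

```python
from typing import Dict, Tuple


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO string into a key -> value mapping."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate a trailing ';'
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def resolve_end(pos: int, info: str) -> Tuple[int, bool]:
    """Return (end, mismatch), always trusting SVLEN over a written END.

    Assumption: END = POS + SVLEN for a symbolic SV allele.
    """
    fields = parse_info(info)
    computed = pos + abs(int(fields["SVLEN"]))
    written = fields.get("END")
    mismatch = written is not None and int(written) != computed
    return computed, mismatch


# The mismatching record from this issue: POS=1, SVLEN=5, END=10.
print(resolve_end(1, "SVLEN=5;END=10;"))
```

A parser following this rule would report the mismatch as a warning and carry on with the SVLEN-derived value, while a writer would still emit END for older tools.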
  2. Indexing imprecise SVs

A non-zero CIPOS or CIEND means that an SV can end before or after its index bounds. Proposed resolution:

  • Too difficult to resolve for 4.4. Just update specs to say this is intended behaviour/a known issue.
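
To make the known issue concrete, here is a small sketch with hypothetical coordinates. It assumes an index keyed only on [POS, END] and shows a query that the index misses even though the confidence intervals allow the variant to reach it. All names and values are illustrative.

```python
def indexed_span(pos, end):
    """Interval an index keyed on POS/END would store."""
    return (pos, end)


def possible_span(pos, end, cipos=(0, 0), ciend=(0, 0)):
    """Widest interval the variant could occupy given CIPOS/CIEND."""
    return (pos + cipos[0], end + ciend[1])


def overlaps(a, b):
    """Closed-interval overlap test."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical imprecise deletion and a query just past its END.
pos, end, cipos, ciend = 10, 14, (-2, 2), (-2, 2)
query = (15, 16)

print(overlaps(indexed_span(pos, end), query))                 # index lookup
print(overlaps(possible_span(pos, end, cipos, ciend), query))  # true extent
```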
  3. CIEND

If we're deprecating END, we should also deprecate CIEND, and CILEN becomes the authoritative field. This is technically a lossy operation, as the following variants are actually different:

chrA	10	.	A	<DEL>	0	.	SVLEN=5;END=14;CIPOS=-2,2;CILEN=-2,2;CIEND= 0,0
chrA	10	.	A	<DEL>	0	.	SVLEN=5;END=14;CIPOS=-2,2;CILEN= 0,0;CIEND=-2,2

In the former, the END position is known exactly and the length is unclear, and in the latter the length is known exactly but the end position is unclear.
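
The difference can be spelled out by expanding each confidence interval around its point estimate. This is only an illustrative sketch: the helper names are hypothetical and the values are copied from the two records above.

```python
def interval(value, ci):
    """Apply a (low, high) confidence-interval offset to a point estimate."""
    return (value + ci[0], value + ci[1])


def summarize(rec):
    """Possible END positions and possible lengths for one record."""
    return {
        "end": interval(rec["END"], rec["CIEND"]),
        "length": interval(rec["SVLEN"], rec["CILEN"]),
    }


former = {"SVLEN": 5, "END": 14, "CILEN": (-2, 2), "CIEND": (0, 0)}
latter = {"SVLEN": 5, "END": 14, "CILEN": (0, 0), "CIEND": (-2, 2)}

print(summarize(former))  # exact end, uncertain length
print(summarize(latter))  # uncertain end, exact length
```

Dropping CIEND from both records would leave no field that distinguishes the exact end of the first record from the uncertain end of the second.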

Proposed resolutions:

  • Deprecate CIEND and accept the loss of precision (if you need CIEND, use BND notation; you then lose the ability to specify CILEN, but that is a bigger discussion about how to specify how CIPOS and CIEND interact), or
  • Make END a Number=A field. This has wider gVCF implications.

d-cameron, Dec 13 '21 04:12

In the point 1, I don't understand the proposal to disallow multiple ALT alleles in gVCF fields and this would not be a good thing.

In the point 3, the END tag must not change its definition to Number=A; there is too much software using Number=1.

Easiest seems to be to make END the maximum of SVLEN; in fact, this is how it is often used. I cannot comment on the indexing of imprecise SVs.

pd3, Dec 13 '21 09:12

> In the point 1, I don't understand the proposal to disallow multiple ALT alleles in gVCF fields and this would not be a good thing.

I should have been clearer. The intent is to disallow <*> in conjunction with an SV symbolic allele. The issue is that when you combine

chrA	1	.	A	<*>	0	.	END=10
chrA	1	.	A	<DEL>	0	.	SVLEN=20;END=20

into a single record you get:

chrA	1	.	A	<*>,<DEL>	0	.	SVLEN=.,20;END=20

and lose the END=10 from the <*> allele, with no way to recover it. We need a way to prevent such a merge from being valid VCF, because it's not a lossless operation.
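
A small sketch of why the merge is lossy. It assumes the merged Number=1 END is taken as the maximum of the input END values, which matches the merged record shown above; the dictionary layout and function name are purely illustrative.

```python
def merge_records(records):
    """Merge same-position records into one multi-allelic record.

    Assumption: END is Number=1, so only the maximum END survives.
    """
    return {
        "alt": [r["alt"] for r in records],
        "svlen": [r.get("svlen") for r in records],  # None plays the role of '.'
        "end": max(r["end"] for r in records),
    }


star = {"alt": "<*>", "end": 10}
deletion = {"alt": "<DEL>", "svlen": 20, "end": 20}

merged = merge_records([star, deletion])
print(merged)

# The <DEL> span can be recomputed from SVLEN, but the <*> allele has
# no SVLEN, so its original END=10 cannot be recovered from `merged`.
```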

d-cameron, Dec 13 '21 09:12

Having now implemented this into the dev branch of StructuralVariantAnnotation I think we should keep both CILEN and CIEND. It does mean that records can be inconsistent, but it allows for different left/right bounds (instead of the right bound always being at least the size of the left bound). My preference is for the following clarifications:

  • end bounds match the starting CIPOS bounds if neither CILEN nor CIEND is present
  • CIEND takes priority over CILEN
  • the end position of <INS> events is unaffected by CILEN

https://github.com/PapenfussLab/StructuralVariantAnnotation/blob/master/tests/testthat/test-extensions-VCF.R#L455
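
The three clarifications above can be sketched as a precedence function. This is not the StructuralVariantAnnotation implementation: the function name is hypothetical, and the way CILEN is combined with CIPOS when CIEND is absent (adding the bounds) is an assumption, since the thread does not spell that case out.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]


def end_bounds(svtype: str,
               cipos: Interval,
               cilen: Optional[Interval] = None,
               ciend: Optional[Interval] = None) -> Interval:
    """Confidence interval around the end position, as (low, high) offsets."""
    if ciend is not None:
        return ciend  # CIEND takes priority over CILEN
    if cilen is not None and svtype != "INS":
        # Assumption: start and length uncertainty simply add up.
        return (cipos[0] + cilen[0], cipos[1] + cilen[1])
    # Neither field applies (or <INS>): fall back to the CIPOS bounds.
    return cipos


print(end_bounds("DEL", (-2, 2)))
print(end_bounds("DEL", (-2, 2), cilen=(-1, 1), ciend=(0, 0)))
print(end_bounds("INS", (-2, 2), cilen=(-5, 5)))
```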

d-cameron, Dec 16 '21 10:12

END clarifications now included in VCF4.4

d-cameron, Aug 22 '22 05:08