hts-specs
VCF 4.4 END issues
I'm writing a VCFv4.4 parser and I've run into trouble with END.
Existing design:
END is a per-record field due to gVCF-style <*> and compatibility with existing VCF indexing schemes relying on a Number=1 END field. It is directly encoded in the BCF rlen field.
In 4.4, we've moved the "authoritative" SV field to SVLEN. There are still a number of unresolved issues:
- END/SVLEN mismatch
chrA 1 svlen_end_mismatch A <DEL> 0 . SVLEN=5;END=10;
Proposed resolution:
- Deprecate END for symbolic SV alleles and require END to be a computed field for SV symbolic alleles; parsers use SVLEN when there is a mismatch.
- END should still be written for backward compatibility
- Disallow more than one ALT allele for <*> records (prevents loss of information in ALT)
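A minimal Python sketch of the resolution proposed above, in which END is computed from SVLEN and SVLEN wins on a mismatch. The helper names and the `offset` parameter are hypothetical. The default offset of -1 is inferred from the other examples in this thread (POS=10, SVLEN=5, END=14) and should be checked against the spec text before relying on it.

```python
from typing import Optional, Tuple


def end_from_svlen(pos: int, svlen: int, offset: int = -1) -> int:
    """Compute END from POS and SVLEN.

    Assumption: offset=-1 reproduces the examples in this thread
    (POS=10, SVLEN=5 -> END=14). abs() tolerates the negative SVLEN
    values that older files use for deletions.
    """
    return pos + abs(svlen) + offset


def resolve_end(pos: int, svlen: int, written_end: Optional[int]) -> Tuple[int, bool]:
    """Return (end, mismatch); the SVLEN-derived END is always the one returned."""
    computed = end_from_svlen(pos, svlen)
    mismatch = written_end is not None and written_end != computed
    return computed, mismatch


# The svlen_end_mismatch record: POS=1, SVLEN=5, END=10
end, mismatch = resolve_end(1, 5, 10)
print(end, mismatch)  # prints: 5 True
```

A writer following the second bullet would still emit the computed END for older tools.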
- Indexing imprecise SVs
A non-zero CIPOS or CIEND means that an SV can end before/after its index bounds.
Proposed resolution:
- Too difficult to resolve for 4.4. Just update specs to say this is intended behaviour/a known issue.
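To illustrate the known issue, here is a small Python sketch. It assumes a simplified index that stores each record as the closed interval [POS, END]; real tabix/CSI indexes use binning, and both function names are hypothetical.

```python
def index_overlaps(pos, end, qstart, qend):
    """True if a region query [qstart, qend] hits the indexed [POS, END] span."""
    return pos <= qend and qstart <= end


def may_overlap(pos, end, cipos, ciend, qstart, qend):
    """True if the query could overlap the SV once the confidence
    intervals on the start (CIPOS) and end (CIEND) are taken into account."""
    earliest_start = pos + cipos[0]
    latest_end = end + ciend[1]
    return earliest_start <= qend and qstart <= latest_end


# POS=10, END=14, CIPOS=-2,2, CIEND=-2,2, queried with region 15-16
print(index_overlaps(10, 14, 15, 16))                   # prints: False
print(may_overlap(10, 14, (-2, 2), (-2, 2), 15, 16))    # prints: True
```

The index lookup misses a record whose true end could lie inside the queried region.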
- CIEND
If we're deprecating END, we should also deprecate CIEND, and CILEN becomes the authoritative field.
This is technically a lossy operation as the following variants are actually different:
chrA 10 . A <DEL> 0 . SVLEN=5;END=14;CIPOS=-2,2;CILEN=-2,2;CIEND= 0,0
chrA 10 . A <DEL> 0 . SVLEN=5;END=14;CIPOS=-2,2;CILEN= 0,0;CIEND=-2,2
In the former, the END position is known exactly and the length is unclear; in the latter, the length is known exactly but the end position is unclear.
Proposed resolutions:
- Deprecate CIEND and accept the loss of precision. (If you need CIEND then use BND notation, but you lose the ability to specify CILEN. That's a bigger discussion about how to specify how CIPOS and CIEND interact.) Or:
- Make END a Number=A field. This has wider gVCF implications.
In the point 1, I don't understand the proposal to disallow multiple ALT alleles in gVCF fields and this would not be a good thing.
In the point 3, the END tag must not change its definition to Number=A; there is too much software using Number=1.
Easiest seems to be to make END the maximum of SVLEN; in fact this is how it is often used. I cannot comment on the indexing of imprecise SVs.
> In the point 1, I don't understand the proposal to disallow multiple ALT alleles in gVCF fields and this would not be a good thing.
I should have been clearer. The intent is to disallow <*> in conjunction with an SV symbolic allele. The issue is that when you combine
chrA 1 . A <*> 0 . END=10
chrA 1 . A <DEL> 0 . SVLEN=20;END=20
into a single record you get:
chrA 1 . A <*>,<DEL> 0 . SVLEN=.,20;END=20
and lose the END=10 from the <*> allele with no way to recover it. We need a way to prevent such a merge from being valid VCF because it's not a lossless operation.
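The loss can be shown in a few lines of Python. This sketch assumes the merged record's single END is the maximum of the per-allele end positions, as suggested earlier in the thread; the function names are hypothetical.

```python
def merged_record_end(allele_ends):
    """With a Number=1 END, a merged record can keep only one value.
    Assumption: the maximum of the per-allele ends is the one kept."""
    return max(allele_ends)


def merge_is_lossless(allele_ends):
    """The merge loses nothing only if every allele shares the same end."""
    return len(set(allele_ends)) == 1


per_allele_end = {"<*>": 10, "<DEL>": 20}
print(merged_record_end(per_allele_end.values()))       # prints: 20
print(merge_is_lossless(list(per_allele_end.values()))) # prints: False
```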
Having now implemented this in the dev branch of StructuralVariantAnnotation, I think we should keep both CILEN and CIEND. It does mean that records can be inconsistent, but it allows for different left/right bounds (instead of the right bound always being at least the size of the left bound). My preference is for the following clarifications:
- end bounds match the starting CIPOS bounds if neither CILEN nor CIEND is present
- CIEND takes priority over CILEN
- end positions of <INS> events are unaffected by CILEN
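The three clarifications above can be sketched in Python as follows. This is not the StructuralVariantAnnotation implementation; `end_bounds` is a hypothetical helper, and the case where only CILEN is present is not spelled out in the thread, so adding the CILEN bounds to the CIPOS bounds is purely an assumption made for illustration.

```python
def end_bounds(alt, cipos=None, cilen=None, ciend=None):
    """Return the (lower, upper) confidence interval around the end position."""
    cipos = cipos or (0, 0)
    # CIEND takes priority over CILEN.
    if ciend is not None:
        return ciend
    # End positions of <INS> events are unaffected by CILEN.
    if cilen is not None and alt != "<INS>":
        # Assumption: start uncertainty and length uncertainty add up.
        return (cipos[0] + cilen[0], cipos[1] + cilen[1])
    # Neither CILEN nor CIEND applies: end bounds match the CIPOS bounds.
    return cipos


print(end_bounds("<DEL>", cipos=(-2, 2)))                               # prints: (-2, 2)
print(end_bounds("<DEL>", cipos=(-2, 2), cilen=(-2, 2), ciend=(0, 0)))  # prints: (0, 0)
print(end_bounds("<INS>", cipos=(-1, 1), cilen=(-5, 5)))                # prints: (-1, 1)
```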
https://github.com/PapenfussLab/StructuralVariantAnnotation/blob/master/tests/testthat/test-extensions-VCF.R#L455
END clarifications now included in VCF4.4