hts-specs
spec for short tandem repeat variants
Hello, I am trying to write several tools as part of a program for interpreting STR variants. Several widely used variant callers (ExpansionHunter by Illumina, lobSTR, etc.) currently emit different VCF formats. It would be great to have a unifying spec that can serve as a baseline or guideline. Is it possible to add this to the next VCF spec?
The convention I would suggest:
chr - chromosome ('chr3')
pos - position of nucleotide before repeat begins (123)
ID - any string or int
ref - nucleotide in POS position (A)
alt - <STRn> where n is the number of repeats (<STR20>)
QUAL - any string or int
FILTER - any string or int
INFO
FORMAT
SAMPLE
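To make the suggested convention concrete, here is a minimal Python sketch of how a tool might read the repeat count out of the proposed ALT value. The `<STRn>` form is only the proposal in this thread, not part of any VCF spec, and the names `STR_ALT` and `parse_str_alt` as well as the example record are hypothetical.

```python
import re

# Hypothetical pattern for the proposed <STRn> symbolic ALT (not in any VCF spec).
STR_ALT = re.compile(r"^<STR(\d+)>$")


def parse_str_alt(alt):
    """Return the repeat count from a proposed <STRn> ALT, or None if it does not match."""
    match = STR_ALT.match(alt)
    return int(match.group(1)) if match else None


# Illustrative record using the columns listed above; the values are made up.
line = "chr3\t123\t.\tA\t<STR20>\t.\t.\t.\t.\t."
fields = line.split("\t")
count = parse_str_alt(fields[4])  # fields[4] is the ALT column
```

Note that this form carries only a count, so the repeat unit would still have to come from somewhere else, such as an INFO tag.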
I've also been thinking about this for the purposes of deciding on an output VCF format for STRling (https://github.com/quinlan-lab/STRling). Adding to this, it would need the repeat unit(s) and ideally some consistent way of representing uncertainty or a range in the number of repeat units.
I think there would be little appetite for adding a new different style of value for the ALT field, as that is something that would need to be implemented in all VCF parser implementations. On the other hand, a proposal for common INFO tags to represent the details of STRs would likely be welcomed — e.g., lobSTR's RU INFO tag appears to be a fairly appropriate candidate for blessing in the VCF spec.
See the example in VCFv4.3 §5.3 for the flavour of how tandem duplication variants are represented in REF/ALT currently. See also (parts of) PR #465 for the ongoing work of improving this — you may be interested in contributing to the discussion there (and in associated issues & PRs).
Hi @jmarshall, thank you for the reply!
I have started reading the PR and related discussions you suggested. I agree that many of the SV suggestions can work well for STR variants, but I wonder whether STR variants are not widely considered distinct from SV <DUP> variants in the bioinformatics community, in which case they should be differentiated in some way in the spec.
Also, the way tandem duplication variants are currently represented in that example is problematic in my opinion, mainly because of the long ALT. An ALT option similar to those we know from SV variants (i.e. <STR>) would be more appropriate.
This seems like a great topic for GA4GH VRS/VCF alignment. While VRS currently handles this with Repeated Sequence Expressions, there is an open issue about how to represent compound repeated sequences and it would be great to align any VRS and VCF solutions to this challenge if possible.
h/t to @rhdolin for connecting the dots.
Thanks everyone for the interesting discussion on the call just now — I for one think I have a better understanding of the issues than before! It might be useful if @Talya-dor (and anyone else who'd like to) could add here some examples of the sorts of things they want to represent, to remind us of the examples discussed on the call.
For the non-expert such as myself, the Kutner document linked from ga4gh/vrs#363 (as mentioned in https://github.com/samtools/hts-specs/issues/619#issuecomment-1009244714) is a very informative primer.
For current VCF, IMHO it would be appropriate to represent STRs in info fields, whether that would be split out as per e.g. lobSTR's collection of fields, or (primarily) by a single string field containing the familiar CTG[30]CAG[50] bracketed repeat count notation (which would need to be parsed by applications, unlike separate info tags that would be parsed primarily by the VCF library).
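As a rough illustration of the application-side parsing mentioned above, the following Python sketch splits a string such as CTG[30]CAG[50] into (unit, count) pairs. It assumes units are written with the letters A, C, G, T, N and counts are plain non-negative integers. The function names are hypothetical and no INFO tag name is implied.

```python
import re

# One repeat unit followed by its count in square brackets, e.g. CTG[30].
UNIT_COUNT = re.compile(r"([ACGTN]+)\[(\d+)\]")


def parse_repeat_notation(text):
    """Split e.g. 'CTG[30]CAG[50]' into [('CTG', 30), ('CAG', 50)]."""
    segments = []
    pos = 0
    while pos < len(text):
        match = UNIT_COUNT.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected text at offset {pos}: {text[pos:]!r}")
        segments.append((match.group(1), int(match.group(2))))
        pos = match.end()
    return segments


def expand(segments):
    """Write out the full sequence described by the (unit, count) pairs."""
    return "".join(unit * count for unit, count in segments)


segments = parse_repeat_notation("CTG[30]CAG[50]")
```

A VCF library would hand the application only the raw string, so every tool would need an equivalent of `parse_repeat_notation`. That is the trade-off against separate INFO tags.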
For future VCF revisions, one idea discussed is that it would be nice if the STR repeat count notation could be used in REF and ALT fields alongside the existing breakend and other notation (assuming the brackets in such notation and those in breakends could be unambiguously parsed):
… REF ALT …
… CAG[25] CAG[34],CAG[38] …
Allowing it in REF (as well as ALT) allows the reference allele to be shown naturally, though there is some interplay with normalisation to be considered.
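If the notation were allowed in both columns, a consumer could compare each ALT allele directly against REF. The Python sketch below assumes single-unit alleles of the form UNIT[n] and a comma-separated ALT field, as in the example row above. The helper names are hypothetical, and breakend notation and normalisation are ignored.

```python
import re

# A single-unit allele such as CAG[25]; compound alleles are out of scope here.
ALLELE = re.compile(r"^([ACGTN]+)\[(\d+)\]$")


def repeat_count(allele):
    """Return (unit, count) for an allele written as UNIT[n]."""
    match = ALLELE.match(allele)
    if match is None:
        raise ValueError(f"not in UNIT[n] form: {allele!r}")
    return match.group(1), int(match.group(2))


def count_changes(ref, alt_field):
    """Return the change in repeat count of each ALT allele relative to REF."""
    ref_unit, ref_n = repeat_count(ref)
    changes = []
    for alt in alt_field.split(","):
        unit, n = repeat_count(alt)
        if unit != ref_unit:
            raise ValueError(f"ALT unit {unit!r} differs from REF unit {ref_unit!r}")
        changes.append(n - ref_n)
    return changes


changes = count_changes("CAG[25]", "CAG[34],CAG[38]")
```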
I guess it would also be interesting, for completeness, to consider complex STR patterns, where one STR site is composed of multiple repeat expansions. For instance: (CAG)[*](CAACAG)(CCG)[*]. The ExpansionHunter paper provides more details and examples: https://academic.oup.com/bioinformatics/article/35/22/4754/5499079
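To show what handling such a compound pattern could involve, here is a small Python sketch that reads the notation exactly as written in this comment: parenthesised units, each optionally followed by [*] to mark a variable repeat count. This is an assumption about the notation for illustration only and is not claimed to be ExpansionHunter's own input syntax. The names are hypothetical.

```python
import re

# A parenthesised unit, optionally followed by [*] marking a variable-length repeat.
SEGMENT = re.compile(r"\(([ACGTN]+)\)(\[\*\])?")


def parse_locus_structure(text):
    """Split e.g. '(CAG)[*](CAACAG)(CCG)[*]' into (unit, is_variable) pairs."""
    segments = []
    pos = 0
    while pos < len(text):
        match = SEGMENT.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected text at offset {pos}: {text[pos:]!r}")
        segments.append((match.group(1), match.group(2) is not None))
        pos = match.end()
    return segments


structure = parse_locus_structure("(CAG)[*](CAACAG)(CCG)[*]")
variable_units = [unit for unit, is_variable in structure if is_variable]
```

A genotype at this site would then need one count per variable segment, which is what a single repeat count in ALT cannot express.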
We have opted for the ComposedSequenceExpression concept in VRS, which is currently in a community PR review stage: https://github.com/ga4gh/vrs/pull/376. I am in favor of aligning Alleles using ComposedSequenceExpression to follow a convention similar to @jmarshall's REF/ALT proposal.
Implemented in #676