implement support for repeats
Originally reported by: Reece Hart (Bitbucket: reece, GitHub: reece)
See comments in hgvs.pymeta.
http://www.hgvs.org/mutnomen/recs-DNA.html#var
Lots of gotchas:
- reported as start position w/seq, or interval without seq. e.g., g.123TG[4], but not g.123_124TG[4]
- interleaved repeats: g.456TG[4]TA[9]TG[3] or g.456_465[4]466_489[9]490_499[3]
- () around repeat count -> uncertainty
- implicit data for hets: g.1209_4523[14];[23] (same as g.[1209_4523[14];1209_4523[23]])
- repeat counts may be uncertain. Although not in the mutnomen doc, I think these are legit: ((6)_22), (?_22).
This definitely merits a feature branch.
Links
- imported from: CORE-113 (Invitae access required)
- Bitbucket: https://bitbucket.org/biocommons/hgvs/issue/113
@reece Would it be reasonable to break this issue up into a separate ticket for each of the "gotchas" bullet points above and tackle them in some logical order? That would let us resolve the first bullet and then move on to refactoring to handle the subsequent concerns.
@andreasprlic This is the ticket you and I discussed this morning. I would love to do an MVP of repeat syntax support in the hgvs package if that is possible. @reece rightfully points out the nuances and gotchas involved with this syntax. I'm looking for the basics, such as supporting only the first bullet he lists.
Here's the way I think we could/should deliver this feature in this module.
First, the basics
- reported as start position w/seq, or interval without seq. e.g., g.123TG[4], but not g.123_124TG[4]
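To make the first bullet concrete, here is a minimal sketch using only the standard library `re` module. `SimpleRepeat` and `parse_simple_repeat` are hypothetical names for illustration, not part of the hgvs package, and the sketch assumes g. coordinates and plain ACGT repeat units.

```python
import re
from typing import NamedTuple, Optional


class SimpleRepeat(NamedTuple):
    """Hypothetical container for a basic repeat; not an hgvs class."""
    start: int
    end: Optional[int]   # set only for the interval form
    unit: Optional[str]  # set only for the position+sequence form
    count: int


# Either "<pos><seq>[n]" or "<start>_<end>[n]", never an interval plus a sequence.
_SIMPLE_REPEAT = re.compile(
    r"^g\.(?:(?P<pos>\d+)(?P<unit>[ACGT]+)|(?P<start>\d+)_(?P<end>\d+))"
    r"\[(?P<count>\d+)\]$"
)


def parse_simple_repeat(text: str) -> Optional[SimpleRepeat]:
    """Return a SimpleRepeat, or None if the string is not a basic repeat."""
    m = _SIMPLE_REPEAT.match(text)
    if m is None:
        return None
    count = int(m.group("count"))
    if m.group("unit"):
        return SimpleRepeat(int(m.group("pos")), None, m.group("unit"), count)
    return SimpleRepeat(int(m.group("start")), int(m.group("end")), None, count)
```

With this sketch, `g.123TG[4]` and `g.123_124[4]` parse, while `g.123_124TG[4]` returns `None`.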
Second, ranges
- repeat counts may be uncertain. Although not in the mutnomen doc, I think these are legit: ((6)_22), (?_22).
Again, while the parentheses () convey uncertainty, there are many examples in the wild of ranges written without them. I'd like to make sure we can support these non-compliant representations since they are fairly prevalent.
Here are a few examples from ClinVar:
- NM_001081560.3(DMPK):c.*224CTG[51_?]
- NM_000044.6:c.171GCA[10_36]
- NG_008845.2:g.6725GAA[(200_900)]
- NM_013437.5:c.-102CGG[(90_?)]
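As a rough sketch of how the text inside the square brackets might be handled, the snippet below accepts both the parenthesized and the bare range forms. `RepeatCount` and `parse_repeat_count` are hypothetical names, and treating the outer parentheses as a single "uncertain" flag is an assumption of this sketch, not something the hgvs package defines.

```python
import re
from typing import NamedTuple, Optional


class RepeatCount(NamedTuple):
    """Hypothetical container for a repeat count or count range."""
    low: Optional[int]   # None means "?" (unknown bound)
    high: Optional[int]  # None means "?" (unknown bound)
    uncertain: bool      # True if the whole count was wrapped in ()


# A bound is a number, "?", or either of those in its own parentheses.
_BOUND = r"(?:\((?:\d+|\?)\)|\d+|\?)"
_COUNT = re.compile(
    rf"^(?P<open>\()?(?P<low>{_BOUND})(?:_(?P<high>{_BOUND}))?(?P<close>\))?$"
)


def _to_int(bound: str) -> Optional[int]:
    # Per-bound parentheses are dropped here; this sketch does not track them.
    bound = bound.strip("()")
    return None if bound == "?" else int(bound)


def parse_repeat_count(text: str) -> RepeatCount:
    """Parse the text found between [ and ], e.g. '4', '10_36', '(90_?)'."""
    m = _COUNT.match(text)
    if m is None or bool(m.group("open")) != bool(m.group("close")):
        raise ValueError(f"unrecognized repeat count: {text!r}")
    low = _to_int(m.group("low"))
    # A single value such as "4" is treated as the range 4..4.
    high = low if m.group("high") is None else _to_int(m.group("high"))
    return RepeatCount(low, high, bool(m.group("open")))
```

Under these assumptions, `10_36` and `(200_900)` differ only in the `uncertain` flag, so the non-compliant forms still parse without being confused with the compliant ones.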
Third, complex repeats
- interleaved repeats: g.456TG[4]TA[9]TG[3] or g.456_465[4]466_489[9]490_499[3]
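For the sequence-based interleaved form, one possible approach is to read the start position once and then consume `unit[count]` segments until the string is exhausted. This is a sketch only: `parse_interleaved` is a hypothetical helper, it handles exact integer counts only, and it does not cover the interval-based form.

```python
import re
from typing import List, Tuple

_HEAD = re.compile(r"^g\.(\d+)")
_SEGMENT = re.compile(r"([ACGT]+)\[(\d+)\]")


def parse_interleaved(text: str) -> Tuple[int, List[Tuple[str, int]]]:
    """Split e.g. 'g.456TG[4]TA[9]TG[3]' into a start and (unit, count) pairs."""
    head = _HEAD.match(text)
    if head is None:
        raise ValueError(f"missing g.<position> prefix: {text!r}")
    segments = []
    pos = head.end()
    # Consume segments back to back; any leftover text is an error.
    while pos < len(text):
        m = _SEGMENT.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected text at offset {pos}: {text!r}")
        segments.append((m.group(1), int(m.group(2))))
        pos = m.end()
    if not segments:
        raise ValueError(f"no repeat segments found: {text!r}")
    return int(head.group(1)), segments
```

A simple repeat such as `g.123TG[4]` comes back as a one-element list, so the basic case falls out of the same code path.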
Fourth, genotypes / hets
- implicit data for hets: g.1209_4523[14];[23] (same as g.[1209_4523[14];1209_4523[23]])
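One way to deal with the implicit form is to normalize it to the explicit allele form before any further parsing. The sketch below does that with plain string handling; `expand_implicit_alleles` is a hypothetical name, and the function leaves anything it does not recognize unchanged.

```python
import re

# "<type>.<location>[a];[b]..." where the location is written only once.
_SHORTHAND = re.compile(
    r"^(?P<type>[a-z])\.(?P<loc>[^\[\];]+)"
    r"(?P<counts>\[[^\[\]]+\](?:;\[[^\[\]]+\])+)$"
)


def expand_implicit_alleles(text: str) -> str:
    """Rewrite 'g.LOC[a];[b]' as 'g.[LOC[a];LOC[b]]'; pass other input through."""
    m = _SHORTHAND.match(text)
    if m is None:
        return text
    # Repeat the shared location in front of every bracketed count.
    counts = m.group("counts").split(";")
    alleles = ";".join(m.group("loc") + count for count in counts)
    return f"{m.group('type')}.[{alleles}]"
```

Normalizing first would let the genotype case reuse whatever parsing exists for single-allele repeats.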
@andreasprlic Any help you can offer in breaking this up, or in managing that work, so we can start delivering on it would be wonderful.