hgvs Converge variant models for all sequence types

This issue provides a place to collect ideas and comments regarding variant modeling, and particularly a single model that represents all variant and sequence types as coherently as possible.

The problem

The current variant model consists of three components: the sequence accession and the "posedit", which itself consists of a position and an edit. This layout achieves two goals: 1) separate the sequence from the posedit so that, in principle, a variant could consist of multiple posedits on one sequence; 2) distinguish the position from the edit so that each of these may be treated as separable in the grammar and the data model. For example, the edit might be a dup, or the position might be an intronic coordinate (which requires a different model). This layout works well in covering most variant types and sequence types represented by HGVS guidelines.

Unfortunately, the HGVS recommendation effectively interleave protein locations and edits. For a grammar, the consequence is a very significant increase in parsing complexity.

For example, compare these expressions: NP_012345.6:p.Ala20Tyr and NP_012345.6:p.Cys30_Glu32delCysAspGluinsThrValTrp When parsing the substitution, Ala is taken as both the AA at position 26 and the reference nucleotide, and Tyr is the alt. There's no explicit end position. When parsing the delins, Cys and Glu should be associated with the start and end, with a separate reference sequence of CysAspGlu and an alt sequence of ThrValTrp.

That positions are represented with amino acids, sometimes with and sometimes separate from the reference amino acids at that range, creates two essential issues:

The parse tree of posedit := <pos, edit> becomes non-separable: you can't always parse pos independent of edit. In particular, in a a substitution, the (single) parsed start AA (Ala) must be injected in the model twice: once in the pos (start), and once in the ref. This start AA must exist (to be valid).
The start and end ref nucleotides in a delins are each represented twice: once in the location and once as the first/last position of the del sequence. This creates potential ambiguity when parsing, and it makes it unclear which nucleotide in the model is authoritative. Furthermore, the del sequence is optional.

Yes, this is a soluble issue. We can implement HGVS as written. But implied HGVS model as written is especially misguided here (IMO). We should be both practical for users as well as attempt to discourage this sanctioned thinko. A much better course would be to represent protein variants exactly as NA variants are represented.

Bitbucket: https://bitbucket.org/biocommons/hgvs/issue/333

Jun 27 '16 20:06 reece

Original comment by Reece Hart (Bitbucket: reece, GitHub: reece):

I generally eschew implicit behaviors as well. Like you, I'm committed to parsing being just parsing. The validator does some lightweight validation, such as requiring correct sequence residue types (dna, rna, aa) with each of the variant types (c, g, n, r, m, p), admitting intronic positions only with c and n variant types, and requiring consistent 1- or 3- letter amino acids. But, there's definitely no external lookup in the parser (this is optional behavior in the EasyVariantMapper(replace_reference = True | False)).

Interesting idea to report number of residues matched/mismatched. To be clear, do you mean that if you write NM_012345.6.c:20_25delCATCAT and the reference is actual CATCCT, you want the validator to report 5/6 matches? If that's right, please consider making an issue -- that's really easy and doesn't need any discussion.

Jun 27 '16 21:06 reece

Original comment by Kevin Jacobs (Bitbucket: kevinjacobs, GitHub: Unknown):

To add one more consideration: I feel strongly that the parser should not automagically infer nucleotides or amino acids not provided by the input text. That sort of elaboration should be done as an explicit request to the API after initial parsing. (This may be how things done are in the hgvs package now...)

In a similar/related vein, the validator should inform the caller of the number of reference residues that it could verify, so that we can differentiate a meaningfully valid HGVS systematic from ones that makes no reference assertions at all.

Jun 27 '16 21:06 reece

Original comment by Reece Hart (Bitbucket: reece, GitHub: reece):

Long discussion with Kevin Jacobs:

On Fri, Jun 24, 2016 at 8:29 AM, Kevin Jacobs [email protected] wrote:

I'm confused. Why does posedit.edit.ref=''? How do I get the equivalent of ref from p.posedit.pos?

Sorry if this is well documented...

I hate this part of the code.

Ultimately, the cause is that, unlike DNA systematics, proteins interleave reference nucleotides and coordinates. From a parsing perspective, means enumerating and considering all combinations of location and variant types jointly rather than independently (as you can with NA). One way put lipstick on that pig was to create a new position type that parsed the ref AA with the position, but this required NOT parsing the ref into the edit.

Or at least I couldn't figure it out without enumerating all the cases. At the time, we were a bit under the gun so I opted for a the disappointing easy route rather than a more correct and difficult one.

That means that the ref aa is part of the pos (which is why there's an AAPosition) rather than the edit:

#!python

>>> p = hp.parse_hgvs_variant('NP_000010.1:p.Ser158Asn')
>>> p.posedit.pos.start.aa
u'S'
>>> type(p.posedit.pos.start)
<class 'hgvs.location.AAPosition'>

The explanation makes me sad because I worked really hard on a coherent interface that hid lots of complexity.

I do think we can do better that what's there, and perhaps it's not even that hard, but it's not been a priority. The relevant grammar rules are p_posedit and pro_edit. I'd love to hear ideas! I could use a Jacobian insight on this one!

On Sun, Jun 26, 2016 at 11:27 AM, Kevin Jacobs [email protected] wrote:

If I'm on the right track, then it may be useful to consider using a simple position or interval for the pos and creating an AAEditInterval for p. variants that span more than one aa that contains attributes for first_aa, last_aa and alt.

What do you think?

I primarily think that the basic model of reference sequence, position, and edit (or something akin to this tuple), is sound for all nucleotide-level variant descriptions on all sequence types. Having type-specific classes is (IMO) an indication that we're modeling the syntax and not necessarily the concepts, and that vexes me. I was quite dismayed when we had to add the AAPosition type.

Part of the reason I want this is that the generalized edit model, however flawed, is easy for users to understand and manipulate.

So, I'd love to figure out a way to avoid the type-specific classes.

Jun 27 '16 21:06 reece

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Feb 27 '24 01:02 github-actions[bot]

This issue was closed because it has been stalled for 7 days with no activity.

Mar 07 '24 01:03 github-actions[bot]