hgvs icon indicating copy to clipboard operation
hgvs copied to clipboard

Fix variant region for ins and dup on intron-exon boundary

Open b0d0nne11 opened this issue 1 year ago • 15 comments

Fixes #655

Certain ins or dup variants that occur at the intron/exon boundaries have non-zero offsets but are still coding. This PR ensures that projections return accurate var_p representations for these variants.

Fixing ins variants are pretty straightforward because we know what to insert. Fixing dup variants is more involved because we need to calculate the reference sequence. This reference sequence must be derived from the var_g representation so that we get the correct bases from the intron rather than the previous or following exon. As a result there are some mappings that require AssemblyMapper. In these cases trying to perform the mapping with VariantMapper will throw an HGVSError. In addition, when computing the reference sequence we need to handle strand orientation and avoid normalization.

b0d0nne11 avatar Oct 26 '23 14:10 b0d0nne11

@reece I am wondering who can help review this pull request? I believe the core pdot calculation logic is currently giving incorrect results, and we are under some pressure to fix it.

I suppose the two tasks are:

  1. is there agreement that there currently is an issue that needs fixing?
  2. if so, is this approach acceptable?

The basic problem is-- when material is inserted exactly at the intron/exon boundary, should the inserted material be treated as part of the coding region, or as part of the intron?

We believe it is more correct to treat it as part of the coding sequence, because the entire intron and splicing region is intact. So in principle, splicing can still occur using the original splice site.

The current behavior is inconsistent. Insertions at the exon-intron boundary are treated as coding (because the cdot nomenclature doesn't include any intronic positions). Insertions at the intron-exon boundary are treated as non-coding (because the cdot nomenclature does include intronic positions). But I can't think of any reason these situations would be different biologically.

This PR both (1) makes the behavior consistent (always treats the inserted material as coding), and (2) makes the behavior correct (if you're convinced by the argument above).

This PR also makes hgvs consistent with Mutalyzer. Examples:

  1. NM_004380.3:c.3251-1_3251insA
  • Mutalyzer: NP_004371.2:p.(Ile1084Asnfs*3)
  • hgvs, current: null
  • hgvs, after this PR: NP_004371.2:p.(Ile1084AsnfsTer3)
  1. NM_024529.4:c.132-2_132-1dup
  • Mutalyzer: NP_078805.3:p.(Thr45Glyfs*65)
  • hgvs, current: null
  • hgvs, after this PR: NP_078805.3:p.(Thr45GlyfsTer65)

gostachowiak avatar Nov 15 '23 19:11 gostachowiak

Hi @gostachowiak: thanks for pinging me. I'll take a look this week and comment here.

reece avatar Nov 15 '23 19:11 reece

@reece reminder about this PR. If it would be easier to have a meeting to review together, we would be happy to do that. We will need to consider forking soon (maintaining a separate private version) if fixing this isn't an option. Our users consider this to be a serious flaw.

gostachowiak avatar Nov 29 '23 20:11 gostachowiak

@cassiemk @gostachowiak @b0d0nne11 : I'm very sorry that I let this issue fall off my radar. I appreciate the thoughtfulness of the original issue (#655) and the discussion here, and the completeness of the implementation. I have one question, which I'll pick up in a review in the next few minutes.

And I agree with you that the implementation is inconsistent, or at least not well-defined. Thank you for addressing it.

reece avatar Dec 18 '23 18:12 reece

Thanks for the thorough feedback @reece. I've updated the boundary conditions and tests after reviewing it with @gostachowiak. I believe he'll have a longer update soon. I've also rebased on main to pull in the build fix.

b0d0nne11 avatar Dec 19 '23 15:12 b0d0nne11

We discovered issues with 2 of the tests I added yesterday. I just pushed a fix for those.

b0d0nne11 avatar Dec 20 '23 16:12 b0d0nne11

@reece

Thank you for your comments and questions. It helped us to identify a few issues and clarify our thinking. After making a few tweaks to the pull request, here is where we ended up:

  • This pull request only addresses insertions of bases directly at the intron-exon and exon-intron boundary. It does NOT address duplications of regions that span the boundary.
  • Scenarios. There are 6 ways to have an insertion at an intron-exon or exon-intron boundary. 1. insertion between -1 and 0 2. insertion between 0 and +1 3. duplication of a region that ends at -1 4. duplication of a region that starts at +1 5. duplication of a region that starts at the beginning of an exon 6. duplication of a region that ends at the end of an exon
  • Cases 1-4 correspond to the 4 checks in altseqbuilder.py, where it assigns these as "EXON"
    • Before this PR, these variants would always be assigned as "INTRON" since by definition at least one position has an offset
  • Cases 5 & 6 aren't handled in the pull request because usually in these cases neither position has an offset, so these types of variants are already receiving a pdot value
  • Because this pull request only deals with insertions directly at the boundaries, none of the scenarios in the image you uploaded are covered
  • Here is an image showing types of insertions at boundaries where the numbering corresponds to the numbering above.
    • The arrows for numbers 1 and 2 represent insertions
    • The horizontal bars represent duplicated regions
    • Green = covered by this pull request (i.e. assigned as EXON). Red = not covered
    • In the current updated pull request, all scenarios corresponding to 1-4 are covered
    • For scenarios 5 and 6, which weren't changed in this pull request, there are some large duplications that would still be considered intronic
      • We do not intend to address this, because (1) it seems quite difficult to turn the red lines green given what information we have at this point in the code, and (2) once you start duplicating entire introns and/or exons I think all bets are off anyway, and the output of the hgvs package is no longer a sure guide to what the actual effect will be.
Screen Shot 2023-12-20 at 10 43 53 PM

gostachowiak avatar Dec 21 '23 03:12 gostachowiak

@reece I made some changes based on your feedback but I'm still doing some more testing on my end. I'll comment here once it passes. Thanks!

b0d0nne11 avatar Jan 02 '24 19:01 b0d0nne11

Hey @reece, my testing all checked out fine with the latest changes. Thanks again for your feedback.

b0d0nne11 avatar Jan 03 '24 13:01 b0d0nne11

@reece

Yes, I think you understand correctly-- the result depends on the intronic sequence.

  1. I think the first step is to see if there's agreement on the problem we're trying to solve. If I have variant: c.132-2_132-1dup

This is inserting material after the splice site, in the coding region. (even though both positions in the nomenclature have offsets).
What bases have been inserted? It's unspecified. It just says "dup".

Currently an empty pdot is returned, which I believe is wrong. The inserted material is part of the coding region and will produce a change to the amino acid sequence.

  1. If the current behavior is wrong, then I can think of a couple ways to fix it:
  • Instead of returning an empty pdot, return p.? or some sort of error saying that the pdot cannot be calculated because it depends on the intronic sequence.
  • Or, attempt to calculate the pdot.

To calculate the pdot, we need to know what the duplicated bases are. The only way we could come up with is to look back into the reference sequence.

Our typical use case is take something from a VCF file (chr/start/ref/alt) --> calculate cdot --> calculate pdot. We need this to work. In this situation it seems safe to go back to GRCh37 or GRCh38 to figure out what the duplicated bases actually are, since the VCF file said it explicitly. I think I may be naive about the "arbitrary choice of assembly" problem, possibly because we're just trying to do straightforward GRCh37/38 --> cdot --> pdot. In what situation would this procedure fail?

Tangentially, this appears to be a clear drawback of not listing out the duplicated bases in HGVS nomenclature. It makes c.132-2_132-1dup completely ambiguous, with no way to figure out what was actually inserted.

gostachowiak avatar Mar 05 '24 12:03 gostachowiak

Also, this might be semantics but we're not using an arbitrary assembly in the sense that we're picking GRCh37 or GRCh38 randomly. Rather, we're using the assembly tied to the AssemblyMapper class used to perform the mapping. This is why I had to add that thrown exception to VariantMapper in these cases since VM would give incorrect results that are inconsistent with your choice of AM.

Thinking about this now I might be able to avoid requiring AM if I change the function signature for VariantMapper.c_to_p to c_to_p(self, var_c, alt_ac, alt_aln_method=hgvs.global_config.mapping.alt_aln_method): but getting the alt_ac would still require an assembly AFAIK. This might even be cleaner from an interface perspective than throwing the exception but would probably break more code. Let me know if you want to me try.

b0d0nne11 avatar Mar 05 '24 14:03 b0d0nne11

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Apr 05 '24 01:04 github-actions[bot]

@andreasprlic agreed to investigate here.

reece avatar Apr 15 '24 16:04 reece

Rebased on main

b0d0nne11 avatar May 24 '24 15:05 b0d0nne11

Rebased on main

b0d0nne11 avatar Jun 04 '24 18:06 b0d0nne11

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 05 '24 02:07 github-actions[bot]