hgvs Handle alignments that are not full-length

Originally reported by: Reece Hart (Bitbucket: reece, GitHub: reece)

Issue #346 revealed that NCBI and UTA contain alignments that do not start at the transcript start. hgvs assumed that they do, which led to mapping errors. The immediate fix was to require that alignments start at 0.

The goal for this issue is to implement better handling for missing data. Specific goals:

Modify IntervalMapper (or TranscriptMapper) to understand non-zero starts
Disable the use of incomplete alignments by default, but allow users to opt in
Check 3' end: requires transcript length, which is not currently available except by fetching the sequence.

A similar problem likely exists for transcripts that do not fully align on the 3' end. However, those will project variants up to that point without error.
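
The checks listed above could be sketched roughly as follows. This is only an illustration, not hgvs code: `find_alignment_gaps`, `check_alignment`, and the `allow_incomplete` flag are hypothetical names, and the sketch assumes exons are given as zero-based, half-open `(start, end)` intervals in transcript coordinates and that the transcript length is already known.

```python
from typing import List, NamedTuple, Tuple


class AlignmentGaps(NamedTuple):
    five_prime: int                  # unaligned bases before the first aligned exon
    three_prime: int                 # unaligned bases after the last aligned exon
    internal: List[Tuple[int, int]]  # unaligned transcript spans between aligned exons


def find_alignment_gaps(tx_exons, tx_length):
    """Report which parts of a transcript are not covered by its aligned exons.

    tx_exons: iterable of (start, end) in transcript coordinates (assumed
    zero-based, half-open). tx_length: full transcript length.
    """
    exons = sorted(tx_exons)
    if not exons:
        raise ValueError("no aligned exons")
    if exons[-1][1] > tx_length:
        raise ValueError("alignment extends past the transcript end")
    # A gap exists wherever one exon ends before the next one starts.
    internal = [
        (prev_end, start)
        for (_, prev_end), (start, _) in zip(exons, exons[1:])
        if start > prev_end
    ]
    return AlignmentGaps(exons[0][0], tx_length - exons[-1][1], internal)


def check_alignment(tx_exons, tx_length, allow_incomplete=False):
    """Reject incomplete alignments unless the caller explicitly opts in."""
    gaps = find_alignment_gaps(tx_exons, tx_length)
    complete = gaps.five_prime == 0 and gaps.three_prime == 0 and not gaps.internal
    if not complete and not allow_incomplete:
        raise ValueError(f"incomplete alignment: {gaps}")
    return gaps
```

With this shape, a caller that opts in still receives the gap report and can decide how to treat positions that fall in unaligned regions.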

Bitbucket: https://bitbucket.org/biocommons/hgvs/issue/348

Aug 01 '16 18:08 reece

Hi Reece,

We recently ran into this issue with a few variants in the specific transcripts highlighted in issue #346. Would you be comfortable if a member of our organization looked into making this improvement and then put out a PR? If so, do you have any initial thoughts on what solving this would look like?

Mar 18 '19 20:03 akeeeshi

Hi @akeeeshi -

I'd be thrilled to have someone address this. A PR (with tests) would be extremely appreciated.

The core of the issue is that hgvs assumes that transcripts align to the genome exon-wise and that they are full length. So, the problem is with transcripts that fail these assumptions. As far as I can tell, they occur in regions that are likely misassembled. A few of the causes that I know of are:

Upstream exons are missing. Therefore, the first aligned exon is not really the first exon (i.e., the alignment doesn't start at sequence position 0).
Downstream exons are missing. These will work fine up to the stop of the alignment.
Exons that consist of discontinuous alignment spans. That is, there's a region of unaligned sequence in the middle of the exon.

My inclination is that the current code is unnecessarily complex and that the mapper should be rewritten from scratch to account for the failed assumptions. If I were to implement this again, I would rely on mapping based entirely on CIGAR strings, augmented to distinguish gaps on both sides of the alignment. (CIGAR strings have an N symbol to account for introns, but there's no symmetric version to account for gaps in the genome due to unaligned regions.)
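
To make the CIGAR idea concrete, here is a minimal sketch of projecting a transcript position onto the genome by walking a CIGAR string. It is not the hgvs implementation. It assumes SAM-style operations with the transcript as the query (M/=/X consume both sequences, I consumes transcript only, D and N consume genome only), a forward-strand alignment, and a hypothetical `tx_start` offset for alignments that do not begin at transcript position 0. It does not include the augmented gap symbol described above.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDN=X])")
_CIGAR_FULL = re.compile(r"(\d+[MIDN=X])+")


def tx_to_genome(cigar, tx_pos, tx_start=0, g_start=0):
    """Map a zero-based transcript position to a zero-based genome position.

    tx_start / g_start: transcript and genome coordinates of the first
    aligned base (tx_start > 0 means the 5' end is unaligned).
    Returns None if tx_pos falls in transcript sequence with no genomic
    counterpart (an insertion relative to the genome).
    """
    if not _CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    if tx_pos < tx_start:
        raise ValueError("position lies before the start of the alignment")
    t, g = tx_start, g_start
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":            # aligned block: advances both sequences
            if tx_pos < t + n:
                return g + (tx_pos - t)
            t += n
            g += n
        elif op == "I":            # transcript-only bases
            if tx_pos < t + n:
                return None
            t += n
        else:                      # D or N: genome-only bases (deletion, intron)
            g += n
    raise ValueError("position lies beyond the end of the alignment")
```

For example, with `cigar="100M500N50M"`, `tx_start=35`, and `g_start=1000`, transcript position 35 maps to genome position 1000, and position 135 (the first base after the intron) maps to 1600. Positions before 35 or at 185 and beyond raise an error instead of being silently mis-mapped.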

If you decide to pursue this, please let me know how I can help.

Mar 19 '19 00:03 reece

Appreciate it, @reece. We will take a stab at this and let you know if we have any questions or hit any snags.

Mar 19 '19 19:03 akeeeshi

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Feb 27 '24 01:02 github-actions[bot]

This issue was closed because it has been stalled for 7 days with no activity.

Mar 06 '24 01:03 github-actions[bot]
