Optionally raise exception for variants within ambiguous indel alignment regions

Open reece opened this issue 8 years ago • 3 comments

Originally reported by: Reece Hart (Bitbucket: reece, GitHub: reece)


If a variant lies within a region of ambiguous indel alignment, raise an exception rather than projecting the variant.

Background

By way of example, here's the CIGAR from UTA for NM_000348.3:

SRD5A2 │ NM_000348.3 │ NlxDi │ 158=1D193=;164=;102=;151=;1677=

According to this alignment, NM_000348.3 position 159 is deleted in the genome.

However, this representation is ambiguous. The region looks like this:

     158 162
     |   |
     GCCCT  < transcript
     G-CCT  < UTA alignment (158=1D193=)
     GCC-T  < splign alignment (160=1D191=)
     |   |
     |   31805880
     31805883

So, 158=1D193=, 159=1D192=, and 160=1D191= are indistinguishable under shuffling in this homopolymer region.
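
The equivalence can be checked mechanically. Here is a minimal Python sketch, assuming (as in the reading above) that the deleted base is counted in transcript coordinates, and using only the five-base window shown in the diagram. The function `deletion_shuffle_range` and the variable names are made up for illustration; they are not part of hgvs or UTA.

```python
def deletion_shuffle_range(seq, start, length=1):
    """Return (leftmost, rightmost) 0-based start indexes at which deleting
    `length` bases from `seq` gives the same result as deleting at `start`."""
    left = start
    # Moving the deletion one base left is neutral if the base entering the
    # deleted span equals the base leaving it.
    while left > 0 and seq[left - 1] == seq[left + length - 1]:
        left -= 1
    right = start
    # Same argument for moving one base right.
    while right + length < len(seq) and seq[right] == seq[right + length]:
        right += 1
    return left, right


window = "GCCCT"          # transcript positions 158..162 from the diagram
window_start = 158        # 1-based transcript position of window[0]
exon_len = 158 + 1 + 193  # transcript bases covered by 158=1D193=

left, right = deletion_shuffle_range(window, 159 - window_start)
cigars = [
    f"{p - 1}=1D{exon_len - p}="
    for p in range(window_start + left, window_start + right + 1)
]
print(cigars)
```

This prints the three CIGARs listed above. The flanking G and T stop the shuffle on either side, so the short window is sufficient for this case.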

Aligners are biased toward left- or right-shuffled indels. When the placement is ambiguous, there's no right answer: the "correct" projection of a variant across an ambiguous region depends on which alignment is used. hgvs currently returns only the projection derived from the UTA alignment.

Two things should happen here. First, UTA should use the CIGAR strings from NCBI's GFF files. See https://github.com/biocommons/uta/issues/201.

Second, this issue is a feature proposal to identify such regions and raise an exception. (Alternatively, with the warning system from #166, we could emit a warning.)
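
To make the proposal concrete, here is a rough sketch of what such a check might look like. Everything named here (`AmbiguousAlignmentError`, `AmbiguousAlignmentWarning`, `ambiguous_regions`, `check_position`) is hypothetical and not existing hgvs API. The sketch handles only `=` and `D` operations within a single exon and assumes both consume transcript bases, as in the reading of the CIGAR above. The toy sequence is invented for the example.

```python
import re
import warnings


class AmbiguousAlignmentError(Exception):
    """Hypothetical exception; not part of hgvs."""


class AmbiguousAlignmentWarning(UserWarning):
    """Hypothetical warning category; not part of hgvs."""


def ambiguous_regions(cigar, tx_seq):
    """Return 0-based half-open transcript intervals over which a D gap can shuffle."""
    if not re.fullmatch(r"(\d+[=D])+", cigar):
        raise ValueError(f"unsupported CIGAR for this sketch: {cigar!r}")
    regions = []
    pos = 0
    for length, op in re.findall(r"(\d+)([=D])", cigar):
        n = int(length)
        if op == "D":
            left = right = pos
            # Extend while moving the gap by one base leaves the alignment equivalent.
            while left > 0 and tx_seq[left - 1] == tx_seq[left + n - 1]:
                left -= 1
            while right + n < len(tx_seq) and tx_seq[right] == tx_seq[right + n]:
                right += 1
            if right > left:  # the gap has more than one valid placement
                regions.append((left, right + n))
        pos += n
    return regions


def check_position(pos, regions, strict=True):
    """Raise (strict) or warn (non-strict) if `pos` falls in an ambiguous region."""
    for start, end in regions:
        if start <= pos < end:
            msg = f"position {pos} lies in ambiguous indel region [{start}, {end})"
            if strict:
                raise AmbiguousAlignmentError(msg)
            warnings.warn(msg, AmbiguousAlignmentWarning)
            return False
    return True


# Toy data: the CCC run occupies indexes 3..5 of an invented sequence.
regions = ambiguous_regions("3=1D5=", "AAGCCCTAA")
print(regions)
```

With `strict=False` the same check emits a warning and returns `False` instead of raising, which is roughly the alternative mentioned above.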


  • Bitbucket: https://bitbucket.org/biocommons/hgvs/issue/392

reece, Dec 22 '16 03:12

Hi Reece, I'm interested in this proposal. Could any exception raised here take the form of a warning rather than an error that prevents mapping?

Thanks

Pete

Peter-J-Freeman, Mar 21 '17 09:03

Related: #166

reece, Aug 05 '18 15:08

I'll again put in a plug for turning these ambiguous alignment regions into double-sided gaps in the alignment representation (although CIGAR might not directly support that?), and using the 3'-most alignment position in the direction of the target reference sequence. So when going g_to_c, use the 3'-most alignment position in the direction of the transcript, and when going c_to_g, use the 3'-most alignment position in the direction of the genome. This would give consistent results despite inconsistent transcript alignments. A warning or at least a flag would still be useful since most tools would just use whatever alignment position they happen to get.
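
One possible reading of this suggestion, sketched in Python: treat the ambiguous run as a region in which the gap may sit anywhere, and pick the placement from the projection direction alone, ignoring where the aligner happened to put it. The function `gap_position` and its parameters are hypothetical, and the rule below is only an interpretation of the comment, not an agreed design. The minus-strand orientation is inferred from the diagram in the issue, where genomic coordinates decrease as transcript positions increase.

```python
def gap_position(lo, hi, target, strand):
    """Choose the transcript position for a one-base gap that may sit
    anywhere in transcript positions lo..hi (inclusive).

    target: "transcript" when projecting g_to_c, "genome" when projecting c_to_g
    strand: +1 if the transcript is on the genome's plus strand, -1 if minus
    """
    if target == "transcript":
        # 3'-most with respect to the transcript
        return hi
    if target == "genome":
        # 3'-most with respect to the genome; on the minus strand this is
        # the 5'-most position in transcript orientation
        return hi if strand == 1 else lo
    raise ValueError(f"unknown target: {target!r}")


# The homopolymer run from the issue: transcript positions 159..161, minus strand.
print(gap_position(159, 161, "transcript", -1))
print(gap_position(159, 161, "genome", -1))
```

Under this reading, the UTA alignment (gap at 159) and the splign alignment (gap at 161) would both be normalized to the same placement for a given direction, so the projection no longer depends on the aligner's shuffling bias.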

AngieHinrichs, Sep 20 '18 17:09

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot], Feb 26 '24 01:02

This issue was closed because it has been stalled for 7 days with no activity.

github-actions[bot], Mar 05 '24 01:03