hgvs
Optionally raise exception for variants within ambiguous indel alignment regions
Originally reported by: Reece Hart (Bitbucket: reece, GitHub: reece)
If a variant lies within a region of ambiguous indel alignment, raise an exception rather than projecting the variant.
Background
By way of example, here's the CIGAR from UTA for NM_000348.3:
SRD5A2 │ NM_000348.3 │ NlxDi │ 158=1D193=;164=;102=;151=;1677=
According to this alignment, NM_000348.3 position 159 is deleted in the genome.
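As a rough illustration of how that position falls out of the CIGAR, here is a minimal sketch. It assumes that `=` and `D` both consume transcript bases, which matches the reading above; `parse_cigar` and `deleted_transcript_positions` are hypothetical helpers, not part of hgvs or UTA.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string such as '158=1D193=' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([=XDIMN])", cigar)]

def deleted_transcript_positions(cigar):
    """Return 1-based transcript positions covered by D operations.

    Assumption: '=', 'X', 'M' and 'D' consume transcript bases, following
    the interpretation used in this issue (the D base exists in the
    transcript but not in the genome).
    """
    pos = 0
    deleted = []
    for length, op in parse_cigar(cigar):
        if op == "D":
            deleted.extend(range(pos + 1, pos + length + 1))
        if op in "=XMD":
            pos += length
    return deleted

# First exon of the UTA alignment quoted above.
exon1 = "158=1D193=;164=;102=;151=;1677=".split(";")[0]
print(deleted_transcript_positions(exon1))
```

Under the same assumption, the splign-style string `160=1D191=` would place the deleted base at position 161 instead.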
However, this representation is ambiguous. The region looks like this:
158 162
|   |
GCCCT  < transcript
G-CCT  < UTA alignment (158=1D193=)
GCC-T  < splign alignment (160=1D191=)
|   |
|   31805880
31805883
So, 158=1D193=, 159=1D192=, and 160=1D191= are indistinguishable under shuffling in this homopolymer region.
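The shuffling argument can be checked with a small sketch. It only handles a single-base deletion and works on the five-base window from the diagram; `ambiguous_deletion_range` is a hypothetical helper written for this example.

```python
def ambiguous_deletion_range(seq, pos):
    """Return (first, last) 1-based positions in seq where deleting one base
    gives the same sequence as deleting the base at pos.

    Only single-base deletions are considered, so the range is simply the
    run of identical bases around pos.
    """
    base = seq[pos - 1]
    first = last = pos
    while first > 1 and seq[first - 2] == base:
        first -= 1
    while last < len(seq) and seq[last] == base:
        last += 1
    return first, last

OFFSET = 157        # local position 1 is transcript position 158
window = "GCCCT"    # transcript positions 158..162

first, last = ambiguous_deletion_range(window, 159 - OFFSET)
print(first + OFFSET, last + OFFSET)

# Deleting any base in that range leaves the same sequence behind.
results = {window[:p - 1] + window[p:] for p in range(first, last + 1)}
print(results)
```

Every placement of the gap inside the run of C bases leaves the same remaining sequence, which is why the aligner's choice is arbitrary.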
Aligners have a bias toward left- or right-shuffled indels. When the placement is ambiguous, there's no right answer: the "correct" projection of a variant across an ambiguous region depends on which alignment is used. hgvs currently returns only the projection from the UTA alignment.
Two things should happen here. First, UTA should use the CIGAR strings from NCBI's gff files. See https://github.com/biocommons/uta/issues/201.
Second, this issue is a feature proposal to identify such regions and raise an exception. (Alternatively, with the warning system from #166, we could emit a warning.)
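To make the two options concrete, here is a minimal sketch of a check that either raises or warns. Everything in it is hypothetical: `AmbiguousAlignmentError`, `AmbiguousAlignmentWarning`, `check_ambiguous` and the `strict` flag are names invented for this example and do not exist in hgvs, and the region list is assumed to be precomputed elsewhere.

```python
import warnings

class AmbiguousAlignmentError(Exception):
    """Hypothetical exception; not part of the hgvs package."""

class AmbiguousAlignmentWarning(UserWarning):
    """Hypothetical warning category; not part of the hgvs package."""

def check_ambiguous(pos, ambiguous_regions, strict=True):
    """Report whether pos lies in an ambiguous indel alignment region.

    ambiguous_regions is a list of (start, end) pairs, 1-based inclusive.
    With strict=True an exception is raised; otherwise a warning is
    emitted and True is returned so the caller can still project.
    """
    for start, end in ambiguous_regions:
        if start <= pos <= end:
            msg = (f"position {pos} lies in ambiguous indel "
                   f"alignment region {start}-{end}")
            if strict:
                raise AmbiguousAlignmentError(msg)
            warnings.warn(msg, AmbiguousAlignmentWarning)
            return True
    return False

# Assumed region for NM_000348.3, taken from the example above.
regions = [(159, 161)]
print(check_ambiguous(100, regions))
print(check_ambiguous(160, regions, strict=False))
```

A `strict` switch like this would let callers choose between blocking the projection and merely being notified.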
- Bitbucket: https://bitbucket.org/biocommons/hgvs/issue/392
Hi Reece, I'm interested in this proposal. Could I request that any exception raised take the form of a warning rather than an error that prevents mapping?
Thanks
Pete
Related: #166
I'll again put in a plug for turning these ambiguous alignment regions into double-sided gaps in the alignment representation (although CIGAR might not directly support that?), and using the 3'-most alignment position in the direction of the target reference sequence. So when going g_to_c, use the 3'-most alignment position in the direction of the transcript, and when going c_to_g, use the 3'-most alignment position in the direction of the genome. This would give consistent results despite inconsistent transcript alignments. A warning or at least a flag would still be useful since most tools would just use whatever alignment position they happen to get.
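One possible reading of that rule, as a sketch only: treat the ambiguous region as a range of candidate gap placements and pick one end depending on the projection direction. `pick_gap_position` and its parameters are hypothetical, and the strand handling is an assumption about what "3'-most in the direction of the genome" means for a minus-strand transcript.

```python
def pick_gap_position(transcript_gap_range, direction, transcript_strand):
    """Choose one gap placement from an ambiguous range.

    transcript_gap_range: (first, last) 1-based transcript positions where
        the gap could equally well be placed.
    direction: 'g_to_c' or 'c_to_g'.
    transcript_strand: +1 or -1, orientation of the transcript on the genome.
    Returns the chosen placement in transcript coordinates.
    """
    first, last = transcript_gap_range
    if direction == "g_to_c":
        # Target is the transcript: take the 3'-most transcript position.
        return last
    if direction == "c_to_g":
        # Target is the genome: take the placement with the highest genomic
        # coordinate. On a minus-strand transcript that is the 5'-most
        # transcript position.
        return last if transcript_strand == 1 else first
    raise ValueError(f"unknown direction: {direction!r}")

# NM_000348.3 example: gap anywhere in 159..161, transcript on the minus
# strand (genomic coordinates decrease along the transcript in the diagram).
print(pick_gap_position((159, 161), "g_to_c", -1))
print(pick_gap_position((159, 161), "c_to_g", -1))
```

The point of the sketch is that the result depends only on the ambiguous range and the direction, not on which placement a particular aligner happened to report.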
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been stalled for 7 days with no activity.