exonerate
exonerate copied to clipboard
Alignment issues with proteins containing selenocysteine
When I attempt to align SELENON RefSeq transcript NM_020451.2 with it's corresponding protein sequence (NP_065184.2), exonerate reports:
** FATAL ERROR **: Unknown amino acid [U]
exiting ...
This suggests that selenocysteine is not being handled correctly in spite of me using the most recent patched version:
exonerate from exonerate version 2.4.0
Using glib version 2.56.1
Built on Apr 25 2023
Any guidance on this would be appreciated.
@leicray Thanks for the report. Was this error observed when using exonerate built from the master branch? (see https://github.com/nathanweeks/exonerate/issues/5#issuecomment-1068141590)
Thanks for the reply. I am still waiting to hear back from the sysop who built exonerate for the system that I am using. I will give him a nudge tomorrow and let you know what he says.
exonerate has been rebuilt from the master branch and selenocysteine is now being handled correctly.
However, when I align the transcript CDS with the protein sequence for the SELENON gene, the numbering of the nucleotide sequence is incorrect:
C4 Alignment:
------------
Query: NM_020451.2:56-1828 Homo sapiens selenoprotein N (SELENON), transcript variant 2, mRNA
Target: NP_065184.2 selenoprotein N isoform 2 [Homo sapiens]
Model: ungapped:dna2protein
Raw score: 3080
Query range: 0 -> 1770
Target range: 0 -> 590
1 : ATGGGCCGGGCCCGGCCGGGCCAACGCGGGCCGCCCAGCCCCGGCCCCGCCGCGCAGCCTCC : 61
MetGlyArgAlaArgProGlyGlnArgGlyProProSerProGlyProAlaAlaGlnProPr
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 : MetGlyArgAlaArgProGlyGlnArgGlyProProSerProGlyProAlaAlaGlnProPr : 21
The number at the end of the first line of the alignment ought to be 62 [(3x20)+2].
Adjusting the --aligmentwidth
parameter results in unpredictable changes to the numbering.
Could you please run the alignment on your own installation of exonerate and let me know what you see? The relevant files are attached:
@leicray Apologies for the slow follow-up---I reproduced the output you observed in my environment. I'll need to spend some additional time in a debugger to understand the code involved and why "61" is being emitted.
Have you had time to debug the numbering issue yet?
Have you had time to debug the numbering issue yet?
I spent a debugging session last summer mostly reacquainting myself with the use of gdb to debug exonerate, but wasn't able to resolve the issue in that time. Due to competing priorities, I won't be able to devote additional effort until at least the end of the year. In the interim, any PRs would be welcomed.
I've submitted a PR for the alignment view problem.
Having submitted that PR, it works for a very old test case I have from Raymond, but not for the test case posted here, it's out by +1 instead of -1 now, so I've withdrawn the PR as a proper fix is clearly more complex.
In case it's useful, I have a comment from one of the original authors Guy Slater about the alignment view, way back in 2011:
I have tried to follow the BLAST style coordinates for alignment display numbering as that is what people expect, but it not the same as normal coordinates.
The code for alignment display over spilt codons is surprising fiddly.
Just in case - you shouldn't be trying to parse the alignment display.
There are much easier to parse and more robust ways of doing this, such as with ryo per-transition output: eg.
--ryo 'BEGIN\n{%Pqb %Ptb %Pqs %Pts %Pl\n}END\n'
@lgretton Thanks for the attempt! I agree that if the goal is to parse exonerate output, --ryo
is the way to go. It would still be nice if the alignment display numbering worked as expected.
Exonerate was written to be used in an analysis pipeline and output to stdout
was something of an afterthought. This was confirmed to me many years ago by Ewan Birney "Deputy Director General of EMBL & Joint Director of EMBL-EBI" who had supervised Guy Slater.
For my purposes I need to view the actual alignments. I will play around with the --aligmentwidth
parameter and see if I can figure out why it sometimes works and other times does not.