exonerate icon indicating copy to clipboard operation
exonerate copied to clipboard

Alignment issues with proteins containing selenocysteine

Open leicray opened this issue 1 year ago • 10 comments

When I attempt to align SELENON RefSeq transcript NM_020451.2 with it's corresponding protein sequence (NP_065184.2), exonerate reports:

** FATAL ERROR **: Unknown amino acid [U] exiting ...

This suggests that selenocysteine is not being handled correctly in spite of me using the most recent patched version:

exonerate from exonerate version 2.4.0 Using glib version 2.56.1 Built on Apr 25 2023

Any guidance on this would be appreciated.

leicray avatar May 24 '23 10:05 leicray

@leicray Thanks for the report. Was this error observed when using exonerate built from the master branch? (see https://github.com/nathanweeks/exonerate/issues/5#issuecomment-1068141590)

nathanweeks avatar May 29 '23 20:05 nathanweeks

Thanks for the reply. I am still waiting to hear back from the sysop who built exonerate for the system that I am using. I will give him a nudge tomorrow and let you know what he says.

leicray avatar May 29 '23 21:05 leicray

exonerate has been rebuilt from the master branch and selenocysteine is now being handled correctly.

However, when I align the transcript CDS with the protein sequence for the SELENON gene, the numbering of the nucleotide sequence is incorrect:

C4 Alignment: ------------ Query: NM_020451.2:56-1828 Homo sapiens selenoprotein N (SELENON), transcript variant 2, mRNA Target: NP_065184.2 selenoprotein N isoform 2 [Homo sapiens] Model: ungapped:dna2protein Raw score: 3080 Query range: 0 -> 1770 Target range: 0 -> 590 1 : ATGGGCCGGGCCCGGCCGGGCCAACGCGGGCCGCCCAGCCCCGGCCCCGCCGCGCAGCCTCC : 61 MetGlyArgAlaArgProGlyGlnArgGlyProProSerProGlyProAlaAlaGlnProPr |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 1 : MetGlyArgAlaArgProGlyGlnArgGlyProProSerProGlyProAlaAlaGlnProPr : 21

The number at the end of the first line of the alignment ought to be 62 [(3x20)+2].

Adjusting the --aligmentwidth parameter results in unpredictable changes to the numbering.

Could you please run the alignment on your own installation of exonerate and let me know what you see? The relevant files are attached:

NM_020451.2_cds.fasta.txt NP_065184.2.fasta.txt

leicray avatar Jun 07 '23 10:06 leicray

@leicray Apologies for the slow follow-up---I reproduced the output you observed in my environment. I'll need to spend some additional time in a debugger to understand the code involved and why "61" is being emitted.

nathanweeks avatar Jun 20 '23 22:06 nathanweeks

Have you had time to debug the numbering issue yet?

leicray avatar Nov 09 '23 13:11 leicray

Have you had time to debug the numbering issue yet?

I spent a debugging session last summer mostly reacquainting myself with the use of gdb to debug exonerate, but wasn't able to resolve the issue in that time. Due to competing priorities, I won't be able to devote additional effort until at least the end of the year. In the interim, any PRs would be welcomed.

nathanweeks avatar Nov 11 '23 23:11 nathanweeks

I've submitted a PR for the alignment view problem.

lgretton avatar Nov 30 '23 08:11 lgretton

Having submitted that PR, it works for a very old test case I have from Raymond, but not for the test case posted here, it's out by +1 instead of -1 now, so I've withdrawn the PR as a proper fix is clearly more complex.

In case it's useful, I have a comment from one of the original authors Guy Slater about the alignment view, way back in 2011:

I have tried to follow the BLAST style coordinates for alignment display numbering as that is what people expect, but it not the same as normal coordinates.

The code for alignment display over spilt codons is surprising fiddly.

Just in case - you shouldn't be trying to parse the alignment display.

There are much easier to parse and more robust ways of doing this, such as with ryo per-transition output: eg.

--ryo 'BEGIN\n{%Pqb %Ptb %Pqs %Pts %Pl\n}END\n'

lgretton avatar Nov 30 '23 08:11 lgretton

@lgretton Thanks for the attempt! I agree that if the goal is to parse exonerate output, --ryo is the way to go. It would still be nice if the alignment display numbering worked as expected.

nathanweeks avatar Nov 30 '23 23:11 nathanweeks

Exonerate was written to be used in an analysis pipeline and output to stdout was something of an afterthought. This was confirmed to me many years ago by Ewan Birney "Deputy Director General of EMBL & Joint Director of EMBL-EBI" who had supervised Guy Slater.

For my purposes I need to view the actual alignments. I will play around with the --aligmentwidth parameter and see if I can figure out why it sometimes works and other times does not.

leicray avatar Dec 06 '23 14:12 leicray