genometools

translation code and ambiguity handling

Open satta opened this issue 11 years ago • 4 comments

Currently, the protein translation engine sometimes makes unexpected calls when ambiguity is involved. For example,

$ gt -i
gt (GenomeTools) 1.5.3 (2014-06-19 11:44:22)
> print(gt.translate_dna("nag"))
*

So NAG is translated to a stop codon, although it could equally be CAG (Q), AAG (K), or GAG (E). The only stop codon possible in this context is TAG, so I would expect the translator to give me X here instead of a stop codon. As a result, some of my validator scripts currently give incorrect results, reporting internal stop codons in gene models that may not have any.

satta avatar Nov 13 '14 15:11 satta

Any feedback on this? What behaviour would you expect? I propose to return X if an N at a given DNA position would result in any ambiguity. If the N does not change the resulting amino acid, return the amino acid character.
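
The proposed rule can be sketched in standalone Python as follows. This is only an illustration, not GenomeTools code: the names `STANDARD_TABLE`, `IUPAC` and `translate_codon` are hypothetical, only the standard genetic code is covered, and extending the rule from N to the other IUPAC ambiguity codes is an assumption. The idea is to expand the codon into every concrete codon it could stand for and return the amino acid only if all of them agree.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
# (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# IUPAC nucleotide codes mapped to the concrete bases they may represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon, table=STANDARD_TABLE):
    """Translate one codon; return 'X' if ambiguity changes the result."""
    # Possible concrete bases at each of the three positions.
    options = [IUPAC[base] for base in codon.upper()]
    # Translate every concrete codon the ambiguous codon could stand for.
    amino_acids = {table["".join(c)] for c in product(*options)}
    # Report the amino acid only when all expansions agree.
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


print(translate_codon("nag"))  # ambiguous: TAG, CAG, AAG and GAG differ
print(translate_codon("ctn"))  # unambiguous: every CTN codon is leucine
```

With this rule, `nag` yields X rather than a stop codon, while `ctn` still yields L because the N does not change the amino acid.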

satta avatar Nov 14 '14 16:11 satta

I think that's the most sensible approach.

standage avatar Nov 14 '14 17:11 standage

Sounds like a good approach. Maybe issue a warning, too.

gordon avatar Nov 14 '14 18:11 gordon

This is actually not as trivial as it may look, as I just found out, at least not if you want to support multiple translation tables and keep your code clean. I have now changed the code so that it no longer silently picks a value for a wildcard. Instead, it returns an X for wildcards in the first or second base and checks for wildcard ambiguity in the third base, even though that misses some cases such as YTR, which -- in the standard genetic code -- always translates to L. I think making this fully generic would take me too much time for now and would also blow up the code. We can keep this ticket open as a feature request if you want.
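
To make the trade-off concrete, here is a minimal Python sketch of the simplified rule as described in this comment. It is not the actual GenomeTools change: `translate_codon_heuristic`, `STANDARD_TABLE` and `THIRD_BASE` are hypothetical names, and only the standard genetic code is modelled. Any wildcard in the first or second base gives X, and only the third base is expanded and checked for agreement.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
# (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Concrete bases that each IUPAC code may represent in the third position.
THIRD_BASE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon_heuristic(codon, table=STANDARD_TABLE):
    """Return 'X' for any wildcard in base 1 or 2; expand only base 3."""
    first, second, third = codon.upper().replace("U", "T")
    # A wildcard in the first or second position is never resolved.
    if first not in "ACGT" or second not in "ACGT":
        return "X"
    # Expand the third position and check whether all results agree.
    amino_acids = {table[first + second + base] for base in THIRD_BASE[third]}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


print(translate_codon_heuristic("ctn"))  # third-base wildcard, still leucine
print(translate_codon_heuristic("ytr"))  # first-base wildcard, reported as X
```

Under this rule `ctn` is still resolved to L, but `ytr` comes back as X even though CTA, CTG, TTA and TTG all encode leucine, which is the missed case mentioned above.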

satta avatar Nov 14 '14 22:11 satta