genometools
translation code and ambiguity handling
Currently, the protein translation engine sometimes makes unexpected calls when ambiguity is involved. For example,
$ gt -i
gt (GenomeTools) 1.5.3 (2014-06-19 11:44:22)
> print(gt.translate_dna("nag"))
*
So NAG is translated to a stop codon, although it could equally be CAG (Q) or AAG (K). The only stop codon possible in this context is TAG, so I would expect the translator to return X here instead of a stop codon. This currently makes some of my validator scripts give incorrect results: they report internal stop codons in gene models that may not have any.
Any feedback on this? What behaviour would you expect? I propose returning X if an N at a given DNA position would result in any ambiguity. If the N does not change the resulting amino acid, return the amino acid character.
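To make the proposal concrete, here is a minimal Python sketch of the fully generic rule: expand every ambiguous base into its possible concrete bases, translate each resulting codon, and return the amino acid only if all of them agree. This is not GenomeTools code. The names `STANDARD_TABLE`, `IUPAC` and `translate_codon` are hypothetical, the table is the standard genetic code, and treating all IUPAC codes (not just N) as wildcards is an assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# IUPAC nucleotide codes mapped to the concrete bases they may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon, table=STANDARD_TABLE):
    """Return the amino acid if every expansion of the codon agrees, else 'X'."""
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three bases")
    # One string of candidate bases per codon position.
    options = [IUPAC[base] for base in codon.upper()]
    # Translate every concrete codon the ambiguous codon could represent.
    amino_acids = {table["".join(c)] for c in product(*options)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


print(translate_codon("nag"))  # ambiguous: could be Q, K, E or a stop
print(translate_codon("ctn"))  # unambiguous: all four expansions are L
```

Because the lookup table is passed in as a parameter, the same function would work with other translation tables, at the cost of up to 64 lookups per codon in the worst case (NNN).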
I think that's the most sensible approach.
Sounds like a good approach. Maybe issue a warning, too.
This is actually not as trivial as it may look, as I just found out. At least not if you want to support multiple translation tables and keep your code clean. I have now changed the code so that it no longer silently picks a wildcard value: it returns an X for wildcards in the first or second base, and checks for wildcard ambiguity in the third base. This misses some cases such as YTR which -- in the standard genetic code -- always translates to L. I think making this fully generic would take me too much time for now and also blow up the code. We can keep this ticket open as a feature request if you want.
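For illustration, the simplified rule described above might look roughly like the following Python sketch. It is not the actual GenomeTools implementation (which is written in C); the function name `translate_codon_simple` and the lookup tables are hypothetical, and it assumes that any IUPAC ambiguity code counts as a wildcard.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# IUPAC nucleotide codes mapped to the concrete bases they may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon_simple(codon, table=STANDARD_TABLE):
    """X for a wildcard in base 1 or 2; resolve ambiguity only in base 3."""
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three bases")
    first, second, third = codon.upper().replace("U", "T")
    # A wildcard in the first or second position always yields X.
    if first not in BASES or second not in BASES:
        return "X"
    # Only the third position is expanded and checked for agreement.
    amino_acids = {table[first + second + base] for base in IUPAC[third]}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


print(translate_codon_simple("ctn"))  # third-base wildcard, resolved
print(translate_codon_simple("ytr"))  # first-base wildcard, reported as X
```

The second call shows the trade-off mentioned above: YTR yields X under this rule even though every expansion encodes L in the standard code. The benefit is that at most four table lookups are needed per codon.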