genometools translation code and ambiguity handling

Open satta opened this issue 11 years ago • 4 comments

Currently, the protein translation engine sometimes makes unexpected calls when ambiguity is involved. For example,

$ gt -i
gt (GenomeTools) 1.5.3 (2014-06-19 11:44:22)
> print(gt.translate_dna("nag"))
*

So NAG is translated to a stop codon, although it could equally be CAG (Q), AAG (K) or GAG (E). TAG is the only stop codon possible in this context, so I would expect the translator to give me X here instead of a stop codon. This currently makes some of my validator scripts give incorrect results: they report internal stop codons in gene models that may not have any.

Nov 13 '14 15:11 satta

Any feedback on this? What behaviour would you expect? I propose to return X if an N at a given DNA position would result in any ambiguity. If the N does not change the resulting amino acid, return the amino acid character.
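To make the proposal concrete, here is a minimal Python sketch of the rule, not GenomeTools code. It assumes the standard genetic code only and handles N as the sole wildcard; the names `CODON_TABLE` and `translate_with_n` are made up for illustration. Each N is expanded to all four bases, and the amino acid is returned only if every expansion agrees.

```python
from itertools import product

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_with_n(codon):
    """Translate one codon; N may appear at any position (hypothetical helper).

    Returns the amino acid if all expansions of N agree, otherwise 'X'.
    """
    codon = codon.upper()
    # Each position contributes either its own base or all four bases for N.
    choices = [BASES if base == "N" else base for base in codon]
    amino_acids = {CODON_TABLE["".join(c)] for c in product(*choices)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


print(translate_with_n("nag"))  # ambiguous: TAG, CAG, AAG, GAG differ
print(translate_with_n("ctn"))  # unambiguous: CTT, CTC, CTA, CTG agree
```

With this rule, NAG no longer looks like a stop codon, while a fourfold-degenerate codon such as CTN still yields its amino acid.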

Nov 14 '14 16:11 satta

I think that's the most sensible approach.

Nov 14 '14 17:11 standage

Sounds like a good approach. Maybe issue a warning, too.

Nov 14 '14 18:11 gordon

This is actually not as trivial as it may look, as I just found out, at least not if you want to support multiple translation tables and keep your code clean. I have now changed the code so that it no longer silently picks a wildcard value. Instead, it returns an X for wildcards in the first or second base and checks for wildcard ambiguity in the third base. This misses some cases, such as YTR, which (in the standard genetic code) always translates to L. I think making this fully generic would take me too much time for now and would also blow up the code. We can keep this ticket open as a feature request if you want.
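For illustration, the difference between the two strategies can be sketched in Python. This is not the GenomeTools implementation: `IUPAC`, `STANDARD_TABLE`, `translate_third_base_only` and `translate_generic` are hypothetical names, and the code only mirrors the behaviour as described above. The translation table is passed as a parameter, since the expansion step itself does not depend on which genetic code is used.

```python
from itertools import product

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {"".join(c): aa
                  for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def _expand(codon, table):
    """Return the set of amino acids over all concrete codons matched."""
    choices = [IUPAC[base] for base in codon.upper()]
    return {table["".join(c)] for c in product(*choices)}


def translate_generic(codon, table=STANDARD_TABLE):
    """Fully generic: wildcards are resolved at every position."""
    amino_acids = _expand(codon, table)
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def translate_third_base_only(codon, table=STANDARD_TABLE):
    """Simplified rule: a wildcard in base 1 or 2 always gives 'X'."""
    if any(base not in "ACGT" for base in codon.upper()[:2]):
        return "X"
    return translate_generic(codon, table)


for codon in ("YTR", "MGR", "CTN", "NAG"):
    print(codon, translate_third_base_only(codon), translate_generic(codon))
```

Under the standard code, the two functions disagree only on codons such as YTR (L) and MGR (R), where a wildcard in the first base happens not to change the amino acid. Brute-force expansion costs at most 64 table lookups per codon, so the difficulty mentioned above presumably lies in fitting this into the existing code rather than in the algorithm itself.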

Nov 14 '14 22:11 satta
