[feature suggestion] Reverse translate protein search expression into nucleotide regex or degenerate base sequence

Open samuell opened this issue 2 years ago • 2 comments

Prerequisites

  • [x] make sure you're using the latest version by running seqkit version
  • [x] read the usage

Describe your issue

  • [x] describe the problem
  • [-] provide a reproducible example

I have a use case where I located a small "motif" in a protein sequence that I'm interested in finding again in the nucleotide sequence coding for the protein.

The sequence I was looking for, expressed as a regex, is the following, so let's use it as an example here (. matches any letter, as per standard regex syntax):

E.SM.YSDN

I would now like to be able to run seqkit grep with this pattern not only against protein sequences, but also against nucleotide ones.

By using a genetic code table, I can do this by manually converting the pattern into a (DNA) nucleotide regex like this one (where [XY] is a character class allowing either X or Y in one position):

GA[AG]...AG[CT]ATG...TA[CT]AG[CT]GA[CT]AA[CT]
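The manual conversion described above can be sketched in a few lines of Python. This is only an illustration, not an existing seqkit feature: the function name protein_to_nucleotide_regex and the table CODON_REGEX are made up for this example, the standard genetic code (NCBI translation table 1) is assumed, and only plain one-letter amino acids plus the . wildcard are handled. In the standard table methionine is ATG, and serine, leucine and arginine each have six codons, so the generated pattern uses an alternation for them and is longer than a hand-built pattern that only uses AG[CT] for serine.

```python
import re

# Standard genetic code (NCBI translation table 1), one regex fragment per
# amino acid. S, L and R have six codons spread over two codon families,
# so they are written as a non-capturing alternation.
CODON_REGEX = {
    "A": "GC.",      "C": "TG[CT]",  "D": "GA[CT]",  "E": "GA[AG]",
    "F": "TT[CT]",   "G": "GG.",     "H": "CA[CT]",  "I": "AT[ACT]",
    "K": "AA[AG]",   "M": "ATG",     "N": "AA[CT]",  "P": "CC.",
    "Q": "CA[AG]",   "T": "AC.",     "V": "GT.",     "W": "TGG",
    "Y": "TA[CT]",
    "L": "(?:CT.|TT[AG])",
    "R": "(?:CG.|AG[AG])",
    "S": "(?:TC.|AG[CT])",
    "*": "(?:TA[AG]|TGA)",  # stop codons
    ".": "...",             # any amino acid -> any three bases
}


def protein_to_nucleotide_regex(pattern: str) -> str:
    """Reverse translate a simple protein pattern into a DNA regex.

    Hypothetical helper: supports only single-letter amino acids, '*' and
    the '.' wildcard. Character classes, quantifiers and groups in the
    protein pattern are rejected rather than guessed at.
    """
    parts = []
    for symbol in pattern.upper():
        if symbol not in CODON_REGEX:
            raise ValueError(f"unsupported symbol in protein pattern: {symbol!r}")
        parts.append(CODON_REGEX[symbol])
    return "".join(parts)


dna_regex = protein_to_nucleotide_regex("E.SM.YSDN")
print(dna_regex)
# GA[AG]...(?:TC.|AG[CT])ATG...TA[CT](?:TC.|AG[CT])GA[CT]AA[CT]

# Quick check against a DNA string that encodes E-A-S-M-K-Y-S-D-N.
print(bool(re.search(dna_regex, "CCGAAGCTTCAATGAAATATAGCGATAACGG")))
```

The printed pattern could then be passed to the existing seqkit grep --by-seq -r -p call. The third-position . also matches ambiguous bases such as N, which may or may not be wanted.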

Now, it would be useful to not need to do this translation manually, but rather be able to do something similar to:

seqkit grep --by-seq -r --protein-to-nucleotide -p "E.SM.YSDN" nucleotide_sequences.fa

Of course, a similar thing could be done using degenerate amino acid / base codes instead, if that is preferred over regular expressions.
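A rough sketch of the degenerate-base variant, again in Python and again with made-up names (DEGENERATE_CODONS, protein_to_degenerate_dna). It assumes the standard genetic code and IUPAC nucleotide codes (R = A/G, Y = C/T, H = A/C/T, N = any base). One caveat: serine, leucine and arginine cannot be written exactly as a single degenerate codon, so this sketch returns every combination of their two codon families instead of one over-permissive sequence.

```python
from itertools import product

# Amino acid -> IUPAC degenerate codon(s), standard genetic code assumed.
# S, L and R need two degenerate codons each to stay exact.
DEGENERATE_CODONS = {
    "A": ("GCN",), "C": ("TGY",), "D": ("GAY",), "E": ("GAR",),
    "F": ("TTY",), "G": ("GGN",), "H": ("CAY",), "I": ("ATH",),
    "K": ("AAR",), "M": ("ATG",), "N": ("AAY",), "P": ("CCN",),
    "Q": ("CAR",), "T": ("ACN",), "V": ("GTN",), "W": ("TGG",),
    "Y": ("TAY",),
    "L": ("CTN", "TTR"),
    "R": ("CGN", "AGR"),
    "S": ("TCN", "AGY"),
    ".": ("NNN",),  # wildcard position -> any codon
}


def protein_to_degenerate_dna(pattern: str) -> list[str]:
    """Return all degenerate DNA sequences that encode the protein pattern.

    Hypothetical helper: accepts single-letter amino acids and '.' only.
    """
    choices = []
    for symbol in pattern.upper():
        if symbol not in DEGENERATE_CODONS:
            raise ValueError(f"unsupported symbol in protein pattern: {symbol!r}")
        choices.append(DEGENERATE_CODONS[symbol])
    # One output sequence per combination of codon families.
    return ["".join(codons) for codons in product(*choices)]


for seq in protein_to_degenerate_dna("E.SM.YSDN"):
    print(seq)
```

For E.SM.YSDN this gives four sequences, because the pattern contains two serines. Each could then be searched with whatever degenerate-base matching the tool offers; the number of sequences doubles with every additional S, L or R, which is one reason the regex form may be more practical for longer motifs.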

samuell, Oct 04 '23 13:10

That could be achieved, but wouldn't tblastn be simpler and faster?

shenwei356, Oct 09 '23 15:10

That could be achieved, but wouldn't tblastn be simpler and faster?

Perhaps! In my own quick attempt, it seemed that I needed to put my query sequence into a file before running it, but perhaps there is an easier way to do this.

I can explore this option a little more.

samuell, Oct 10 '23 15:10