[feature suggestion] Reverse translate protein search expression into nucleotide regex or degenerate base sequence
Prerequisites
- [x] make sure you are using the latest version by seqkit version
- [x] read the usage
Describe your issue
- [x] describe the problem
- [-] provide a reproducible example
I have a use case where I located a small "motif" in a protein sequence that I would like to find again in the nucleotide sequence coding for the protein.
The sequence I was looking for, expressed as a regex is the following, so let's use that as an example here (. is of course any letter, as per standard regex syntax):
E.SM.YSDN
I would now like to be able to run seqkit grep with this pattern against not only protein sequences, but also nucleotide ones.
By using a genetic code table I can do this by manually converting this sequence into a (DNA) nucleotide regex like this one (where [XY] are character classes allowing any of X and Y in one position):
GA[AG]...AG[CT]ATG...TA[CT]AG[CT]GA[CT]AA[CT]
Now, it would be useful to not need to do this translation manually, but rather be able to do something similar to:
seqkit grep --by-seq -r --protein-to-nucleotide -p "E.SM.YSDN" nucleotide_sequences.fa
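Until something like this exists, the manual conversion can be scripted. Below is a minimal Python sketch, not a proposal for how seqkit should implement it. It assumes the standard genetic code (NCBI translation table 1), supports only plain amino-acid letters and `.` in the pattern, and searches the forward strand only. The names `CODONS` and `protein_regex_to_nt` are hypothetical. Note that serine has six codons (TC. as well as AG[CT]), so the generated pattern for S is broader than the hand-written one above.

```python
import re
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), with codons in the order
# TTT, TTC, TTA, TTG, TCT, ... (first base varies slowest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid letter to the list of codons that encode it.
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(codon))


def protein_regex_to_nt(pattern):
    """Convert a simple protein pattern (letters and '.') to a DNA regex."""
    parts = []
    for ch in pattern.upper():
        if ch == ".":
            parts.append("...")  # any amino acid = any three bases
        elif ch in CODONS and ch != "*":
            # Non-capturing alternation over all codons for this amino acid.
            parts.append("(?:" + "|".join(CODONS[ch]) + ")")
        else:
            raise ValueError(f"unsupported pattern character: {ch!r}")
    return "".join(parts)


if __name__ == "__main__":
    nt_regex = protein_regex_to_nt("E.SM.YSDN")
    print(nt_regex)
    # Sanity check against a DNA string that encodes E-P-S-M-K-Y-S-D-N.
    print(bool(re.search(nt_regex, "GAGCCCTCAATGAAATATAGCGACAAT")))
```

The printed pattern could then be passed to `seqkit grep --by-seq -r -p`. Matching the reverse strand and lower-case sequences is left out of this sketch.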
Of course, a similar thing could be done using degenerate amino acids / bases too, if that is preferred over regular expressions.
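For the degenerate-base variant, here is a similar hedged Python sketch. It again assumes the standard genetic code and treats `.` or `X` as any amino acid. For every codon position it takes the union of bases over all codons of the amino acid and writes the matching IUPAC code. The names `IUPAC` and `protein_to_degenerate` are hypothetical. One caveat: for amino acids with six codons (S, L, R) a single degenerate codon over-approximates, e.g. serine becomes WSN, which also covers codons of other amino acids, so hits would need to be checked by translation.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(codon))

# IUPAC nucleotide codes, keyed by the sorted set of bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def protein_to_degenerate(protein):
    """Reverse translate a protein motif into a degenerate DNA sequence."""
    out = []
    for ch in protein.upper():
        if ch in (".", "X"):
            out.append("NNN")  # any amino acid
            continue
        if ch not in CODONS or ch == "*":
            raise ValueError(f"unsupported character: {ch!r}")
        for pos in range(3):
            # Union of bases seen at this codon position for this amino acid.
            bases = "".join(sorted({codon[pos] for codon in CODONS[ch]}))
            out.append(IUPAC[bases])
    return "".join(out)


if __name__ == "__main__":
    print(protein_to_degenerate("E.SM.YSDN"))
    # → GARNNNWSNATGNNNTAYWSNGAYAAY
```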
That could be achieved, but wouldn't tblastn be simpler and faster?
Perhaps! In my own quick try, it seemed that I need to put my query sequence into a file before running it, but there is perhaps some way to do this more easily.
I can explore this option a little more.