anvio
anvio copied to clipboard
[FEATURE REQUEST] Ambiguous DNA code in anvi'o
The need
I would like to search for palindromes using the ambiguous DNA code. Like that:
anvi-search-palindromes --dna-sequence WWAGTNNCCNTTNNNNNNNNAANGGNNACTWW
The solution
Here is the nomenclature for the ambiguous DNA code with the associated complement:
Meaning | IUPAC Code | Complement |
---|---|---|
A | A | T |
C | C | G |
G | G | C |
T | T/U | A |
A or C | M | K |
A or G | R | Y |
A or T | W | W |
C or G | S | S |
C or T | Y | R |
G or T | K | M |
A or C or G | V | B |
A or C or T | H | D |
A or G or T | D | H |
C or G or T | B | V |
G or A or T or C | N | N |
It could be nice to have these DNA bases somewhere like in contstants.py
if needed elsewhere in the codebase. But you are the coder, you know probably know what's best.
This is doable, but the only way to do it is to expand a sequence like WWAGTNNCCNTTNNNNNNNNAANGGNNACTWW
into a representation like [A,T][A,T]AGT[A,C,G,T][A,C,G,T]CC[A,C,G,T]TT[A,C,G,T][A,C,G,T][A,C,G,T][A,C,G,T][A,C,G,T][A,C,G,T][A,C,G,T][A,C,G,T]AA[A,C,G,T]GG[A,C,G,T][A,C,G,T]ACT[A,T][A,T]
(which is REALLY equivalent to WWAGTNNCCNTTNNNNNNNNAANGGNNACTWW
if you are wondering), and then test every single one of 4,294,967,296 sequences this array represents for the presence of palindromes :/
I'm not sure if it is feasible. If you were to remove the 'N's in the center (i.e., WWAGTNNCCNTTAANGGNNACTWW
) to look for perfect overlaps (which means the same thing with different start/stop positions in the original sequence), then the number of combinations to test will go from 4,294,967,296 to 65,536. Which is still not quite feasible.
So we can implement this, but it would be computationally intractable. Anvi'o can give an error if the number of sequences are more than 100 or something and require a --I-know-this-is-not-a-good-idea
flag or something.
WHAT SAY YOU?
I want to crash HPCs with my ideas, so let's implement --I-know-this-is-not-a-good-idea
for every command!
I see what you mean, but one could argue (me) that I don't care about a regex system in this particular case. If you look for a palindrome in a ambiguous sequence, I want the matches to be between complement only. ARYT
is good, but ARYW
is not. Straight forward if you think about ambiguous bases as extra nt, but I agree that the concept is a mess when you think about the actual potential possible DNA sequences.
SO I SAY... That we can drop the idea. FINE. I've found a better way to tackle my question. MEME, the motif finder tool, has an option to report only palindromic motifs. I'm mentioning it if someone has the same idea one day.