anvio icon indicating copy to clipboard operation
anvio copied to clipboard

[FEATURE REQUEST] Ambiguous DNA code in anvi'o

Open FlorianTrigodet opened this issue 2 years ago • 2 comments

The need

I would like to search for palindromes using the ambiguous DNA code. Like that:

anvi-search-palindromes --dna-sequence WWAGTNNCCNTTNNNNNNNNAANGGNNACTWW

The solution

Here is the nomenclature for the ambiguous DNA code with the associated complement:

Meaning IUPAC Code Complement
A A T
C C G
G G C
T T/U A
A or C M K
A or G R Y
A or T W W
C or G S S
C or T Y R
G or T K M
A or C or G V B
A or C or T H D
A or G or T D H
C or G or T B V
G or A or T or C N N

It could be nice to have these DNA bases somewhere like in contstants.py if needed elsewhere in the codebase. But you are the coder, you know probably know what's best.

FlorianTrigodet avatar Feb 06 '23 14:02 FlorianTrigodet

This is doable, but the only way to do it is to expand a sequence like WWAGTNNCCNTTNNNNNNNNAANGGNNACTWW into a representation like [A,T][A,T]AGT[A,C,G,T][A,C,G,T]CC[A,C,G,T]TT[A,C,G,T][A,C,G,T][A,C,G,T][A,C,G,T][A,C,G,T][A,C,G,T][A,C,G,T][A,C,G,T]AA[A,C,G,T]GG[A,C,G,T][A,C,G,T]ACT[A,T][A,T] (which is REALLY equivalent to WWAGTNNCCNTTNNNNNNNNAANGGNNACTWW if you are wondering), and then test every single one of 4,294,967,296 sequences this array represents for the presence of palindromes :/

I'm not sure if it is feasible. If you were to remove the 'N's in the center (i.e., WWAGTNNCCNTTAANGGNNACTWW) to look for perfect overlaps (which means the same thing with different start/stop positions in the original sequence), then the number of combinations to test will go from 4,294,967,296 to 65,536. Which is still not quite feasible.

So we can implement this, but it would be computationally intractable. Anvi'o can give an error if the number of sequences are more than 100 or something and require a --I-know-this-is-not-a-good-idea flag or something.

WHAT SAY YOU?

meren avatar Feb 07 '23 10:02 meren

I want to crash HPCs with my ideas, so let's implement --I-know-this-is-not-a-good-idea for every command!

I see what you mean, but one could argue (me) that I don't care about a regex system in this particular case. If you look for a palindrome in a ambiguous sequence, I want the matches to be between complement only. ARYT is good, but ARYW is not. Straight forward if you think about ambiguous bases as extra nt, but I agree that the concept is a mess when you think about the actual potential possible DNA sequences.

SO I SAY... That we can drop the idea. FINE. I've found a better way to tackle my question. MEME, the motif finder tool, has an option to report only palindromic motifs. I'm mentioning it if someone has the same idea one day.

FlorianTrigodet avatar Feb 07 '23 17:02 FlorianTrigodet