seqan3
Does FM-Index support ambiguous Amino Acids?
Platform
- SeqAn version: latest
- Operating system: NA
- Compiler: NA
Question
Imagine you want to search for peptide sequences (aka needles/patterns) in a protein (text) using the FM-Index.
The peptides will not contain ambiguous AAs, whereas the proteins might, where B = N | D, J = I | L, Z = E | Q, and X = any.
E.g. you want to find the peptide AN in both of the proteins AB and AN.
Is this possible by carefully choosing the ranks of the AAs in the alphabet (or in some other way)?
Inside the BWT you can only ever extend by exactly one character at a time.
What you can do, of course, is perform multiple searches, i.e. write your own backtracking that allows only certain errors. The interface for doing this is the fm_index_cursor in SeqAn3. You lose the search schemes, though.
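To make the backtracking idea concrete, here is a minimal, self-contained sketch. It does not use SeqAn3: `naive_cursor`, `extend_right`, `text_candidates`, `backtrack` and `find_ambiguous` are hypothetical names invented for this illustration, and the cursor is a plain list of candidate start positions rather than a BWT interval. With a real index, each call to the stand-in `extend_right` would presumably become a single-character extension on a `seqan3::fm_index_cursor`. For every residue of the peptide, the search tries the residue itself plus the ambiguity codes that cover it (B, J, Z, X as defined in the question).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Characters in the protein text that may stand for an unambiguous peptide
// residue: the residue itself plus the ambiguity codes covering it.
std::string text_candidates(char residue)
{
    std::string out(1, residue);
    if (residue == 'N' || residue == 'D') out += 'B';
    if (residue == 'I' || residue == 'L') out += 'J';
    if (residue == 'E' || residue == 'Q') out += 'Z';
    if (residue != 'X') out += 'X'; // X matches any residue
    return out;
}

// Hypothetical stand-in for an index cursor: the start positions in the text
// whose first `depth` characters match the choices made so far.
struct naive_cursor
{
    std::vector<std::size_t> starts;
    std::size_t depth{0};
};

// Extend the match by exactly one character, keeping only the starts that
// still match.
naive_cursor extend_right(std::string const & text, naive_cursor const & cur, char c)
{
    naive_cursor next;
    next.depth = cur.depth + 1;
    for (std::size_t s : cur.starts)
        if (s + cur.depth < text.size() && text[s + cur.depth] == c)
            next.starts.push_back(s);
    return next;
}

// Backtracking: at each pattern position, try every allowed text character
// and only descend into branches that still have occurrences.
void backtrack(std::string const & text, std::string const & pattern,
               naive_cursor const & cur, std::vector<std::size_t> & hits)
{
    if (cur.depth == pattern.size())
    {
        hits.insert(hits.end(), cur.starts.begin(), cur.starts.end());
        return;
    }
    for (char c : text_candidates(pattern[cur.depth]))
    {
        naive_cursor next = extend_right(text, cur, c);
        if (!next.starts.empty())
            backtrack(text, pattern, next, hits);
    }
}

// Returns the sorted start positions of all ambiguity-aware matches.
std::vector<std::size_t> find_ambiguous(std::string const & text, std::string const & pattern)
{
    naive_cursor root;
    for (std::size_t i = 0; i < text.size(); ++i)
        root.starts.push_back(i);

    std::vector<std::size_t> hits;
    backtrack(text, pattern, root, hits);
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

For example, `find_ambiguous("AB", "AN")` and `find_ambiguous("AN", "AN")` both report a hit at position 0, while `find_ambiguous("AD", "AN")` reports none. The number of branches grows with the number of candidates per residue, so pruning empty branches early, as done here, matters on a real index.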
Hi @cbielow,
When talking to @eseiler, we concluded that this is not possible with the current design of the FM index in seqan3. The FM index can only perform exact searches, so, as @h-2 pointed out, you could use the seqan3::fm_index_cursor to do that. A custom alphabet would not be possible without ambiguity, so it is not ideal either.
We will close this issue since we don't see a chance that it will be possible in the future either. Feel free to reopen the issue if you still have a question or need anything from us.