Does FM-Index support ambiguous Amino Acids?

Open cbielow opened this issue 4 years ago • 1 comments

Platform

  • SeqAn version: latest
  • Operating system: NA
  • Compiler: NA

Question

Imagine you want to search for peptide sequences (aka needles/patterns) in a protein (text) using the FM-Index.

The peptides will not contain ambiguous AAs, whereas the proteins might, where B = N | D, J = I | L, Z = E | Q, and X = any residue.

E.g. you want to find peptide AN in both the proteins AB and AN.
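To pin down the matching rule being asked about, here is a minimal sketch in plain C++ (it does not use SeqAn3 or an FM index). The function names `symbol_covers` and `peptide_matches_at` are made up for this illustration; the ambiguity codes are the ones listed above.

```cpp
#include <cstddef>
#include <string>

// True if the (possibly ambiguous) protein symbol may stand for the concrete residue.
bool symbol_covers(char protein_symbol, char residue)
{
    switch (protein_symbol)
    {
        case 'B': return residue == 'N' || residue == 'D';
        case 'J': return residue == 'I' || residue == 'L';
        case 'Z': return residue == 'E' || residue == 'Q';
        case 'X': return true;                       // X = any residue
        default:  return protein_symbol == residue;  // unambiguous symbol
    }
}

// True if the peptide matches the protein at position pos under the rule above.
bool peptide_matches_at(std::string const & protein, std::size_t pos, std::string const & peptide)
{
    if (pos > protein.size() || peptide.size() > protein.size() - pos)
        return false;

    for (std::size_t i = 0; i < peptide.size(); ++i)
        if (!symbol_covers(protein[pos + i], peptide[i]))
            return false;

    return true;
}
```

With this rule, the peptide AN matches both AB and AN at position 0, but not AD.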

Is this possible by choosing an alphabet in which the ranks of the AAs are carefully assigned (or in some other way)?

cbielow avatar Feb 23 '21 13:02 cbielow

Inside the BWT, you can only ever extend the current match by exactly one character at a time.

What you can do, of course, is perform multiple searches, i.e. write your own backtracking that allows only certain errors (here, only the substitutions that an ambiguity code permits). The interface for doing this in SeqAn3 is the fm_index_cursor. You lose the search schemes, though.
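To illustrate the idea, here is a rough sketch of such a backtracking search. It is an assumption-laden toy, not SeqAn3 code: it uses a naive hand-written FM index in plain C++ instead of seqan3::fm_index_cursor, and the names `tiny_fm_index`, `text_symbols_for`, `extend` and `search_ambiguous` are invented for this example. At each backward-extension step it tries the concrete residue and every ambiguity code that covers it, and abandons a branch as soon as the suffix-array interval becomes empty.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Text symbols that may stand for a concrete peptide residue (B = N|D, J = I|L, Z = E|Q, X = any).
std::string text_symbols_for(char residue)
{
    std::string symbols(1, residue);
    if (residue == 'N' || residue == 'D') symbols += 'B';
    if (residue == 'I' || residue == 'L') symbols += 'J';
    if (residue == 'E' || residue == 'Q') symbols += 'Z';
    return symbols + 'X';
}

// Naive FM index: suffix array by sorting, rank queries by linear counting.
struct tiny_fm_index
{
    std::vector<std::size_t> sa; // suffix array of text + '$'
    std::string bwt;             // Burrows-Wheeler transform of text + '$'

    explicit tiny_fm_index(std::string text)
    {
        text += '$'; // sentinel, smaller than every upper-case letter in ASCII
        sa.resize(text.size());
        std::iota(sa.begin(), sa.end(), std::size_t{0});
        std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b)
                  { return text.compare(a, std::string::npos, text, b, std::string::npos) < 0; });
        for (std::size_t i : sa)
            bwt += text[(i + text.size() - 1) % text.size()];
    }

    // Number of text symbols that are lexicographically smaller than c.
    std::size_t smaller(char c) const
    {
        return static_cast<std::size_t>(
            std::count_if(bwt.begin(), bwt.end(), [c](char x) { return x < c; }));
    }

    // Number of occurrences of c in bwt[0, i).
    std::size_t occ(char c, std::size_t i) const
    {
        return static_cast<std::size_t>(
            std::count(bwt.begin(), bwt.begin() + static_cast<std::ptrdiff_t>(i), c));
    }
};

// Extends the match to the left by one pattern character, once per admissible text symbol.
void extend(tiny_fm_index const & index, std::string const & pattern, std::size_t remaining,
            std::size_t lo, std::size_t hi, std::vector<std::size_t> & hits)
{
    if (lo >= hi)
        return; // empty interval: this branch has no occurrences

    if (remaining == 0) // whole pattern consumed: report all positions in [lo, hi)
    {
        hits.insert(hits.end(), index.sa.begin() + static_cast<std::ptrdiff_t>(lo),
                    index.sa.begin() + static_cast<std::ptrdiff_t>(hi));
        return;
    }

    for (char c : text_symbols_for(pattern[remaining - 1]))
    {
        std::size_t const base = index.smaller(c);
        extend(index, pattern, remaining - 1, base + index.occ(c, lo), base + index.occ(c, hi), hits);
    }
}

// Returns the sorted start positions in text at which the unambiguous pattern matches.
std::vector<std::size_t> search_ambiguous(std::string const & text, std::string const & pattern)
{
    assert(!pattern.empty());
    tiny_fm_index const index{text};
    std::vector<std::size_t> hits;
    extend(index, pattern, pattern.size(), 0, index.bwt.size(), hits);
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

For example, `search_ambiguous("ABKANKAD", "AN")` reports positions 0 (AB) and 3 (AN) but not 6 (AD). The cost is that every pattern character branches into two or three extensions, so the number of explored branches can grow quickly with the pattern length, although most branches die early once their interval is empty.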

h-2 avatar Mar 01 '21 10:03 h-2

Hi @cbielow,

After talking to @eseiler, we concluded that this is not possible with the current design of the FM index in seqan3. The FM index can only perform exact searches, so, as @h-2 pointed out, you could use the seqan3::fm_index_cursor to implement the backtracking yourself. A custom alphabet could not encode these relationships unambiguously, so it is not a good option either.

We will close this issue since we don't see a chance that it will be possible in the future either. Feel free to reopen the issue if you still have a question or need anything from us.

smehringer avatar Oct 20 '22 08:10 smehringer