seqan3 icon indicating copy to clipboard operation
seqan3 copied to clipboard

How do I use the FM-index for string graph construction?

Open d-cameron opened this issue 2 years ago • 3 comments

Question

I've been poring over the documentation and I'm still can't work out if I can use the FM-index API for string graph construction (in a similar manner to how SGA uses its FM-index). Is this possible?

fm_index_cursor looks very close to what I'm after but can't seem to work out how to: a) limit the results to a string suffix/prefix search (i.e. how do I match the $ string terminator sentinel) b) extract the (set of) reference_ids from the cursor

search() is also close, but I can't seem to: a) Force suffix/prefix matching b) exclude contained results

Is there a simple API example somewhere that I'm missing?

d-cameron avatar Apr 20 '22 05:04 d-cameron

Hi,

The current implementation of the (Bi-)FM-Index focuses on approximate string search. With the fm_index_cursor, it is theoretically possible to implement a suffix/prefix-based search. But it is not straightforward, and the API is very likely to change in the future (which would make such searches possible).

About the fm_index_cursor: a) Enforcing prefix/suffix matches i currently not possible. The reason is that $ is such a special character that it was not expected to be inside any search. b) Alternatively, the following should work: Every cursor has a locate function which returns a vector of pairs (similar to std::vector<std::pair<int, int>>). The first value is the reference/sequence ID and the second value the position inside the reference/sequence.

About the search(): a) It is not possible to force suffix/prefix matching. This would be a nice feature! b) The alternative approach that should work for the cursor is not possible and currently unlikely to be added. I don't know how to use this inside the search for optimization, so it has to be done in a secondary step.

May I ask what kind of data you are using? (DNA?). And do you know roughly how many sequences you are planning to put into your FM-Index?

SGSSGene avatar Apr 25 '22 07:04 SGSSGene

The reason is that $ is such a special character that it was not expected to be inside any search.

I tried to see if there was an alphabet option for it but I couldn't find one.

May I ask what kind of data you are using? (DNA?).

Short read sequencing data. I was hoping to use an off-the-shelf suffix tree to perform high-speed overlap for OLC assembly. Finding all exact overlaps can be reduced to a traversal of the tree thus using fm_index_cursor looked very promising.

And do you know roughly how many sequences you are planning to put into your FM-Index?

Usually less than 1k reads, not more than 100k (I can make multiple 64k indexes if 2^16 is a meaningful limit).

Every cursor has a locate function which returns a vector of pairs

All the sequences are expected to be quite similar so this traversal could be expensive. Are the records reported by locate ordered in any way? For example, if they're sorted by position then I turn it into a suffix overlap query by calling lazy_locate() and stopping when position >0

d-cameron avatar Apr 29 '22 01:04 d-cameron

Custom Alphabet

One way you can achieve your goals would be to implement your own alphabet, with an extra delimiter. You can follow this cookbook entry on how to write a custom alphabet: https://docs.seqan.de/seqan/3-master-user/cookbook.html#cookbook_custom_dna4_alphabet

If you have a custom alphabet ACGT$ you have to build your Index over your reads with prepend/appended your reads with $. With this, you can use the normal search function. Just make sure you prepend/append all your queries with your new $ symbol: https://docs.seqan.de/seqan/3-master-user/tutorial_index_search.html This should work well for prefix/suffix searches without errors.

Locate

The locate step should return the result in alphabetical order. (Maybe alphabetical order of the reversed words, not sure about this detail). How expensive locate is depends on the sampling rate of the SuffixArray inside the FM-Index. By default, it is set to 16. This is a template parameter. If you want to set a different sampling rate, you must define the template parameters of the FM-Index manually:

// default configuration
 using IndexSpec = sdsl::csa_wt<sdsl::wt_blcd<sdsl::bit_vector,
                                               sdsl::rank_support_v<>,
                                               sdsl::select_support_scan<>,
                                               sdsl::select_support_scan<0> >,
                                 16, // sampling rate change this to a smaller/larger number
                                 10000000,
                                 sdsl::sa_order_sa_sampling<>,
                                 sdsl::isa_sampling<>,
                                 sdsl::plain_byte_alphabet>;

// FM-Index requires explicit template parameters to be able to specify the sampling rate
seqan3::fm_index<seqan3::dna4, seqan3::text_layout::collection, IndexSpec> myIndex{...content....};

SGSSGene avatar May 02 '22 10:05 SGSSGene

Feel free to reopen or open another issue, if this doesn't seem to be resolved.

SGSSGene avatar Oct 20 '22 09:10 SGSSGene