seqan3
seqan3 copied to clipboard
How do I use the FM-index for string graph construction?
Question
I've been poring over the documentation and I'm still can't work out if I can use the FM-index API for string graph construction (in a similar manner to how SGA uses its FM-index). Is this possible?
fm_index_cursor
looks very close to what I'm after but can't seem to work out how to:
a) limit the results to a string suffix/prefix search (i.e. how do I match the $
string terminator sentinel)
b) extract the (set of) reference_id
s from the cursor
search()
is also close, but I can't seem to:
a) Force suffix/prefix matching
b) exclude contained results
Is there a simple API example somewhere that I'm missing?
Hi,
The current implementation of the (Bi-)FM-Index focuses on approximate string search.
With the fm_index_cursor
, it is theoretically possible to implement a suffix/prefix-based search. But it is not straightforward, and the API is very likely to change in the future (which would make such searches possible).
About the fm_index_cursor
:
a) Enforcing prefix/suffix matches i currently not possible. The reason is that $
is such a special character that it was not expected to be inside any search.
b) Alternatively, the following should work: Every cursor has a locate
function which returns a vector of pairs (similar to std::vector<std::pair<int, int>>
). The first value is the reference/sequence ID and the second value the position inside the reference/sequence.
About the search()
:
a) It is not possible to force suffix/prefix matching. This would be a nice feature!
b) The alternative approach that should work for the cursor is not possible and currently unlikely to be added. I don't know how to use this inside the search for optimization, so it has to be done in a secondary step.
May I ask what kind of data you are using? (DNA?). And do you know roughly how many sequences you are planning to put into your FM-Index?
The reason is that $ is such a special character that it was not expected to be inside any search.
I tried to see if there was an alphabet option for it but I couldn't find one.
May I ask what kind of data you are using? (DNA?).
Short read sequencing data. I was hoping to use an off-the-shelf suffix tree to perform high-speed overlap for OLC assembly. Finding all exact overlaps can be reduced to a traversal of the tree thus using fm_index_cursor
looked very promising.
And do you know roughly how many sequences you are planning to put into your FM-Index?
Usually less than 1k reads, not more than 100k (I can make multiple 64k indexes if 2^16 is a meaningful limit).
Every cursor has a locate function which returns a vector of pairs
All the sequences are expected to be quite similar so this traversal could be expensive. Are the records reported by locate
ordered in any way? For example, if they're sorted by position then I turn it into a suffix overlap query by calling lazy_locate()
and stopping when position >0
Custom Alphabet
One way you can achieve your goals would be to implement your own alphabet, with an extra delimiter. You can follow this cookbook entry on how to write a custom alphabet: https://docs.seqan.de/seqan/3-master-user/cookbook.html#cookbook_custom_dna4_alphabet
If you have a custom alphabet ACGT$
you have to build your Index over your reads with prepend/appended your reads with $
. With this, you can use the normal search
function. Just make sure you prepend/append all your queries with your new $
symbol: https://docs.seqan.de/seqan/3-master-user/tutorial_index_search.html
This should work well for prefix/suffix searches without errors.
Locate
The locate
step should return the result in alphabetical order. (Maybe alphabetical order of the reversed words, not sure about this detail).
How expensive locate
is depends on the sampling rate of the SuffixArray inside the FM-Index. By default, it is set to 16
. This is a template parameter. If you want to set a different sampling rate, you must define the template parameters of the FM-Index manually:
// default configuration
using IndexSpec = sdsl::csa_wt<sdsl::wt_blcd<sdsl::bit_vector,
sdsl::rank_support_v<>,
sdsl::select_support_scan<>,
sdsl::select_support_scan<0> >,
16, // sampling rate change this to a smaller/larger number
10000000,
sdsl::sa_order_sa_sampling<>,
sdsl::isa_sampling<>,
sdsl::plain_byte_alphabet>;
// FM-Index requires explicit template parameters to be able to specify the sampling rate
seqan3::fm_index<seqan3::dna4, seqan3::text_layout::collection, IndexSpec> myIndex{...content....};
Feel free to reopen or open another issue, if this doesn't seem to be resolved.