[Search] Query the text size
For calculating e-values or other purposes it is often necessary to query the text size of a (bi) FM index.
I have experimented with the size() function of seqan3::bi_fm_index<dna4, text_layout::collection> in order to calculate it myself. According to the documentation, the value of size() includes sentinels. Assume that I have stored somewhere a list of sequence names, so I know the value nseq = number of indexed sequences. Then for nseq > 1, I can compute the text size nchar = index.size() - nseq. For nseq == 1 we have a special case with nchar = index.size() - 2 (because a single sequence has 2 sentinels).
I suggest providing a function get_text_size() for the index that performs these calculations. One issue is that we have to keep track of the number of sequences stored in the index (which I could solve with the length of the names list).
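To make the proposal concrete, here is a minimal sketch of the calculation as a free function. It only encodes the sentinel counts I observed above (2 sentinels for a single sequence, one per sequence otherwise); the function is hypothetical, not part of SeqAn3, and takes the value of `index.size()` as a plain integer so it does not depend on the index type.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not an existing SeqAn3 function.
// index_size: the value returned by index.size() (includes sentinels).
// nseq:       the number of indexed sequences, tracked by the caller.
std::size_t get_text_size(std::size_t index_size, std::size_t nseq)
{
    assert(nseq > 0);
    // Observed behaviour: a single sequence carries 2 sentinels,
    // a collection of nseq > 1 sequences carries one sentinel each.
    std::size_t const sentinels = (nseq == 1) ? 2 : nseq;
    assert(index_size >= sentinels);
    return index_size - sentinels;
}
```

With this, the caller would write something like `get_text_size(index.size(), names.size())`, where `names` is the externally stored list of sequence names.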
Some ideas, without looking much at the code:
- [number of texts] We store the text begin positions (`text_begin`), and we also have select and rank support for this vector. The number of texts would then be `rank(text_begin, size()) // +1 ??`. This should take constant time.
- [number of texts] We know the number of texts during construction and could just store another `size_t`. Should be faster than rank, but will change the index serialisation.
- [text size] Either store it, or have a function that does the `nseq == 1`/`nseq == 2` check.
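As a rough illustration of the rank idea, the sketch below counts the texts from a begin-position bit vector. It assumes that `text_begin` has one set bit at the start of every text in the concatenation and that every text is followed by exactly one delimiter; I have not checked this against the actual index layout. `make_text_begin`, `naive_rank` and `number_of_texts` are made-up names, and the rank here is a linear scan standing in for a constant-time rank support structure.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds the begin-position bit vector for text_0 $ text_1 $ ... text_n $
// (assumed layout: one delimiter after every text).
std::vector<bool> make_text_begin(std::vector<std::string> const & texts)
{
    std::size_t total = 0;
    for (auto const & t : texts)
        total += t.size() + 1; // +1 for the delimiter

    std::vector<bool> text_begin(total, false);
    std::size_t pos = 0;
    for (auto const & t : texts)
    {
        text_begin[pos] = true; // first position of this text
        pos += t.size() + 1;
    }
    return text_begin;
}

// Naive rank: number of set bits in the half-open range [0, i).
std::size_t naive_rank(std::vector<bool> const & bits, std::size_t i)
{
    assert(i <= bits.size());
    std::size_t count = 0;
    for (std::size_t k = 0; k < i; ++k)
        if (bits[k])
            ++count;
    return count;
}

// Rank over the whole vector counts every text begin exactly once.
std::size_t number_of_texts(std::vector<bool> const & text_begin)
{
    return naive_rank(text_begin, text_begin.size());
}
```

Under these assumptions, a rank that counts the set bits in `[0, size())` already gives the number of texts without an extra `+ 1`; whether that holds for the real index depends on how its rank support defines the range.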
Question: Do we also need the sizes of individual texts in the collection?
- With `text_begin` as well as rank/select, we could determine the text size (`text_size(x) == select(x + 1, text_begin) - select(x, text_begin); // probably off by one`). This should also take constant time.
- We know the text sizes and number of texts during construction. So, we could store the text lengths in a vector. Might get "quite" big for big collections, and also changes the index serialisation.
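The select variant could look roughly like the sketch below. It assumes the same layout as before (one set bit per text start, one delimiter after every text) and a 1-based select, which is how I understand SDSL's select support to count. `naive_select` and `text_size` are hypothetical names, the select is a linear scan rather than a constant-time structure, and returning the vector length for a non-existent bit is a convention chosen only so that the last text needs no special case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive select: position of the k-th set bit, with k starting at 1.
// Returns bits.size() if fewer than k bits are set (sketch convention).
std::size_t naive_select(std::vector<bool> const & bits, std::size_t k)
{
    assert(k >= 1);
    std::size_t seen = 0;
    for (std::size_t pos = 0; pos < bits.size(); ++pos)
        if (bits[pos] && ++seen == k)
            return pos;
    return bits.size();
}

// Length of text x (0-based), excluding its trailing delimiter.
std::size_t text_size(std::vector<bool> const & text_begin, std::size_t x)
{
    std::size_t const begin = naive_select(text_begin, x + 1);
    std::size_t const next  = naive_select(text_begin, x + 2);
    assert(begin < text_begin.size()); // text x must exist
    return next - begin - 1;           // -1 removes the delimiter
}
```

The `- 1` is the off-by-one mentioned above: the distance between two consecutive begin positions includes the delimiter that separates the texts.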
(cc @SGSSGene)