
[Search] Query the text size

Open joergi-w opened this issue 3 years ago • 1 comment

For calculating e-values and for other purposes, it is often necessary to query the text size of a (bidirectional) FM index.

I have experimented with the size() function of seqan3::bi_fm_index<dna4, text_layout::collection> in order to calculate the text size myself. According to the documentation, the value of size() includes sentinels. Assume that I have stored a list of sequence names somewhere, so I know nseq, the number of indexed sequences. For nseq > 1, I can then compute the text size as nchar = index.size() - nseq. The case nseq == 1 is special: here nchar = index.size() - 2, because a single sequence has 2 sentinels.
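The calculation above can be sketched as a small free function. This is only an illustration under the sentinel counts stated in this issue (one sentinel per sequence for nseq > 1, two sentinels for nseq == 1); the name text_size_from_index is hypothetical and not part of seqan3, and the function takes the value of index.size() as a plain number rather than an index object.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of seqan3.
// index_size: the value returned by index.size(), which includes sentinels.
// nseq:       the number of indexed sequences, tracked outside the index.
// Assumption (taken from this issue): one sentinel per sequence if nseq > 1,
// two sentinels if nseq == 1.
inline std::size_t text_size_from_index(std::size_t index_size, std::size_t nseq)
{
    assert(nseq > 0);
    std::size_t const sentinels = (nseq == 1) ? 2 : nseq;
    assert(index_size >= sentinels);
    return index_size - sentinels;
}
```

With a real index, a call would look like text_size_from_index(index.size(), names.size()), where names is the separately stored list of sequence names.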

I suggest providing a function get_text_size() for the index that performs these calculations. One issue is that the index would have to keep track of the number of sequences it stores (which I currently solve with the length of the names list).

joergi-w, Jan 26 '22 13:01

Some ideas, without looking much at the code:

  • [number of texts] We store the text begin positions (text_begin) and also have select and rank support for this vector. The number of texts would then be rank(text_begin, size()) // +1 ??. This should take constant time.
  • [number of texts] We know the number of texts during construction and could just store another size_t. This should be faster than rank, but will change the index serialisation.
  • [text size] Either store it, or have a function that does the nseq == 1 / nseq > 1 check.
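To make the rank idea concrete, here is a minimal sketch that uses a plain std::vector<bool> as a stand-in for text_begin and a linear-time loop as a stand-in for rank support. Both stand-ins and the function names are assumptions for illustration only; the real index would use its own rank structure. With a rank that counts set bits in the half-open range [0, pos), no +1 is needed; a different rank convention may need one.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for rank support: number of set bits in [0, pos).
// Linear time here; a real rank structure would answer this directly.
inline std::size_t rank1(std::vector<bool> const & bits, std::size_t pos)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < pos && i < bits.size(); ++i)
        count += bits[i] ? 1 : 0;
    return count;
}

// Hypothetical helper: every text contributes exactly one set bit
// (at its begin position), so the total rank is the number of texts.
inline std::size_t number_of_texts(std::vector<bool> const & text_begin)
{
    return rank1(text_begin, text_begin.size());
}
```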

Question: Do we also need the sizes of individual texts in the collection?

  • With text_begin as well as rank/select, we could determine the text size (text_size(x) == select(x + 1, text_begin) - select(x, text_begin); // probably off by one). This should also take constant time.
  • We know the text sizes and the number of texts during construction, so we could store the text lengths in a vector (text_lengths). This might get "quite" big for big collections, and it also changes the index serialisation.
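The select-based variant could look roughly like the sketch below. It again uses a std::vector<bool> stand-in for text_begin and a linear scan instead of real select support, and it assumes a hypothetical layout in which each text is followed by exactly one sentinel in the concatenated text; the actual layout in the index may differ, which is where the off-by-one question comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for select support: position of the k-th set bit (k is 1-based).
// Returns bits.size() if there are fewer than k set bits.
inline std::size_t select1(std::vector<bool> const & bits, std::size_t k)
{
    assert(k > 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i] && --k == 0)
            return i;
    return bits.size();
}

// Hypothetical helper: length of text x (0-based), excluding its sentinel.
// Assumes each text is followed by exactly one sentinel.
inline std::size_t text_length(std::vector<bool> const & text_begin, std::size_t x)
{
    std::size_t const begin = select1(text_begin, x + 1);
    std::size_t const next = select1(text_begin, x + 2); // end of vector for the last text
    assert(begin < next);
    return next - begin - 1; // subtract the sentinel
}
```

For three texts of lengths 3, 2 and 1 laid out as ACG$TT$A$, the begin positions are 0, 4 and 7, and text_length returns 3, 2 and 1 for x = 0, 1 and 2.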

(cc @SGSSGene)

eseiler, Jan 26 '22 14:01