tantivy
SSTableIndex could be a multi layer SSTable
Right now we use the SSTable format to store block "checkpoints" in a Vec and do binary search on those.
An alternative could be to store several layers of SSTable:
- Layer 1: like the current SSTableIndex, a checkpoint every N / B docs.
- Layer 2: a checkpoint for every B entries of layer 1, i.e. every N / B^2 docs.
- Layer 3: a checkpoint for every B entries of layer 2, i.e. every N / B^3 docs.
- ...
Thanks to the geometric series, the overhead of that stack of layers is minor: the extra layers add N / B^2 + N / B^3 + ... entries on top of the N / B already in layer 1, i.e. less than 7% extra for B = 16 for instance. We would have transformed the binary search into a coarse-to-fine linear search with strong locality.
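To make the idea concrete, here is a minimal sketch of the layered structure and the coarse-to-fine lookup. All names (`build_layers`, `locate`, checkpoints as plain `u64`s) are illustrative, not tantivy's actual types or API; a real implementation would work over serialized SSTable blocks.

```rust
// Hypothetical sketch of coarse-to-fine lookup over layered checkpoints.
// `layers[0]` is the finest layer (one checkpoint per block, as in the
// current SSTableIndex); each layer above keeps every B-th checkpoint
// of the layer below it.

const B: usize = 16;

/// Build the coarser layers on top of the finest checkpoint list.
fn build_layers(finest: Vec<u64>) -> Vec<Vec<u64>> {
    let mut layers = vec![finest];
    while layers.last().unwrap().len() > B {
        let coarser: Vec<u64> = layers
            .last()
            .unwrap()
            .iter()
            .step_by(B)
            .copied()
            .collect();
        layers.push(coarser);
    }
    layers
}

/// Find the index (in the finest layer) of the last checkpoint <= `key`,
/// scanning at most ~B entries per layer instead of binary-searching
/// the whole finest layer.
fn locate(layers: &[Vec<u64>], key: u64) -> usize {
    let mut idx = 0; // index into the current layer
    // Walk from the coarsest layer down to the finest.
    for layer in layers.iter().rev() {
        // One step in the coarser layer covers B entries in this one.
        idx *= B;
        let end = (idx + B).min(layer.len());
        while idx + 1 < end && layer[idx + 1] <= key {
            idx += 1;
        }
    }
    idx
}
```

Each `locate` call touches at most B entries per layer, all adjacent in memory, which is where the locality argument comes from.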
The main benefit would be to make opening a term dictionary very cheap, regardless of the number of blocks.
@trinity-1686a @PSeitz let me know if this makes sense. To know whether this is useful, we need to measure how expensive it is to open an sstable with 10 million terms today.
To accept such a change, we will also need a benchmark on ord_to_term (if one does not exist already).
Assigned to @trinity-1686a for the moment, but we don't even know if we want this.
In the flamegraph PSeitz linked here https://github.com/quickwit-oss/tantivy/pull/1946#issuecomment-1476377510, Dictionary::open is very fast (0.42% of samples), whereas Dictionary::ord_to_term accounts for 9.94% of samples. A naive approach could make ord_to_term N times slower (with N the number of layers).
On some other workloads (a simple TermQuery, for instance), Dictionary::open is probably most of the cost, and that kind of change could then make sense (or maybe it doesn't).
When querying the index, the blocks being opened should probably be cached to amortize that cost on workloads that query the index a lot (the columnar case).
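A minimal sketch of what such a cache could look like, assuming a hypothetical `BlockCache` keyed by block id with a caller-supplied loader; none of this is tantivy's actual API, and a real version would need an eviction policy:

```rust
use std::collections::HashMap;

/// Illustrative cache of decoded index blocks. Block ids and the
/// `load` callback are placeholders for however the real index
/// addresses and decodes its blocks.
struct BlockCache {
    blocks: HashMap<usize, Vec<u8>>, // block id -> decoded block bytes
}

impl BlockCache {
    fn new() -> Self {
        BlockCache {
            blocks: HashMap::new(),
        }
    }

    /// Return the decoded block, loading (and caching) it on first
    /// access, so repeated queries against the same segment amortize
    /// the decode cost.
    fn get_or_load(&mut self, block_id: usize, load: impl FnOnce() -> Vec<u8>) -> &Vec<u8> {
        self.blocks.entry(block_id).or_insert_with(load)
    }
}
```

An unbounded `HashMap` is obviously only for illustration; in practice this would be an LRU or similar bounded cache.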
Unassigning myself for now: while there is possibly something to gain for large segments, I believe there are probably more impactful changes to explore.