lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Move index of numbers from BKD into DocValuesSkipper?

Open gf2121 opened this issue 6 months ago • 2 comments

Description

Immature idea, for discussion purpose :)

Today our range query on numbers have two approaches: BKD or DocValue Skipper.

The drawback of BKD:

  • Allocating too much memory (big bitset)
  • can not respect small lead clause lazily (now eased by IndexOrDocValue)
  • can not respect intra-segment concurrency.
  • doc ids are usually stored 3 bytes.

The drawback of DocValue Skipper:

  • Slow if we can not skip on one block.

I wonder if we can have a doc id list stored in value order in DocValue Skipper to replace BKD totally. These docs should be split into smaller blocks (128 or 512) and can be vectorized decoded. This idea sounds really like we have a small BKD in each block.

For example:

Today we have:
doc 1, 2, 3, 5, 6
value 3, 5, 2, 4, 1

we store another doc list 6, 3, 1, 5, 2
whose values are in order.

Advantages:

  1. solve drawbacks of BKDs.
  2. ease slow problem of DocValue Skipper

Disadvantages:

  1. Comparing to BKD: this can not gather all docs needed on segment level. If docs are evenly distributed, there may be just a few docs required in a block, then we can not append whole doc block without caring values (addAll). Maybe we should not build this on level0 block but higher, like 65536-docs block?

gf2121 avatar Jun 26 '25 08:06 gf2121

Reading this issue made me wonder if we could make points indexes better by indexing them by doc ID, ie. treating doc IDs like any other dimension. Then there should be a way to not construct a FixedBitSet(maxDoc) up-front but to walk the BKD tree as the DocIdSetIterator advances?

jpountz avatar Jun 28 '25 20:06 jpountz

I have tried a solution https://github.com/apache/lucene/pull/15446 with DocIdSet caching to work with intra-segment

Reading this issue made me wonder if we could make points indexes better by indexing them by doc ID, ie. treating doc IDs like any other dimension. Then there should be a way to not construct a FixedBitSet(maxDoc) up-front but to walk the BKD tree as the DocIdSetIterator advances?

This would be a cleaner solution for intra-segment where each partition only visits subtrees containing its doc range.

prudhvigodithi avatar Dec 03 '25 21:12 prudhvigodithi