tantivy icon indicating copy to clipboard operation
tantivy copied to clipboard

perf(columnar): Use PGM-index for O(1) position→docid lookups in multi-valued columns

Open rpunkfu opened this issue 4 months ago • 2 comments

Problem

The existing implementation uses a linear scan over start_index_column to find which document owns each value position. For sparse queries (small value ranges across many documents), this becomes a bottleneck.

Solution

• Build a PGM-index lazily over start_index_column (monotonically increasing offsets)
• Use PGM's upper_bound for O(1) predecessor lookups instead of linear scan
• Only activates for columns with ≥1024 documents (smaller columns use original path)

Benchmark Results

                            WITH PGM    WITHOUT PGM    Improvement
    Sparse range query      0.33ms      0.42ms         28% faster
    Full scan               0.35ms      0.35ms         Same

Testing

All tests still pass, plus additional benches I've added.

rpunkfu avatar Nov 29 '25 03:11 rpunkfu

I don't think the benchmark make sense for our use case.

I assume this PR is not motivated by a query you are trying to accelerate, but rather, you want to use pgm somewhere. Is this correct?

There are many places where it can be used tbh. Can you add a bench highlighting the cost of construction?

fulmicoton avatar Dec 01 '25 09:12 fulmicoton

Hi, that's correct -- I've written PGM-Index implementation and wanted to find good fit for it, this seemed like a good one for searching across multiple documents. I'll extend my benchmark results with construction later today.

rpunkfu avatar Dec 01 '25 14:12 rpunkfu