perf(columnar): Use PGM-index for O(1) position→docid lookups in multi-valued columns
Problem
The existing implementation uses a linear scan over start_index_column to find which document owns each value position. For sparse queries (small value ranges across many documents), this becomes a bottleneck.
Solution
• Build a PGM-index lazily over start_index_column (monotonically increasing offsets)
• Use PGM's upper_bound for O(1) predecessor lookups instead of linear scan
• Only activates for columns with ≥1024 documents (smaller columns use original path)
Benchmark Results
WITH PGM WITHOUT PGM Improvement
Sparse range query 0.33ms 0.42ms 28% faster
Full scan 0.35ms 0.35ms Same
Testing
All tests still pass, plus additional benches I've added.
I don't think the benchmark make sense for our use case.
I assume this PR is not motivated by a query you are trying to accelerate, but rather, you want to use pgm somewhere. Is this correct?
There are many places where it can be used tbh. Can you add a bench highlighting the cost of construction?
Hi, that's correct -- I've written PGM-Index implementation and wanted to find good fit for it, this seemed like a good one for searching across multiple documents. I'll extend my benchmark results with construction later today.