lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Efficient iteration over deleted doc values

Open jainankitk opened this issue 3 months ago • 5 comments

Description

As part of https://github.com/apache/lucene/pull/14439, we introduced efficient histogram collection using PointTrees. Unfortunately, the optimization falls apart even with single deleted document in a segment. I am wondering if there is a way to efficiently iterate over deleted doc values and correct the values for each bucket.

I had thought about it a while back (and kind of forgot) and @mikemccand brought it up again during my talk on Efficient Histogram Collection in Lucene with BulkCollector at recently concluded Apache Community over Code

jainankitk avatar Sep 23 '25 22:09 jainankitk

Hey @jainankitk -- I brought this up on the dev list last year here: https://lists.apache.org/thread/jxczhgn5loqwn10xrb30k2hg0jrbovcs

As Adrien called out in that thread, you can use nextClearBit() to step through the deleted docs, but as I called out, that's O(maxDoc). I had the idea of storing an array of doc IDs, but that would require a binary search to do random lookups. Uwe had a nice idea of using a SparseFixedBitSet if the number of deletes is small, which I think would give us the best of both worlds (O(1) random access and O(deletedDocs) traversal).

msfroh avatar Oct 15 '25 18:10 msfroh

Hi @jainankitk @msfroh,

I noticed this issue is related to #13084, which proposes using SparseFixedBitSet for sparse deletes. That approach would enable O(deletedDocs) iteration which should help with the histogram performance issue described here.

If no one is actively working on that, I'd be interested in implementing the solution from #13084. This issue could serve as a good validation case for that work. Let me know if you have any concerns or were planning to tackle this!

salvatorecampagna avatar Nov 06 '25 11:11 salvatorecampagna

If no one is actively working on that, I'd be interested in implementing the solution from https://github.com/apache/lucene/issues/13084. This issue could serve as a good validation case for that work. Let me know if you have any concerns or were planning to tackle this!

That sounds great, thank you @salvatorecampagna!

msfroh avatar Nov 06 '25 21:11 msfroh

Thanks @salvatorecampagna for expressing interest. Let me review the proposal - https://github.com/apache/lucene/issues/13084#issuecomment-3496411381, and provide feedback on that issue itself.

We can keep this issue for tracking the work validation by directly using the deletedDocsIterator() API.

jainankitk avatar Nov 07 '25 01:11 jainankitk

FYI: PR #15413 adds efficient deleted document iteration that might help with this issue.

salvatorecampagna avatar Nov 11 '25 10:11 salvatorecampagna