Efficient iteration over deleted doc values
Description
As part of https://github.com/apache/lucene/pull/14439, we introduced efficient histogram collection using PointTrees. Unfortunately, the optimization falls apart even with single deleted document in a segment. I am wondering if there is a way to efficiently iterate over deleted doc values and correct the values for each bucket.
I had thought about it a while back (and kind of forgot) and @mikemccand brought it up again during my talk on Efficient Histogram Collection in Lucene with BulkCollector at recently concluded Apache Community over Code
Hey @jainankitk -- I brought this up on the dev list last year here: https://lists.apache.org/thread/jxczhgn5loqwn10xrb30k2hg0jrbovcs
As Adrien called out in that thread, you can use nextClearBit() to step through the deleted docs, but as I called out, that's O(maxDoc). I had the idea of storing an array of doc IDs, but that would require a binary search to do random lookups. Uwe had a nice idea of using a SparseFixedBitSet if the number of deletes is small, which I think would give us the best of both worlds (O(1) random access and O(deletedDocs) traversal).
Hi @jainankitk @msfroh,
I noticed this issue is related to #13084, which proposes using SparseFixedBitSet for sparse deletes. That approach would enable O(deletedDocs) iteration which should help with the histogram performance issue described here.
If no one is actively working on that, I'd be interested in implementing the solution from #13084. This issue could serve as a good validation case for that work. Let me know if you have any concerns or were planning to tackle this!
If no one is actively working on that, I'd be interested in implementing the solution from https://github.com/apache/lucene/issues/13084. This issue could serve as a good validation case for that work. Let me know if you have any concerns or were planning to tackle this!
That sounds great, thank you @salvatorecampagna!
Thanks @salvatorecampagna for expressing interest. Let me review the proposal - https://github.com/apache/lucene/issues/13084#issuecomment-3496411381, and provide feedback on that issue itself.
We can keep this issue for tracking the work validation by directly using the deletedDocsIterator() API.
FYI: PR #15413 adds efficient deleted document iteration that might help with this issue.