Reuse BitSet when there are deleted documents in the index instead of creating new BitSet
Description
Fixes issue: #12414
Before this change, whenever there were deletions in the index we created a new BitSet from the matched docs and the live docs. Building it required iterating over all matched docs, which is a time-consuming, linear-time process. This is not necessary when the iterator is a BitSetIterator instance.
With this change, the matched docs and the live docs are wrapped in a single Bits instance that can be passed directly to the HNSW search. Now, when a node is explored during HNSW search, we check whether the doc is accepted by testing the bits of both matchedDocs and liveDocs, which is a constant-time operation. The cost of the new acceptDocs (int cost = acceptDocs.length()) is not exact, but it gives an upper bound, since it does not take the live-docs count into account.
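A minimal sketch of the idea (not the exact code in this PR; the class and field names here are illustrative): a single Bits implementation that consults both the matched-docs bitset and the live docs at lookup time.

```java
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.Bits;

final class MatchedAndLiveBits implements Bits {
  private final BitSet matchedDocs; // bits backing the BitSetIterator
  private final Bits liveDocs;      // encodes deletions in the segment

  MatchedAndLiveBits(BitSet matchedDocs, Bits liveDocs) {
    this.matchedDocs = matchedDocs;
    this.liveDocs = liveDocs;
  }

  @Override
  public boolean get(int docId) {
    // Constant-time check against both structures; no upfront iteration needed
    return matchedDocs.get(docId) && liveDocs.get(docId);
  }

  @Override
  public int length() {
    // Upper bound only: deleted docs are not subtracted here
    return matchedDocs.length();
  }
}
```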
Thanks @shubhamvishu for taking a look.
I went through the change, but I didn't understand how we are not reusing the bitset in the current approach. We do wrap the BitSetIterator with a FilteredDocIdSetIterator when there are deleted docs, right? That would eventually use the bitset to advance the inner iterator (see this).
Sorry! I think I should have used a different title for this PR. The part of the current approach that I am trying to optimize is the case where the iterator is a BitSetIterator instance and live docs are not null. In the current approach we create a new BitSet while taking live docs into consideration, and that creation is a linear-time process, because we need to iterate over all matched docs to build it. This BitSet creation is not required: we can wrap both the matched-docs bitset and the live-docs bitset in a single Bits instance, which can then be used directly during approximate search. So instead of creating a new BitSet, we compute at runtime whether a document is valid for searching, which saves the time spent building the new BitSet. A rough sketch of the existing path is below.
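For contrast, here is a rough sketch (under my own assumptions, not a quote of the current Lucene code) of the linear-time path being described: building a fresh FixedBitSet by walking every matched doc and consulting liveDocs.

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

final class FilteredBitSetBuilder {
  static FixedBitSet buildAcceptDocs(DocIdSetIterator matchedDocs, Bits liveDocs, int maxDoc)
      throws IOException {
    FixedBitSet bits = new FixedBitSet(maxDoc);
    for (int doc = matchedDocs.nextDoc();
        doc != DocIdSetIterator.NO_MORE_DOCS;
        doc = matchedDocs.nextDoc()) {
      if (liveDocs == null || liveDocs.get(doc)) {
        bits.set(doc); // one pass over all matched docs: O(number of matches)
      }
    }
    return bits;
  }
}
```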
@kaivalnp We could use acceptDocs.cardinality() when it's a BitSetIterator to get the upper bound, which might include some deleted docs, but that could still sometimes change the decision of whether to fall back to exact search, since we don't know how many of those docs are live (we know the number of deletes in the segment, but not the intersection of the two). One thing that might be tried is a heuristic that applies a penalty to the cost based on the number of deletes in the segment (i.e. ctx.reader().numDeletedDocs()/ctx.reader().maxDoc()). For example, if there are 10% deletes we could decrease the cost by 10%, or maybe 5%. This might help in cases where we currently miss falling back to exact search, though it would need thorough benchmarking to see what works best.
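A hypothetical sketch of that heuristic (nothing like this exists in Lucene; the class, method name, and penalty factor are made up for illustration): shrink the cost estimate in proportion to the segment's delete ratio before choosing between approximate and exact search.

```java
import org.apache.lucene.index.LeafReader;

final class CostHeuristic {
  static int adjustedCost(int upperBoundCost, LeafReader reader) {
    double deleteRatio = (double) reader.numDeletedDocs() / reader.maxDoc();
    // e.g. 10% deletes -> reduce the estimated cost by ~10%
    return (int) (upperBoundCost * (1.0 - deleteRatio));
  }
}
```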
On a separate note, I'm wondering if there is a use case where we don't need to know this cost upfront and could, for instance, go directly to approximate search. Currently this optimization only kicks in when the iterator is a BitSetIterator, but if it's possible to skip the cost step, or to get the cost through some other heuristic/approximation, then we could make it completely lazily evaluated using DISI#advance(docid) for those use cases. @msokolov @benwtrent Maybe you could share your thoughts on this?
Is our goal memory usage or speed?
We could use FixedBitSet#intersectionCount and keep from having to create a new bit set that is the intersection.
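A small sketch of that suggestion (assuming both the matched docs and the live docs are available as FixedBitSet instances, which is not always the case for liveDocs):

```java
import org.apache.lucene.util.FixedBitSet;

final class IntersectionCost {
  static long cost(FixedBitSet matchedDocs, FixedBitSet liveDocs) {
    // Counts bits set in both sets without allocating an intermediate BitSet
    return FixedBitSet.intersectionCount(matchedDocs, liveDocs);
  }
}
```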
I am honestly not sure the implementation here is any faster than just creating the bit set upfront and checking it. During search, you now have to check two bitsets instead of one.
If the filter's cardinality happens to be smaller than the number of docs visited in a typical search, your implementation here seems like it would be slower.
Broad feedback: any "optimizations" without benchmarking aren't optimizations, they are just guesses.
I am curious to see if this helps CPU usage in anyway. I could see it helping memory usage.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
I don't fully understand this change, but it looks like it is stalled on proving it shows lower CPU and/or heap/GC load?
Could we benchmark this change using luceneutil? It's able to create vector indices that have X% deletions and then run KNNByte/FloatVectorQuery...
Thanks @mikemccand for the pointers. Will try to run benchmarks on this change.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!