LUCENE-10311: remove complex cost estimation and abstraction leakage around it
Cost estimation drives the API complexity out of control, and we don't need it. Hopefully I've cleared up all the API damage from this explosive leak.
Instead, FixedBitSet.approximateCardinality() is used for cost estimation. TODO: let's optimize that!
#11347
Here's a first stab at what I proposed on https://github.com/apache/lucene/pull/692
You can see how damaging the current cost() implementation is.
As follow-up commits, we can add the grow(long) sugar that simply truncates, and we should optimize FixedBitSet.approximateCardinality(). After doing that, we should look around and see if there is any other similar damage to our APIs caused by the fact that FixedBitSet had a slow approximateCardinality, and fix that too.
That change makes sense to me. FWIW, my recollection from profiling DocIdSetBuilder is that the deduplication logic is cheap and most of the time is spent in LSBRadixSorter#reorder, so it's OK to always deduplicate.
If we want to add the grow(long) sugar method that simply truncates to Integer.MAX_VALUE and clean up all the points call sites, or write a cool FixedBitSet.approximateCardinality, feel free to push commits here. Otherwise I will get to these two things later and remove the draft status on the PR.
Adding the sugar method is easy; it is just work.
Implementing the approximateCardinality requires some thought and probably some benchmarking. I had in mind to just "sample" some "chunks" of the long[] and sum up Long.bitCount across the ranges. In an upcoming JDK this method will get vectorized; let's take advantage of that, so that both cardinality() and approximateCardinality() get faster: https://github.com/openjdk/jdk/pull/6857
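To make that concrete, here's a rough sketch of the sampling idea (illustrative only: it assumes direct access to the backing long[] words, and the chunk sizes are made-up numbers that benchmarking would need to tune):

```java
// Sketch: estimate cardinality by popcounting a sample of contiguous
// chunks of the backing words and scaling the result back up.
static long approximateCardinality(long[] words) {
  final int chunkWords = 64; // size of each sampled contiguous range
  final int oneInEvery = 8;  // sample one chunk out of every 8
  final int numWords = words.length;
  if (numWords < chunkWords * oneInEvery) {
    long exact = 0; // tiny sets: the exact count is cheap enough
    for (long word : words) {
      exact += Long.bitCount(word);
    }
    return exact;
  }
  long sampledBits = 0;
  long sampledWords = 0;
  for (int start = 0; start + chunkWords <= numWords; start += chunkWords * oneInEvery) {
    // Summing Long.bitCount over a contiguous range is exactly the part
    // that the vectorized bitCount in the JDK PR above would speed up.
    for (int i = start; i < start + chunkWords; i++) {
      sampledBits += Long.bitCount(words[i]);
    }
    sampledWords += chunkWords;
  }
  // Scale the sampled popcount up to the whole array.
  return (long) ((double) sampledBits / sampledWords * numWords);
}
```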
I don't think the grow(long) is necessary; we can always add it to the IntersectVisitor instead. Maybe it would be worthwhile to adjust how we call grow() in BKDReader#addAll, as it does not need the dance it is currently doing (sketched below):
https://github.com/apache/lucene/blob/8c67a3816b9060fa983b494886cd4f789be1d868/lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java#L562
The same for SimpleTextBKDReader#addAll
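For context, the dance looks roughly like this (a paraphrased, self-contained sketch; the types and names are simplified stand-ins for the real code linked above):

```java
// A `grown` flag is threaded through the recursion, and the
// leaves * pointsPerLeaf product must be overflow-checked before it
// can be fed to grow(int).
final class AddAllSketch {
  interface Visitor {
    void grow(int count);
  }

  private final int maxPointsInLeafNode;
  private final int numLeaves;

  AddAllSketch(int maxPointsInLeafNode, int numLeaves) {
    this.maxPointsInLeafNode = maxPointsInLeafNode;
    this.numLeaves = numLeaves;
  }

  void addAll(Visitor visitor, boolean grown) {
    if (grown == false) {
      final long maxPointCount = (long) maxPointsInLeafNode * numLeaves;
      if (maxPointCount <= Integer.MAX_VALUE) { // can exceed an int with >2B points
        visitor.grow((int) maxPointCount);
        grown = true;
      }
    }
    // ... recurse into children, passing `grown` along on every call ...
  }
}
```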
> I don't think the grow(long) is necessary; we can always add it to the IntersectVisitor instead. Maybe it would be worthwhile to adjust how we call grow() in BKDReader#addAll, as it does not need the dance it is currently doing
Sorry, I'm not so familiar with the code in question. Does it mean we can remove the grown parameter here and the split logic around it for the addAll() method? If so, that sounds great!
I removed the grown parameter from addAll in 4c6b436
Oh, but that might still not be correct. The Buffers implementation does not grow with the number of unique documents but with every call to BulkAdder#add, because it does not discard duplicates. What I did only works if I assume that, when given Integer.MAX_VALUE, the builder can accept any number of calls to BulkAdder#add. Something is not totally right here, as Buffers needs to know how many calls you are going to make to BulkAdder#add, not how many unique documents you are adding.
There's no way we're allowing more than Integer.MAX_VALUE calls into this Buffers thing.
Seriously, look at threshold: it's maxDoc >>> 7, and maxDoc is an int.
When you call grow(anything close to Integer.MAX_VALUE), Buffers exits the stage permanently.
64 bits are not needed.
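Spelled out, the control flow is roughly this (a simplified sketch with illustrative names, not the verbatim DocIdSetBuilder code):

```java
// Why 64 bits buy nothing: the dense threshold is maxDoc >>> 7, and
// maxDoc is an int, so any grow() near Integer.MAX_VALUE upgrades to
// FixedBitSet on the spot.
final class BuilderSketch {
  private final int threshold;  // maxDoc >>> 7, always well under 2^31
  private long totalAllocated;  // running sum of grow() requests
  private boolean usesBitSet;   // once true, buffers are gone for good

  BuilderSketch(int maxDoc) {
    this.threshold = maxDoc >>> 7;
  }

  void grow(int numDocs) {
    if (usesBitSet == false) {
      totalAllocated += numDocs;
      if (totalAllocated > threshold) {
        usesBitSet = true; // the real code copies buffered doc IDs into a FixedBitSet
      }
    }
  }
}
```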
What I want to make sure of is that this is covered in the javadocs and that we are not relying on an implementation detail.
#grow needs to be called with the number of times you are going to call BulkAdder#addDoc, in order to make sure you don't overflow the sparse data structure. That should be added to the javadocs, and maybe we should avoid the word "documents", which is causing all the confusion here.
We can add that if grow is called with Integer.MAX_VALUE, there is no limit to the number of calls to BulkAdder#addDoc, or something along those lines.
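As a rough draft of that wording (following the BulkAdder#addDoc naming used in this thread, not the final javadocs):

```java
interface BuilderApiSketch {
  /** Stand-in for DocIdSetBuilder.BulkAdder. */
  interface BulkAdder {
    void addDoc(int docID);
  }

  /**
   * Reserve space for at least {@code numDocs} further calls to
   * {@link BulkAdder#addDoc}. The count is the number of calls, not the
   * number of unique documents: the sparse representation does not discard
   * duplicates, so every call counts. If {@code Integer.MAX_VALUE} is
   * passed, there is no limit on the number of subsequent calls.
   */
  BulkAdder grow(int numDocs);
}
```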
wdyt?
Finally, we might need to modify AssertingLeafReader, as it asserts that you call #grow with the number of points you are going to visit, which in this case is not always true.
If we add grow(long) that simply truncates and forwards, then it encapsulates this within this class. The code stays simple, and the caller doesn't need to know about it.
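i.e. something like this one-liner (a sketch, assuming it sits next to the existing grow(int) on DocIdSetBuilder):

```java
/**
 * Sugar: truncate to int and forward. Anything at or beyond the dense
 * threshold upgrades to FixedBitSet anyway, so truncation loses nothing.
 */
public BulkAdder grow(long numDocs) {
  return grow((int) Math.min(numDocs, Integer.MAX_VALUE));
}
```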
+1 That would hide the implementation details from users.
@iverase @jpountz I "undrafted" the PR and added a commit with the grow(long) that just truncates and forwards. It seems like the best compromise based on the discussion above.
I also made some minor tweaks to the javadoc to try to simplify the explanation of what the grow parameter means. Again, it is kind of academic when you think about it: values larger than maxDoc >> 8 are not really needed by any code, because we switch to the FixedBitSet. But the one-liner method doesn't bother me that much; I am just after keeping the logic simple and the abstractions minimal.
For the record, this DocIdSetBuilder.Buffer has been insanely damaging to our code; I'm still here trying to calm down the explosion of horribleness it caused.
I opened https://issues.apache.org/jira/browse/LUCENE-10443 as a second pass on this PR, to really dig into the situation. Personally, I am in favor of switching back to SparseFixedBitSet.
Sticking with bitsets helps defend against so many bad decisions, such as bringing long into these APIs when it's not needed. I actually don't mind if performance drops a bit by using SparseFixedBitSet instead of this horrible "buffer". We have to take care of complexity.
I reverted adding the grow(long) helper. I won't be the one bringing 64 bits into this API. It builds doc-ID sets; it is an implementation detail that for small sets it may use an inefficient list (for now).
I reverted the changes to the BKD tree, as they are inconsistent with the current AssertingLeafReader implementation.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!