LUCENE-10333: Speed up BinaryDocValues with a batch reading on LongValues
See https://issues.apache.org/jira/browse/LUCENE-10333 #11369
PS: I'm pushing this code early to see if the approach makes sense to you; I'll add more tests if it passes the first round of review :)
Personally I'm not a fan of the API change on LongValues and the additional complexity in DirectReader just to read blocks of 2 values at a time.
I'd be more interested in a more general solution: e.g. a new docvaluesformat that compresses the values in blocks with FOR/PFOR, similar to the compression of the postings lists? It should be possible to do now that DocValues has forward-only next/advance api (again similar to Postings).
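To make the suggestion concrete, here is a toy illustration of frame-of-reference (FOR) encoding, the family of schemes being proposed. The class and method names are hypothetical, not Lucene APIs, and Lucene's real ForUtil additionally bit-packs the deltas rather than keeping them as longs:

```java
// Toy frame-of-reference (FOR) encoding: store the block minimum once,
// then only the small deltas from it, which need far fewer bits per value.
public class ForExample {

    // Encodes a block as deltas from its minimum; the minimum is written
    // into minOut[0] so the caller can store it alongside the block.
    static long[] encode(long[] block, long[] minOut) {
        long min = Long.MAX_VALUE;
        for (long v : block) min = Math.min(min, v);
        long[] deltas = new long[block.length];
        for (int i = 0; i < block.length; i++) deltas[i] = block[i] - min;
        minOut[0] = min;
        return deltas;
    }

    // Recovers the value at `offset` from the stored minimum and deltas.
    static long decode(long[] deltas, long min, int offset) {
        return min + deltas[offset];
    }

    public static void main(String[] args) {
        long[] block = {1000, 1003, 1001, 1007};
        long[] min = new long[1];
        long[] deltas = encode(block, min);
        System.out.println(decode(deltas, min[0], 3)); // prints 1007
    }
}
```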
Thanks a lot for so quick feedback! Using FOR/PFOR sounds promising, i'll give it a try.
Might be easiest to prototype a new DocValuesFormat (I would fork the existing Lucene90 one as a start)? That's how I would attack the problem. Then I'd try to remove (at least some) usages of stuff like DirectReader and instead use iterators that decompress FOR/PFOR blocks, much like the postings lists?
It's a heavy-duty investigation / big task, as there are two cases (dense and sparse). For the dense case, if we want docID 500, we know how to get to its value (500 % BLOCKSIZE gives the offset within its block). For the sparse case, some docs may be missing: perhaps we also need skip data, again like the postings do.
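The dense-case arithmetic above can be sketched as follows (the class name and BLOCK_SIZE are illustrative, not Lucene code; in the dense case every doc has a value, so the value index equals the docID):

```java
// Sketch of dense-case block addressing: a docID maps directly to a
// (block, offset) pair, so we know exactly which compressed block to
// decode and where the value sits inside it.
public class BlockAddressing {
    static final int BLOCK_SIZE = 128; // e.g. a FOR/PFOR block size like postings use

    // Which compressed block holds this doc's value.
    static int blockIndex(int docId) {
        return docId / BLOCK_SIZE;
    }

    // Position of the value inside that block.
    static int offsetInBlock(int docId) {
        return docId % BLOCK_SIZE;
    }

    public static void main(String[] args) {
        // docID 500 with a block size of 128 lives in block 3, at offset 116
        System.out.println(blockIndex(500) + " " + offsetInBlock(500)); // prints 3 116
    }
}
```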
Thank you very much for your guidance! This is a great help to me. I will fork a new DocValuesFormat for this experiment.
Actually, what I thought at first was to only change the structure of the addresses, implementing a new LongValues, e.g. a ForUtilLongValues, to replace the DirectReader or DirectMonotonicReader used to read addresses. When a caller looks up a long by index, it would use ForUtil to decompress the required block (caching the block, of course, so that a subsequent call within the same block can reuse it). Then it seems I wouldn't need to worry about skip data for the sparse case, as IndexedDISI already maps the current docID to its index. That way, all I'd need to implement is the new LongValues. I wonder if this approach makes sense?
Thanks again for the quick feedback and your patience!
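As a rough sketch of the cached-block idea described above (hypothetical names throughout; a real implementation would decode compressed bytes with Lucene's ForUtil from an IndexInput, and here a plain long[][] of pre-built blocks stands in for the on-disk data):

```java
// Hypothetical "ForUtilLongValues": a LongValues-style reader that decodes
// one block at a time and caches it, so sequential get() calls that fall in
// the same block reuse the decoded values instead of decoding again.
public class ForUtilLongValues {
    static final int BLOCK_SIZE = 128;

    private final long[][] encodedBlocks;         // stand-in for on-disk FOR blocks
    private final long[] buffer = new long[BLOCK_SIZE];
    private int cachedBlock = -1;                 // which block `buffer` currently holds

    ForUtilLongValues(long[][] encodedBlocks) {
        this.encodedBlocks = encodedBlocks;
    }

    public long get(long index) {
        int block = (int) (index / BLOCK_SIZE);
        if (block != cachedBlock) {               // decode only on a block miss
            decode(block);
            cachedBlock = block;
        }
        return buffer[(int) (index % BLOCK_SIZE)];
    }

    private void decode(int block) {
        // In the real design this would be a ForUtil decode of compressed bytes.
        System.arraycopy(encodedBlocks[block], 0, buffer, 0, BLOCK_SIZE);
    }

    public static void main(String[] args) {
        long[][] blocks = new long[8][BLOCK_SIZE];
        for (int b = 0; b < blocks.length; b++)
            for (int i = 0; i < BLOCK_SIZE; i++)
                blocks[b][i] = (long) b * BLOCK_SIZE + i; // value == global index
        ForUtilLongValues values = new ForUtilLongValues(blocks);
        System.out.println(values.get(500)); // prints 500
        System.out.println(values.get(501)); // same block: served from the cache
    }
}
```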
This seems worth investigating! It would just speed up the "addressing" without touching the rest of doc values (which could be a much more complex change).
FYI @gsmiller and I looked into using a FOR-like encoding for doc values at https://issues.apache.org/jira/browse/LUCENE-10033. In short, it made queries that look at most docs, like the Browse* faceting tasks, much faster, but queries that only look at a subset of the index, like HighTermMonthSort, much slower.
I like this change: it's a net win that will speed up not only binary doc values but also sorted set and sorted numeric doc values, which likewise use a direct monotonic reader to map doc IDs to value indexes? DirectMonotonicReader#binarySearch uses a similar idea to avoid doing the same operation over and over again. Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?
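To illustrate why reading two consecutive values at once helps here (a hypothetical sketch, not the actual Lucene API): a binary doc value's bytes live between addresses[doc] and addresses[doc + 1], so a single call that returns both, like the `twin`-style get mentioned later in this thread, lets the reader decode the underlying block once instead of twice when both indexes fall in the same block:

```java
// Sketch of a "twin" read against a monotonic address array: each binary
// doc value lookup needs two adjacent longs (start and end offset), so
// filling both in one call avoids a second block decode.
public class TwinRead {
    private final long[] addresses; // stand-in for monotonically increasing offsets

    TwinRead(long[] addresses) { this.addresses = addresses; }

    // Fills dest[0] = start offset and dest[1] = end offset for `index`.
    void get(long index, long[] dest) {
        dest[0] = addresses[(int) index];
        dest[1] = addresses[(int) index + 1];
    }

    public static void main(String[] args) {
        TwinRead r = new TwinRead(new long[] {0, 5, 12, 20});
        long[] twin = new long[2];
        r.get(1, twin);                              // doc 1's bytes span [5, 12)
        System.out.println(twin[0] + " " + twin[1]); // prints 5 12
    }
}
```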
@jpountz Thanks a lot for the suggestion!

> Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?

Actually, DirectMonotonicReader is not the only impl using this: line 176 of DirectMonotonicReader could call DirectReader#get(index, twin). I tried limiting the change to DirectMonotonicReader, and luceneutil shows the QPS increase (mentioned in the Jira) dropped from 20%+ to around 7%.
> I like this change: it's a net win that will speed up not only binary doc values but also sorted set and sorted numeric doc values, which likewise use a direct monotonic reader to map doc IDs to value indexes? DirectMonotonicReader#binarySearch uses a similar idea to avoid doing the same operation over and over again. Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?
I guess I see it as optimizing corner cases, which is why the improvements aren't reflected in the current luceneutil benchmarks? But as I mentioned, my biggest concern is adding stuff to the LongValues API for this; IMO it isn't justified. I'm also not a fan of doubling the compression logic for every BPV. But I'm not personally going to block the change; just please help out on the LongValues API part, and I'd like to see some realistic use cases with benchmarks showing that it speeds things up.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!