LUCENE-10333: Speed up BinaryDocValues with a batch reading on LongValues
See https://issues.apache.org/jira/browse/LUCENE-10333 #11369
PS: I'm pushing this code early to see if the approach makes sense to you; I'll add more tests if it passes the first round of review :)
Personally I'm not a fan of the API change on LongValues and the additional complexity in DirectReader just to read blocks of 2 values at a time.
I'd be more interested in a more general solution: e.g. a new docvaluesformat that compresses the values in blocks with FOR/PFOR, similar to the compression of the postings lists? It should be possible to do now that DocValues has forward-only next/advance api (again similar to Postings).
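To make the suggestion concrete, here is a toy illustration of frame-of-reference (FOR) encoding, the family of schemes being proposed. The class and method names are hypothetical, not Lucene APIs, and Lucene's real ForUtil additionally bit-packs the deltas rather than keeping them as longs:

```java
// Toy frame-of-reference (FOR) encoding: store the block minimum once,
// then only the small deltas from it, which need far fewer bits per value.
public class ForExample {

    // Encodes a block as deltas from its minimum; the minimum is written
    // into minOut[0] so the caller can store it alongside the block.
    static long[] encode(long[] block, long[] minOut) {
        long min = Long.MAX_VALUE;
        for (long v : block) min = Math.min(min, v);
        long[] deltas = new long[block.length];
        for (int i = 0; i < block.length; i++) deltas[i] = block[i] - min;
        minOut[0] = min;
        return deltas;
    }

    // Recovers the value at `offset` from the stored minimum and deltas.
    static long decode(long[] deltas, long min, int offset) {
        return min + deltas[offset];
    }

    public static void main(String[] args) {
        long[] block = {1000, 1003, 1001, 1007};
        long[] min = new long[1];
        long[] deltas = encode(block, min);
        System.out.println(decode(deltas, min[0], 3)); // prints 1007
    }
}
```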
Thanks a lot for so quick feedback! Using FOR/PFOR sounds promising, i'll give it a try.
Might be easiest to prototype a new DocValuesFormat (I would fork the existing Lucene90 one as a start)? That's how I would attack the problem. Then I'd try to remove (at least some) usages of stuff like DirectReader and instead use iterators that decompress FOR/PFOR blocks, much like the postings lists?
It's a heavy-duty investigation / big task, as there are two cases (dense and sparse). For the dense case, if we want docID 500, we know how to get to its value (500 % BLOCKSIZE gives the offset within its block). For the sparse case, some docs may be missing: perhaps we also need skip data, again like the postings do.
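The dense-case arithmetic above can be sketched as follows (the class name and BLOCK_SIZE are illustrative, not Lucene code; in the dense case every doc has a value, so the value index equals the docID):

```java
// Sketch of dense-case block addressing: a docID maps directly to a
// (block, offset) pair, so we know exactly which compressed block to
// decode and where the value sits inside it.
public class BlockAddressing {
    static final int BLOCK_SIZE = 128; // e.g. a FOR/PFOR block size like postings use

    // Which compressed block holds this doc's value.
    static int blockIndex(int docId) {
        return docId / BLOCK_SIZE;
    }

    // Position of the value inside that block.
    static int offsetInBlock(int docId) {
        return docId % BLOCK_SIZE;
    }

    public static void main(String[] args) {
        // docID 500 with a block size of 128 lives in block 3, at offset 116
        System.out.println(blockIndex(500) + " " + offsetInBlock(500)); // prints 3 116
    }
}
```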
Thank you very much for your guidance! This is a great help to me. I will fork a new DocValuesFormat for this experiment.
Actually, what I thought at first was to only change the structure of the addresses, implementing a new LongValues, e.g. a ForUtilLongValues, to replace the DirectReader or DirectMonotonicReader used to read addresses. When a caller looks up a long by index, it would use ForUtil to decompress the required block (caching the block, of course, so that a subsequent call within the same block can reuse it). Then it seems I wouldn't need to worry about skip data for the sparse case, as IndexedDISI already maps the current docID to its index. That way, all I'd need to implement is the new LongValues. I wonder if this approach makes sense?
Thanks again for the quick feedback and your patience!
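As a rough sketch of the cached-block idea described above (hypothetical names throughout; a real implementation would decode compressed bytes with Lucene's ForUtil from an IndexInput, and here a plain long[][] of pre-built blocks stands in for the on-disk data):

```java
// Hypothetical "ForUtilLongValues": a LongValues-style reader that decodes
// one block at a time and caches it, so sequential get() calls that fall in
// the same block reuse the decoded values instead of decoding again.
public class ForUtilLongValues {
    static final int BLOCK_SIZE = 128;

    private final long[][] encodedBlocks;         // stand-in for on-disk FOR blocks
    private final long[] buffer = new long[BLOCK_SIZE];
    private int cachedBlock = -1;                 // which block `buffer` currently holds

    ForUtilLongValues(long[][] encodedBlocks) {
        this.encodedBlocks = encodedBlocks;
    }

    public long get(long index) {
        int block = (int) (index / BLOCK_SIZE);
        if (block != cachedBlock) {               // decode only on a block miss
            decode(block);
            cachedBlock = block;
        }
        return buffer[(int) (index % BLOCK_SIZE)];
    }

    private void decode(int block) {
        // In the real design this would be a ForUtil decode of compressed bytes.
        System.arraycopy(encodedBlocks[block], 0, buffer, 0, BLOCK_SIZE);
    }

    public static void main(String[] args) {
        long[][] blocks = new long[8][BLOCK_SIZE];
        for (int b = 0; b < blocks.length; b++)
            for (int i = 0; i < BLOCK_SIZE; i++)
                blocks[b][i] = (long) b * BLOCK_SIZE + i; // value == global index
        ForUtilLongValues values = new ForUtilLongValues(blocks);
        System.out.println(values.get(500)); // prints 500
        System.out.println(values.get(501)); // same block: served from the cache
    }
}
```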
This seems worth investigating! It would just speed up the "addressing" without touching the rest of doc values (which could be a much more complex change).
FYI @gsmiller and I looked into using a FOR-like encoding for doc values at https://issues.apache.org/jira/browse/LUCENE-10033. In short, it made queries that look at most docs, like the Browse* faceting tasks, much faster, but queries that only look at a subset of the index, like HighTermMonthSort, much slower.
I like this change: it's a net win that will speed up not only binary doc values but also sorted set and sorted numeric doc values, which likewise use a direct monotonic reader to map doc IDs to value indexes? DirectMonotonicReader#binarySearch uses a similar idea to avoid doing the same operation over and over again. Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?
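To illustrate why reading two consecutive values at once helps here (a hypothetical sketch, not the actual Lucene API): a binary doc value's bytes live between addresses[doc] and addresses[doc + 1], so a single call that returns both, like the `twin`-style get mentioned later in this thread, lets the reader decode the underlying block once instead of twice when both indexes fall in the same block:

```java
// Sketch of a "twin" read against a monotonic address array: each binary
// doc value lookup needs two adjacent longs (start and end offset), so
// filling both in one call avoids a second block decode.
public class TwinRead {
    private final long[] addresses; // stand-in for monotonically increasing offsets

    TwinRead(long[] addresses) { this.addresses = addresses; }

    // Fills dest[0] = start offset and dest[1] = end offset for `index`.
    void get(long index, long[] dest) {
        dest[0] = addresses[(int) index];
        dest[1] = addresses[(int) index + 1];
    }

    public static void main(String[] args) {
        TwinRead r = new TwinRead(new long[] {0, 5, 12, 20});
        long[] twin = new long[2];
        r.get(1, twin);                              // doc 1's bytes span [5, 12)
        System.out.println(twin[0] + " " + twin[1]); // prints 5 12
    }
}
```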
@jpountz Thanks a lot for the suggestion!

> Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?

Actually, DirectMonotonicReader is not the only impl using this: line 176 of DirectMonotonicReader could call DirectReader#get(index, twin). I tried limiting the change to DirectMonotonicReader, and luceneutil shows the QPS increase (mentioned in the Jira) dropped from 20%+ to around 7%.
> I like this change: it's a net win that will speed up not only binary doc values but also sorted set and sorted numeric doc values, which likewise use a direct monotonic reader to map doc IDs to value indexes? DirectMonotonicReader#binarySearch uses a similar idea to avoid doing the same operation over and over again. Let's update this pull request to limit the API impact to DirectMonotonicReader, since it's the only LongValues impl where we ever need to read two consecutive values at the same time?
I guess I see it as optimizing corner cases, which is why the improvements aren't reflected in the current luceneutil benchmarks? But as I mentioned, my biggest concern is adding stuff to the LongValues API for this; IMO it isn't justified. I'm also not a fan of doubling the compression logic for every BPV. But I'm not personally going to block the change; just please help out on the LongValues API part, and I'd like to see some realistic use cases with benchmarks showing that it speeds things up.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!