lucene icon indicating copy to clipboard operation
lucene copied to clipboard

LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues

Open iverase opened this issue 2 years ago • 1 comments

This PR proposes to add a new method to SortedDocValues that helps users to advance an iterator to the next document that contains a different term that the current document, which can be specially useful when the index is sorted by this field.

The method contains a default implementation but this PR produces as well a fast implementation when the index is sorted by this field and it has low cardinality. In this case we write to disk a jump table that allows to quickly skip documents instead of manually iterating through the docs.

In https://issues.apache.org/jira/browse/LUCENE-10396 it is discussed some of the use cases where this method can be used, for example computing the number of unique values for documents that match a query. On the other hand, it diverges from the sparse index approach but as this ids less intrusive, it seems appealing.

Note that in order to handle backwards compatibility, I have increase the version of the codec instead of creating a new one.

#11432

iverase avatar Jun 24 '22 12:06 iverase

I make a quick check if this patch by indexing 50 million documents in a sorted index. The documents just contain a SortedDocValues with a 10 bytes term. I checked the index size and the speed of retrieving the first document per term with different cardinalities and the results looks like:

Cardinality ~1000

                       |  without patch       | with patch           
Index Size (MB)        |  2.800084114074707   |  2.8039379119873047 
average advanceOrd (s) |  0.39255053534999995 |  0.0011012437999999999

Cardinality ~10000

                       |  without patch       | with patch                             
Index Size (MB)        |  16.125946044921875  |  16.164132118225098
average advanceOrd (s) |  0.52939177705       |  0.01008831655

Cardinality ~10000

                       |  without patch       | with patch           
Index Size (MB)        | 49.320682525634766   |  49.57721138000488
average advanceOrd (s) |  0.5479114709999999  |  0.03804306865

Cardinality ~50000

                       |  without patch       | with patch           
Index Size (MB)        |   52.81498718261719  |  53.66002082824707 
average advanceOrd (s) |   0.6515335270999999 |  0.06898821255000001

The new jump table is tiny compared to the size of the doc value while this new way of navigation os at least one order of magnitude faster.

iverase avatar Jun 28 '22 10:06 iverase

@iverase I was playing with this idea a little bit for a use-case I'm working on. It didn't pan out unfortunately, but in the process, I did take the time to rebase this change on the tip of main. Do you mind if I push the rebase to your PR branch since I did the work to rebase?

gsmiller avatar Mar 30 '23 15:03 gsmiller

Sure!

iverase avatar Mar 30 '23 16:03 iverase