lucene
lucene copied to clipboard
LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues
This PR proposes to add a new method to SortedDocValues that helps users to advance an iterator to the next document that contains a different term that the current document, which can be specially useful when the index is sorted by this field.
The method contains a default implementation but this PR produces as well a fast implementation when the index is sorted by this field and it has low cardinality. In this case we write to disk a jump table that allows to quickly skip documents instead of manually iterating through the docs.
In https://issues.apache.org/jira/browse/LUCENE-10396 it is discussed some of the use cases where this method can be used, for example computing the number of unique values for documents that match a query. On the other hand, it diverges from the sparse index approach but as this ids less intrusive, it seems appealing.
Note that in order to handle backwards compatibility, I have increase the version of the codec instead of creating a new one.
#11432
I make a quick check if this patch by indexing 50 million documents in a sorted index. The documents just contain a SortedDocValues with a 10 bytes term. I checked the index size and the speed of retrieving the first document per term with different cardinalities and the results looks like:
Cardinality ~1000
| without patch | with patch
Index Size (MB) | 2.800084114074707 | 2.8039379119873047
average advanceOrd (s) | 0.39255053534999995 | 0.0011012437999999999
Cardinality ~10000
| without patch | with patch
Index Size (MB) | 16.125946044921875 | 16.164132118225098
average advanceOrd (s) | 0.52939177705 | 0.01008831655
Cardinality ~10000
| without patch | with patch
Index Size (MB) | 49.320682525634766 | 49.57721138000488
average advanceOrd (s) | 0.5479114709999999 | 0.03804306865
Cardinality ~50000
| without patch | with patch
Index Size (MB) | 52.81498718261719 | 53.66002082824707
average advanceOrd (s) | 0.6515335270999999 | 0.06898821255000001
The new jump table is tiny compared to the size of the doc value while this new way of navigation os at least one order of magnitude faster.
@iverase I was playing with this idea a little bit for a use-case I'm working on. It didn't pan out unfortunately, but in the process, I did take the time to rebase this change on the tip of main
. Do you mind if I push the rebase to your PR branch since I did the work to rebase?
Sure!