LUCENE-10560: Faster merging of TermsEnum

Open jpountz opened this issue 2 years ago • 0 comments

This commit adds a new TermsEnumIndex abstraction in oal.index (org.apache.lucene.index) that wraps a TermsEnum together with the index of the segment it belongs to. It can be used to build priority queues that merge TermsEnum instances, whether they come from the inverted index or from doc values. In both cases, a long holding the first 8 bytes of the term is computed to speed up comparisons. In the doc-values case, OrdinalMap also uses seek-by-ord to reason about prefixes shared across entire windows of terms, so that these shared prefixes are not compared again each time the queue is re-ordered. This should especially help fields whose values share long common prefixes, such as URLs.
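To illustrate the 8-byte prefix idea, here is a minimal standalone sketch. It is not the code from this commit: the class and method names (PrefixKeyDemo, prefix8, compareTerms) are hypothetical, and it uses plain byte arrays rather than Lucene's BytesRef. It assumes terms are ordered as unsigned bytes, packs the first 8 bytes big-endian into a long (zero-padding shorter terms), and falls back to a full byte comparison only when the two prefixes are equal.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not the TermsEnumIndex implementation.
final class PrefixKeyDemo {

  // Packs up to the first 8 bytes of the term into a long, most significant
  // byte first. Terms shorter than 8 bytes are padded with zero bytes.
  static long prefix8(byte[] term) {
    long prefix = 0L;
    int n = Math.min(8, term.length);
    for (int i = 0; i < n; i++) {
      prefix |= (term[i] & 0xFFL) << (56 - 8 * i);
    }
    return prefix;
  }

  // Compares two terms in unsigned byte order. The cached prefixes settle
  // most comparisons with a single long comparison; the full comparison is
  // only needed when the prefixes are identical.
  static int compareTerms(byte[] a, long aPrefix, byte[] b, long bPrefix) {
    if (aPrefix != bPrefix) {
      return Long.compareUnsigned(aPrefix, bPrefix);
    }
    return Arrays.compareUnsigned(a, b);
  }

  public static void main(String[] args) {
    byte[] a = "http://example.com/a".getBytes(StandardCharsets.UTF_8);
    byte[] b = "http://example.com/b".getBytes(StandardCharsets.UTF_8);
    // Both URLs share the same first 8 bytes, so this call falls through
    // to the full byte comparison.
    System.out.println(compareTerms(a, prefix8(a), b, prefix8(b)) < 0);
  }
}
```

The URL example also shows why the prefix alone is not enough for such fields: when many terms share their first 8 bytes, every comparison falls back to the slow path, which is the case the shared-prefix reasoning in OrdinalMap targets.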

On luceneutil's OrdinalMap benchmark, construction time dropped by 30.5% for the id field and by 17.5% for the name field.

JIRA: LUCENE-10560

jpountz, Jul 29 '22 12:07