lucene Binary search the entries when all suffixes have the same length in a leaf block.

In scanToTerm phase, binary search may have a better performance when all suffixes have the same length in a leaf block.

Aug 26 '22 12:08 vsop-479

@jpountz Thanks for your review. I did a simple performance test, which indexed 1M random UUID's substring(2, 8), got 10 segments, and picked up 1K terms to search. Here is a average result of 4 times tests:

Method took:

baseline(scanToTermLeaf)ns	candidate(binarySearchTermLeaf)ns	speedup
5,757,121.5	4,761,325.5	20.9%

Whole search took:

baseline(scan)ns	candidate(binarySearch)ns	speedup
162,668,448	157,990,611	2.9%

In my test case, scanToTerm only took 3.5% of the whole search execute time, so it could only got a small speedup. I may add this test case to BasePostingsFormatTestCase, or do you have any other idea on test? I have updated the comment, please have a review.

Sep 23 '22 06:09 vsop-479

I may add this test case to BasePostingsFormatTestCase, or do you have any other idea on test?

1M documents is too much for a unit test, I was thinking of a smaller dataset, e.g. 200 fixed-size IDs and we'd make sure that the binary search works as expected for both seekCeil and seekExact for every of these 200 terms as well as other terms that compare less than all terms from the dict, greater than all terms of the dict, or are between two terms that exist in the dict?

Sep 23 '22 13:09 jpountz

200 fixed-size IDs and we'd make sure that the binary search works as expected for both seekCeil and seekExact for every of these 200 terms as well as other terms that compare less than all terms from the dict, greater than all terms of the dict, or are between two terms that exist in the dict?

Got it. I think it is a good idea for a unit test.

Sep 23 '22 22:09 vsop-479

@jpountz I have added a test to BasePostingsFormatTestCase. Please have a review.

Sep 26 '22 12:09 vsop-479

@jpountz Thranks for your review and suggestion. I have added a CHANGES entry and the code that assert the value of termsEnum.term() is correct after seeking. Please have a review.

Sep 28 '22 07:09 vsop-479

@vsop-479 I addressed failures related to formatting but there seems to still be a test failure.

Oct 04 '22 07:10 jpountz

@jpountz Thranks for your commit. Yes, there seems to be some problem. I am trying to figure it out.

Oct 04 '22 08:10 vsop-479

@jpountz I have fixed the code and passed the tests locally. Please have a review.

Oct 04 '22 14:10 vsop-479

@mikemccand You might want to have a look at this change since (I think) you are one of the most familiar ones with the original code.

Oct 18 '22 09:10 jpountz

I had to revert this change because of test failures, e.g. this seed reproduces on the main branch:

gradlew test --tests TestNumericDocValuesUpdates.testSortedIndex -Dtests.seed=4C6E977E1F29E069 -Dtests.locale=khq -Dtests.timezone=Asia/Hong_Kong -Dtests.asserts=true -Dtests.file.encoding=UTF-8

It seems like this test exercises some logic that BasePostingsFormatTestCase doesn't but I haven't figured out what yet.

Oct 18 '22 12:10 jpountz

@jpountz I have fixed the code and passed the test, but i can’t find the new commit in this PR (because it has been merged and reverted ?). What should i do to make the new commit shows up? Does a new PR makes sense?

Oct 28 '22 10:10 vsop-479

A new PR would be great, thank you for looking into these failures!

Oct 28 '22 11:10 jpountz

lucene lucene copied to clipboard

Binary search the entries when all suffixes have the same length in a leaf block.

lucene
lucene copied to clipboard