lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Binary search the entries when all suffixes have the same length in a leaf block.

Open vsop-479 opened this issue 2 years ago • 8 comments

In scanToTerm phase, binary search may have a better performance when all suffixes have the same length in a leaf block.

vsop-479 avatar Aug 26 '22 12:08 vsop-479

@jpountz Thanks for your review. I did a simple performance test, which indexed 1M random UUID's substring(2, 8), got 10 segments, and picked up 1K terms to search. Here is a average result of 4 times tests:

Method took:

baseline(scanToTermLeaf)ns candidate(binarySearchTermLeaf)ns speedup
5,757,121.5 4,761,325.5 20.9%

Whole search took:

baseline(scan)ns candidate(binarySearch)ns speedup
162,668,448 157,990,611 2.9%

In my test case, scanToTerm only took 3.5% of the whole search execute time, so it could only got a small speedup. I may add this test case to BasePostingsFormatTestCase, or do you have any other idea on test? I have updated the comment, please have a review.

vsop-479 avatar Sep 23 '22 06:09 vsop-479

I may add this test case to BasePostingsFormatTestCase, or do you have any other idea on test?

1M documents is too much for a unit test, I was thinking of a smaller dataset, e.g. 200 fixed-size IDs and we'd make sure that the binary search works as expected for both seekCeil and seekExact for every of these 200 terms as well as other terms that compare less than all terms from the dict, greater than all terms of the dict, or are between two terms that exist in the dict?

jpountz avatar Sep 23 '22 13:09 jpountz

200 fixed-size IDs and we'd make sure that the binary search works as expected for both seekCeil and seekExact for every of these 200 terms as well as other terms that compare less than all terms from the dict, greater than all terms of the dict, or are between two terms that exist in the dict?

Got it. I think it is a good idea for a unit test.

vsop-479 avatar Sep 23 '22 22:09 vsop-479

@jpountz I have added a test to BasePostingsFormatTestCase. Please have a review.

vsop-479 avatar Sep 26 '22 12:09 vsop-479

@jpountz Thranks for your review and suggestion. I have added a CHANGES entry and the code that assert the value of termsEnum.term() is correct after seeking. Please have a review.

vsop-479 avatar Sep 28 '22 07:09 vsop-479

@vsop-479 I addressed failures related to formatting but there seems to still be a test failure.

jpountz avatar Oct 04 '22 07:10 jpountz

@jpountz Thranks for your commit. Yes, there seems to be some problem. I am trying to figure it out.

vsop-479 avatar Oct 04 '22 08:10 vsop-479

@jpountz I have fixed the code and passed the tests locally. Please have a review.

vsop-479 avatar Oct 04 '22 14:10 vsop-479

@mikemccand You might want to have a look at this change since (I think) you are one of the most familiar ones with the original code.

jpountz avatar Oct 18 '22 09:10 jpountz

I had to revert this change because of test failures, e.g. this seed reproduces on the main branch:

gradlew test --tests TestNumericDocValuesUpdates.testSortedIndex -Dtests.seed=4C6E977E1F29E069 -Dtests.locale=khq -Dtests.timezone=Asia/Hong_Kong -Dtests.asserts=true -Dtests.file.encoding=UTF-8

It seems like this test exercises some logic that BasePostingsFormatTestCase doesn't but I haven't figured out what yet.

jpountz avatar Oct 18 '22 12:10 jpountz

@jpountz I have fixed the code and passed the test, but i can’t find the new commit in this PR (because it has been merged and reverted ?). What should i do to make the new commit shows up? Does a new PR makes sense?

vsop-479 avatar Oct 28 '22 10:10 vsop-479

A new PR would be great, thank you for looking into these failures!

jpountz avatar Oct 28 '22 11:10 jpountz