lucene
lucene copied to clipboard
Binary search the entries when all suffixes have the same length in a leaf block.
In scanToTerm phase, binary search may have a better performance when all suffixes have the same length in a leaf block.
@jpountz Thanks for your review. I did a simple performance test, which indexed 1M random UUID's substring(2, 8), got 10 segments, and picked up 1K terms to search. Here is a average result of 4 times tests:
Method took:
baseline(scanToTermLeaf)ns | candidate(binarySearchTermLeaf)ns | speedup |
---|---|---|
5,757,121.5 | 4,761,325.5 | 20.9% |
Whole search took:
baseline(scan)ns | candidate(binarySearch)ns | speedup |
---|---|---|
162,668,448 | 157,990,611 | 2.9% |
In my test case, scanToTerm only took 3.5% of the whole search execute time, so it could only got a small speedup. I may add this test case to BasePostingsFormatTestCase, or do you have any other idea on test? I have updated the comment, please have a review.
I may add this test case to BasePostingsFormatTestCase, or do you have any other idea on test?
1M documents is too much for a unit test, I was thinking of a smaller dataset, e.g. 200 fixed-size IDs and we'd make sure that the binary search works as expected for both seekCeil
and seekExact
for every of these 200 terms as well as other terms that compare less than all terms from the dict, greater than all terms of the dict, or are between two terms that exist in the dict?
200 fixed-size IDs and we'd make sure that the binary search works as expected for both
seekCeil
andseekExact
for every of these 200 terms as well as other terms that compare less than all terms from the dict, greater than all terms of the dict, or are between two terms that exist in the dict?
Got it. I think it is a good idea for a unit test.
@jpountz I have added a test to BasePostingsFormatTestCase. Please have a review.
@jpountz Thranks for your review and suggestion. I have added a CHANGES entry and the code that assert the value of termsEnum.term() is correct after seeking. Please have a review.
@vsop-479 I addressed failures related to formatting but there seems to still be a test failure.
@jpountz Thranks for your commit. Yes, there seems to be some problem. I am trying to figure it out.
@jpountz I have fixed the code and passed the tests locally. Please have a review.
@mikemccand You might want to have a look at this change since (I think) you are one of the most familiar ones with the original code.
I had to revert this change because of test failures, e.g. this seed reproduces on the main branch:
gradlew test --tests TestNumericDocValuesUpdates.testSortedIndex -Dtests.seed=4C6E977E1F29E069 -Dtests.locale=khq -Dtests.timezone=Asia/Hong_Kong -Dtests.asserts=true -Dtests.file.encoding=UTF-8
It seems like this test exercises some logic that BasePostingsFormatTestCase
doesn't but I haven't figured out what yet.
@jpountz I have fixed the code and passed the test, but i can’t find the new commit in this PR (because it has been merged and reverted ?). What should i do to make the new commit shows up? Does a new PR makes sense?
A new PR would be great, thank you for looking into these failures!