lucene
lucene copied to clipboard
Use multi-select instead of a full sort for DynamicRange creation
Resolves #13760
Description
This is using a similar approach to how Solr used to compute multiple percentiles at a single time. Basically utilize the quick select method, but instead of following a single path, follow a path for each of the k
s that is requested. Multi-quickselect.
That's what I originally made, until I realized that the DynamicRangeUtil
is weighted, so I refactored it to choose by weights instead, and also capture the running-value-total and running-weight-total, because that information is used in the DynamicRangeInfo
.
My goal was to add this as a generic capability of the Selector
(or IntroSelector
) class, but because of the limitations above, it is currently a separate class to handle this. If there's any suggestions on how to make this generic enough to be put in the generic class, that would be great. But it might not be worth the effort if it wouldn't be used anywhere else.
As for the original multi-quickSelect algorithm I mentioned, I looked for other multi-select use cases across Lucene, but I only found one instance (ScalarQuantizer
does two select calls in succession). If there's more instances we can find, I would be happy to add multiSelect as an option on the Selector
class, and implement it in all provided classes.
To-Do
- The code needs to be cleaned up and better documented, this is just a POC
- Benchmarks comparing this to the full-sorting implementation.
Caveat
The implement is slightly different, as it will pick the groups according to "The first value for which the running weight is <= weight-range-boundary". The old logic would start counting again after a weight range was complete, which removes information from the overflow of previous weight-ranges. I'm not sure either approach is right or wrong, but I wanted to explicitly state how the results would be different and why I had to alter a unit test to pass.