lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Use multi-select instead of a full sort for DynamicRange creation

Open HoustonPutman opened this issue 4 months ago • 1 comments

Resolves #13760

Description

This is using a similar approach to how Solr used to compute multiple percentiles at a single time. Basically utilize the quick select method, but instead of following a single path, follow a path for each of the ks that is requested. Multi-quickselect.

That's what I originally made, until I realized that the DynamicRangeUtil is weighted, so I refactored it to choose by weights instead, and also capture the running-value-total and running-weight-total, because that information is used in the DynamicRangeInfo.

My goal was to add this as a generic capability of the Selector (or IntroSelector) class, but because of the limitations above, it is currently a separate class to handle this. If there's any suggestions on how to make this generic enough to be put in the generic class, that would be great. But it might not be worth the effort if it wouldn't be used anywhere else.

As for the original multi-quickSelect algorithm I mentioned, I looked for other multi-select use cases across Lucene, but I only found one instance (ScalarQuantizer does two select calls in succession). If there's more instances we can find, I would be happy to add multiSelect as an option on the Selector class, and implement it in all provided classes.

To-Do

  • The code needs to be cleaned up and better documented, this is just a POC
  • Benchmarks comparing this to the full-sorting implementation.

Caveat

The implement is slightly different, as it will pick the groups according to "The first value for which the running weight is <= weight-range-boundary". The old logic would start counting again after a weight range was complete, which removes information from the overflow of previous weight-ranges. I'm not sure either approach is right or wrong, but I wanted to explicitly state how the results would be different and why I had to alter a unit test to pass.

HoustonPutman avatar Oct 15 '24 00:10 HoustonPutman