
[Date Histogram] Investigate the safe number of buckets for which filter rewrite optimization can be applied

[Open] bowenlan-amzn opened this issue 9 months ago

Follow-up task for #13317

The idea of the filter rewrite optimization is to use the index structure, instead of iterating over documents, to compute the bucket results. We can determine how many buckets there will be before the actual aggregation execution logic begins.

As the bucket count increases, or the number of documents to aggregate decreases, the iterative method may become faster and the filter rewrite method slower. Currently we have a cluster setting that defines the supported bucket count, but a fixed limit may not always work. For example, if the dataset has only 3k distinct values and the aggregation query asks for 1024 buckets, that limit is too high and filter rewrite wouldn't beat plain iteration; on the other hand, if the dataset has 100k distinct values, we could probably support more than 1024 buckets. A sketch of what a dynamic rule could look like follows below.
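
As one illustration of such a rule (the method name and the cutoff factor here are hypothetical, purely to make the trade-off concrete, not the actual OpenSearch implementation):

```java
// Hypothetical decision sketch: combine the static cluster-setting cap
// with a dynamic documents-per-bucket guard. The factor of 32 is a
// placeholder that would need to be tuned empirically.
static boolean shouldUseFilterRewrite(long bucketCount, long matchingDocs, long maxBucketsSetting) {
    if (bucketCount > maxBucketsSetting) {
        return false; // existing static guard from the cluster setting
    }
    // If each bucket only covers a handful of documents, the per-bucket
    // range-traversal overhead can exceed simply iterating the documents.
    long docsPerBucket = matchingDocs / Math.max(1, bucketCount);
    return docsPerBucket >= 32;
}
```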

This task is to investigate rules for deciding dynamically, based on the dataset or the index, whether the optimization should be used.

The biggest source of overhead is normally reading the values from documents. The BKD index structure stores all documents in leaf nodes, and a leaf node only needs to be traversed when it intersects the query. One idea here is to do a dummy traversal of the BKD tree to determine how many leaf nodes will be intersected and how many inner nodes will be skipped; from these two numbers we can get a relatively accurate estimate of the cost of a given range query.
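
A minimal sketch of what that dummy traversal could look like, assuming Lucene's `PointValues` API (whose `estimatePointCount` already walks the tree using only `compare`, never reading leaf documents) and single-dimension packed values. `RangeCostVisitor` and its counters are a hypothetical illustration, not the actual implementation:

```java
import java.util.Arrays;

import org.apache.lucene.index.PointValues.IntersectVisitor;
import org.apache.lucene.index.PointValues.Relation;

// Hypothetical visitor that counts how many BKD cells fall fully inside
// the range (cheap: their doc counts are taken without reading docs)
// versus cells that cross the range boundary (expensive: their leaf
// documents must be read and filtered one by one).
final class RangeCostVisitor implements IntersectVisitor {
    private final byte[] lower, upper; // packed single-dimension range bounds
    long insideCells;
    long crossingCells;

    RangeCostVisitor(byte[] lower, byte[] upper) {
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    public void visit(int docID) {
        // Not called during estimatePointCount; only compare() is used.
    }

    @Override
    public void visit(int docID, byte[] packedValue) {
        // Not called during estimatePointCount.
    }

    @Override
    public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
        if (Arrays.compareUnsigned(maxPackedValue, lower) < 0
                || Arrays.compareUnsigned(minPackedValue, upper) > 0) {
            return Relation.CELL_OUTSIDE_QUERY; // subtree skipped entirely
        }
        if (Arrays.compareUnsigned(minPackedValue, lower) >= 0
                && Arrays.compareUnsigned(maxPackedValue, upper) <= 0) {
            insideCells++;
            return Relation.CELL_INSIDE_QUERY;  // counted without visiting docs
        }
        crossingCells++;
        return Relation.CELL_CROSSES_QUERY;     // leaf docs would be read
    }
}
```

Running `pointValues.estimatePointCount(visitor)` once per candidate range would populate `insideCells` and `crossingCells`; a high crossing-to-inside ratio would signal that the range query is expensive and plain iteration may be preferable.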

bowenlan-amzn · May 06 '24 02:05

[Triage] @bowenlan-amzn Thanks for filing.

andrross · May 08 '24 15:05

Closing because new issue #14438 will include this.

bowenlan-amzn · Jun 18 '24 19:06