databend Feature: histogram aggregate function

Summary

Description for this feature.

To support histogram stats in analyze table query in #14587 , we need to support histogram aggregate function now.

Example:

histogram(number_of_bins)(values)

Reference to:

Feb 04 '24 09:02 sundy-li

Hi I'm new here, I am interested in this issue, can you assign this issue to me?

Feb 05 '24 07:02 suimenno3002

After searching for information, I have some ideas about how to implement this issue.

Based on requirements, I think the results of this query may be used to provide information to the optimizer, so the skew problem needs to be avoided, and some type of accuracy is required.

For reference [1], its code implementation is located at here (A Streaming Parallel Decision Tree Algorithm, in quick reading ,finally, I think this is an approximate estimation algorithm that can be used to give users an intuitive feel for the data distribution.

pros:

less memory usage, less state saving and transfer.
can be updated incrementally

cons:

imprecise
cannot avoid the skew problem

For reference [2], its code implementation is at here, refers to the MySQL implementation of the equal-height histogram algorithm. If we want to feed the results into the optimizer, I think this is what we want.

pros:

accurate
avoid the skew problem to a certain extent

cons:

more memory usage, more state saving and transfer.

Of course, equal-height histogram also has skew problems sometimes. For example, PostgreSQL and PolarDB use compressed histogram to obtain better results. However, for the implementation complexity, I think we can implement equal-height histogram first, and if we want to solve extreme skew problem in the future, we can implement compressed histogram based on equal-height histogram.

So I will refer to the implementation of Doris to implement this function, and refer to the existing Aggregate Function Feature PR to think about how to implement it. As a newcomer to databases domain and Databend, I have the following questions and want to get some help:

Are my above analysis of this issue and implementation ideas correct?
I’m not sure whether Databend already has a ready-made equal-height histogram implementation.
If we choose to implement equal-height histogram, one thing I don’t understand is the final process of building equal-height histogram based on Singleton Histogram. I think that it is completed in one instance. If the table is too large, will it cause OOM? I cannot find the sampling-related options about histogram from the Doris documentation, or maybe we can consider this in the future.

Feb 06 '24 18:02 suimenno3002

@suimenno3002 Great summary. Your analysis is almost correct. I think it's better to have equal-height histogram which avoids the skew problem and accurate results.

We don't have this implementation now. For singleton histogram OOM problem due to large state of the Map<T, int>, we can improve this later (compression or sampling).

cc @leiysky for more suggestions.

Feb 07 '24 01:02 sundy-li

@suimenno3002 are you working on this issue? If not we will assign it to our inner developer to have this issue because we need this feature in recent updates.

Mar 04 '24 07:03 sundy-li

@sundy-li sorry for making you worry, yes i am working on it, is it too late to submit a PR this weekend? In addition, does this version need to support string type aggregation?

Mar 04 '24 13:03 suimenno3002

Yes, you can submit a draft pr indicates that you are working on this.

does this version need to support string type aggregate?

Yes. For numeric and string type, which covered by RangeIndex::supported_type function

Mar 04 '24 13:03 sundy-li

Got it, I'll submit the draft pr tomorrow morning.

Mar 04 '24 14:03 suimenno3002

I believe this issue can be closed, or is there something else left to do? aggregate func was merged in #14839

Aug 17 '24 09:08 wzslr321

databend databend copied to clipboard

Feature: histogram aggregate function

databend
databend copied to clipboard