databend
databend copied to clipboard
Feature: histogram aggregate function
Summary
Description for this feature.
To support histogram stats in analyze table
query in #14587 , we need to support histogram aggregate function now.
Example:
histogram(number_of_bins)(values)
Reference to:
- clickhouse histogram
- doris histogram
- width_bucket equi-width histograms
Hi I'm new here, I am interested in this issue, can you assign this issue to me?
After searching for information, I have some ideas about how to implement this issue.
Based on requirements, I think the results of this query may be used to provide information to the optimizer, so the skew problem needs to be avoided, and some type of accuracy is required.
For reference [1], its code implementation is located at here (A Streaming Parallel Decision Tree Algorithm, in quick reading ,finally, I think this is an approximate estimation algorithm that can be used to give users an intuitive feel for the data distribution.
pros:
- less memory usage, less state saving and transfer.
- can be updated incrementally
cons:
- imprecise
- cannot avoid the skew problem
For reference [2], its code implementation is at here, refers to the MySQL implementation of the equal-height histogram algorithm. If we want to feed the results into the optimizer, I think this is what we want.
pros:
- accurate
- avoid the skew problem to a certain extent
cons:
- more memory usage, more state saving and transfer.
Of course, equal-height histogram also has skew problems sometimes. For example, PostgreSQL and PolarDB use compressed histogram to obtain better results. However, for the implementation complexity, I think we can implement equal-height histogram first, and if we want to solve extreme skew problem in the future, we can implement compressed histogram based on equal-height histogram.
So I will refer to the implementation of Doris to implement this function, and refer to the existing Aggregate Function Feature PR to think about how to implement it. As a newcomer to databases domain and Databend, I have the following questions and want to get some help:
-
Are my above analysis of this issue and implementation ideas correct?
-
Iโm not sure whether Databend already has a ready-made equal-height histogram implementation.
-
If we choose to implement equal-height histogram, one thing I donโt understand is the final process of building equal-height histogram based on Singleton Histogram. I think that it is completed in one instance. If the table is too large, will it cause OOM? I cannot find the sampling-related options about histogram from the Doris documentation, or maybe we can consider this in the future.
@suimenno3002 Great summary. Your analysis is almost correct. I think it's better to have equal-height histogram which avoids the skew problem and accurate results.
We don't have this implementation now. For singleton histogram OOM
problem due to large state of the Map<T, int>
, we can improve this later (compression or sampling).
cc @leiysky for more suggestions.
@suimenno3002 are you working on this issue? If not we will assign it to our inner developer to have this issue because we need this feature in recent updates.
@sundy-li sorry for making you worry, yes i am working on it, is it too late to submit a PR this weekend? In addition, does this version need to support string type aggregation?
Yes, you can submit a draft pr indicates that you are working on this.
does this version need to support string type aggregate?
Yes. For numeric and string type, which covered by RangeIndex::supported_type
function
Got it, I'll submit the draft pr tomorrow morning.
I believe this issue can be closed, or is there something else left to do? aggregate func was merged in #14839