[enhancement](histogram) optimize the histogram bucketing strategy, etc

Open weizhengte opened this issue 1 year ago • 1 comments

Proposed changes

An "equal height" histogram ensures that the frequencies of the values in each bucket sum to 1/N of the total number of rows, where N is the number of buckets.

However, adhering strictly to the "equal height" principle can place a bucket boundary in the middle of a run of identical values, so the same value appears in two different buckets. This interferes with selectivity estimation.
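To illustrate the problem, here is a minimal Python sketch of strict equal-height bucketing. It is not the operator's actual code; the function name `strict_equal_height` and the sample data are made up for illustration.

```python
def strict_equal_height(sorted_values, num_buckets):
    """Cut a sorted list into num_buckets chunks of (nearly) equal row count.

    Hypothetical helper: boundaries are chosen purely by row position,
    so a run of identical values can be split across buckets.
    """
    total = len(sorted_values)
    buckets = []
    for i in range(num_buckets):
        start = i * total // num_buckets
        end = (i + 1) * total // num_buckets
        if start < end:
            buckets.append(sorted_values[start:end])
    return buckets


rows = [1, 1, 2, 2, 2, 2, 2, 3, 4]
buckets = strict_equal_height(rows, 3)
for bucket in buckets:
    print(bucket)
```

With 9 rows and 3 buckets, every bucket holds exactly 3 rows, but the value 2 lands in all three buckets, so a predicate such as `= 2` cannot be attributed to a single bucket.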

So this PR modifies the bucketing algorithm as follows:

If adding a value to a bucket would push the bucket's row count above 1/N of the total number of rows, all occurrences of the value are placed either in that bucket or in the next bucket, whichever leaves the current bucket's row count closer to 1/N.
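The rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code in this PR: the function name `value_aligned_buckets` is hypothetical, ties are resolved by starting a new bucket, and the last bucket absorbs all remaining values.

```python
from collections import Counter


def value_aligned_buckets(values, num_buckets):
    """Group distinct values into buckets of roughly total/num_buckets rows.

    Each bucket is a list of (value, count) pairs; a distinct value is
    never split across two buckets.
    """
    target = len(values) / num_buckets  # ideal rows per bucket (1/N of total)
    buckets = []
    current, size = [], 0
    for value, count in sorted(Counter(values).items()):
        is_last = len(buckets) == num_buckets - 1
        if current and not is_last and size + count > target:
            overshoot = size + count - target  # distance if the value joins
            undershoot = target - size         # distance if the bucket closes
            if undershoot <= overshoot:
                # Assumption: on a tie, the value goes to the next bucket.
                buckets.append(current)
                current, size = [], 0
        current.append((value, count))
        size += count
    if current:
        buckets.append(current)
    return buckets


rows = [1, 1, 2, 2, 2, 2, 2, 3, 4]
print(value_aligned_buckets(rows, 3))
```

For the sample rows the target is 3 rows per bucket. The five occurrences of 2 stay together in their own bucket, so bucket heights are no longer exactly equal, but each value maps to exactly one bucket.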

Others:

Optimized the time-consuming part of the operator to improve performance.
Removed the sampling parameter from the operator (sampling can be done with SAMPLE in the query instead).

Issue Number: close #xxx

Problem summary

Describe your changes.

Checklist(Required)

[ ] Does it affect the original behavior
[x] Have unit tests been added
[x] Has documentation been added or modified
[ ] Does it need to update dependencies
[ ] Does this PR support rollback (If NO, please explain WHY)

Further comments

If this is a relatively large or complex change, kick off the discussion at [email protected] by explaining why you chose the solution you did and what alternatives you considered, etc...

Feb 28 '23 18:02 weizhengte

run buildall

Mar 04 '23 03:03 weizhengte

LGTM

Mar 08 '23 06:03 Kikyou1997

PR approved by anyone and no changes requested.

Mar 08 '23 06:03 github-actions[bot]

We'd better persist the histogram's buckets in a structure other than VARCHAR; the JSON string holding the bucket info is very likely to exceed 65533 characters.

Mar 08 '23 11:03 Kikyou1997
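A rough way to check the size concern raised in the last comment is to serialize a synthetic bucket list and compare its length with the 65533-character limit. The JSON field names (`lower`, `upper`, `count`) and the bucket count below are assumptions for illustration, not the format this PR actually stores.

```python
import json

VARCHAR_LIMIT = 65533  # limit quoted in the review comment


def buckets_json_length(num_buckets, bound_length):
    """Length of a JSON string for num_buckets buckets with string bounds.

    Hypothetical layout: each bucket stores a lower bound, an upper bound,
    and a row count; bound_length is the length of each string bound.
    """
    buckets = [
        {"lower": "a" * bound_length, "upper": "b" * bound_length, "count": 1000}
        for _ in range(num_buckets)
    ]
    return len(json.dumps({"buckets": buckets}))


for bound_length in (8, 256):
    length = buckets_json_length(128, bound_length)
    print(bound_length, length, length > VARCHAR_LIMIT)
```

Under these assumptions, short bounds fit comfortably, while 128 buckets with 256-character string bounds already exceed the limit, since the bounds alone account for 128 × 512 = 65536 characters.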
