doris
[enhancement](histogram) optimize the histogram bucketing strategy, etc
Proposed changes
An "equal height" histogram ensures that the frequencies of the values in each bucket sum to 1/N of the total number of rows, where N is the number of buckets.
However, strictly adhering to "equal height" bucketing can split a single value across a bucket boundary, so the same value appears in two different buckets. This interferes with selectivity estimation.
So this PR modifies the bucketing algorithm as follows:
If adding a value to a bucket would push the bucket's row count above 1/N of the total number of rows, the value is placed either in that bucket or in the next one, whichever leaves the current bucket's size closer to 1/N.
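As a rough illustration of the rule above, here is a minimal Python sketch. It is not the code from this PR: the function name `build_buckets`, the `(lower, upper, rows)` tuple layout, and the tie-breaking choice (on a tie, the value starts the next bucket) are all assumptions made for the example. It also does not cap the number of buckets at N, which a real implementation would likely need to handle.

```python
from collections import Counter


def build_buckets(values, num_buckets):
    """Hypothetical sketch: group values into roughly equal-height buckets
    without ever splitting one distinct value across two buckets.

    Returns a list of (lower, upper, rows) tuples.
    """
    total = len(values)
    if total == 0 or num_buckets <= 0:
        return []

    target = total / num_buckets            # ideal rows per bucket (1/N of total)
    counts = sorted(Counter(values).items())  # (value, frequency), ascending

    buckets = []
    current, current_rows = [], 0
    for value, freq in counts:
        new_rows = current_rows + freq
        if current and new_rows > target:
            undershoot = target - current_rows  # gap if we close without the value
            overshoot = new_rows - target       # excess if we include the value
            if undershoot <= overshoot:
                # Closing now is at least as close to the target:
                # the value starts the next bucket (assumed tie-break).
                buckets.append((current[0], current[-1], current_rows))
                current, current_rows = [value], freq
            else:
                # Including the value is closer to the target:
                # add it, then close the bucket.
                current.append(value)
                buckets.append((current[0], current[-1], new_rows))
                current, current_rows = [], 0
        else:
            current.append(value)
            current_rows = new_rows

    if current:
        buckets.append((current[0], current[-1], current_rows))
    return buckets
```

For example, `build_buckets([1, 1, 2, 2, 2, 3, 3, 3], 2)` has a target of 4 rows per bucket. Adding value 2 to the first bucket overshoots by 1, while leaving it out undershoots by 2, so 2 stays in the first bucket and no value straddles a boundary.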
Others:
- Optimize the time-consuming parts of the operator to improve performance.
- Remove the sampling parameter from the operator (sampling can be done with SAMPLE in the query).
Issue Number: close #xxx
Checklist(Required)
- [ ] Does it affect the original behavior
- [x] Has unit tests been added
- [x] Has document been added or modified
- [ ] Does it need to update dependencies
- [ ] Does this PR support rollback (If NO, please explain WHY)
Further comments
If this is a relatively large or complex change, kick off the discussion on the dev mailing list by explaining why you chose the solution you did and what alternatives you considered, etc...
run buildall
LGTM
PR approved by anyone and no changes requested.
We'd better persist the histogram's buckets in a structure other than VARCHAR; the JSON string holding the bucket info is very likely to exceed 65533 characters.