Feature: Trim Strings during the construction of min/max statistics

Open dantengsky opened this issue 2 years ago • 0 comments

Summary

While collecting the min/max values of columns, we keep the exact values. For columns of string-like types, the min/max values may be large (say, a column of type CHAR(4096)), which makes the meta files that contain the statistics large.

It would be better to trim the strings to some moderate length, say 8 chars, in a way that preserves the property of min/max statistics: the trimmed max must be no smaller than the untrimmed one, and the trimmed min must be no larger than the untrimmed one.

Thus, with some loss of accuracy (checks against the statistics are slightly more likely to give false positives, which IMO we can afford), the size of fuse table meta files could be reduced.
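A rough sketch of how such trimming might look, to make the bound-preserving requirement concrete. This is not the Databend implementation: `TRIM_LEN`, `trim_min`, `trim_max`, and `next_char` are hypothetical names, and the sketch assumes valid UTF-8 strings compared in code point order (the real column values may well be raw bytes). The min is simply truncated, since a prefix never sorts after the full string. The max is truncated and then its last incrementable character is bumped, so the result sorts after the original.

```rust
/// Hypothetical prefix length; the issue suggests "say 8 chars".
const TRIM_LEN: usize = 8;

/// Lower bound: a prefix of a string never compares greater than the string.
fn trim_min(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

/// Upper bound: keep the first `n` chars, then increment the last char that
/// can be incremented. Returns `None` if no shorter upper bound exists
/// (every kept char is already `char::MAX`); the caller would then keep the
/// original value or drop the statistic.
fn trim_max(s: &str, n: usize) -> Option<String> {
    if s.chars().count() <= n {
        // Already short enough, nothing to trim.
        return Some(s.to_string());
    }
    let mut prefix: Vec<char> = s.chars().take(n).collect();
    while let Some(last) = prefix.pop() {
        if let Some(next) = next_char(last) {
            prefix.push(next);
            return Some(prefix.into_iter().collect());
        }
        // `last` was char::MAX: drop it and try the previous position.
    }
    None
}

/// Next Unicode scalar value after `c`, skipping the surrogate gap.
fn next_char(c: char) -> Option<char> {
    let mut code = c as u32 + 1;
    if (0xD800..=0xDFFF).contains(&code) {
        code = 0xE000;
    }
    char::from_u32(code)
}
```

For example, with a length of 8, `trim_min("abcdefghijkl", 8)` gives `"abcdefgh"` and `trim_max("abcdefghijkl", 8)` gives `Some("abcdefgi")`, so the trimmed pair still brackets every value in the original range.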

Where the min/max values are gathered:

https://github.com/datafuselabs/databend/blob/dae90d856e380ea29716e87148cc69d07ccff8ff/src/query/storages/fuse/src/statistics/column_statistic.rs#L45-L53

dantengsky, Sep 23 '22 05:09