Feature: Trim Strings during the construction of min/max statistics
Summary
While collecting the min/max values of columns, we keep the exact values. For columns of string-like types, the min/max values may be large (say, for a column of type CHAR(4096)), and that makes the meta files that contain the statistics large.
It would be better if we could trim the strings to some moderate length, say 8 chars, in a way that preserves the property of min/max statistics: the trimmed max must be greater than or equal to the non-trimmed one, and the trimmed min must be less than or equal to the non-trimmed one.
Thus, at the cost of some accuracy (the widened min/max range is slightly more likely to produce false positives, which IMO we can afford), the size of fuse table meta files could be reduced.
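A minimal sketch of one way the trimming could work, not the actual Databend implementation. It assumes the values are valid UTF-8 strings compared in code point order (which is how Rust compares `str`), and the names `trim_min`, `trim_max`, and `next_char` are hypothetical. The min is simply truncated, since a prefix never sorts after the full string. The max is truncated and then its last character is bumped to the next code point so the result sorts at or after the original; if no character in the prefix can be bumped, the sketch returns `None` and the caller would keep the untrimmed value.

```rust
/// Hypothetical helper: next Unicode scalar value after `c`, if any.
fn next_char(c: char) -> Option<char> {
    let mut code = c as u32 + 1;
    // Skip the surrogate range, which holds no valid `char`.
    if (0xD800..=0xDFFF).contains(&code) {
        code = 0xE000;
    }
    // Returns None once we step past char::MAX.
    char::from_u32(code)
}

/// Trimmed lower bound: a prefix always compares <= the full string.
fn trim_min(s: &str, max_chars: usize) -> String {
    s.chars().take(max_chars).collect()
}

/// Trimmed upper bound: a string of at most `max_chars` chars that
/// compares >= `s`, or None if no such string can be built this way.
fn trim_max(s: &str, max_chars: usize) -> Option<String> {
    if s.chars().count() <= max_chars {
        // Already short enough, keep the exact value.
        return Some(s.to_string());
    }
    let mut prefix: Vec<char> = s.chars().take(max_chars).collect();
    // Walk back from the end until a char can be incremented; chars
    // equal to char::MAX are dropped because they cannot be bumped.
    while let Some(last) = prefix.pop() {
        if let Some(next) = next_char(last) {
            prefix.push(next);
            return Some(prefix.into_iter().collect());
        }
    }
    None
}
```

With a limit of 8 chars, `trim_min("abcdefghij", 8)` gives `"abcdefgh"` and `trim_max("abcdefghij", 8)` gives `Some("abcdefgi")`, so the trimmed range still encloses the original value.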
Where the min/max values are gathered:
https://github.com/datafuselabs/databend/blob/dae90d856e380ea29716e87148cc69d07ccff8ff/src/query/storages/fuse/src/statistics/column_statistic.rs#L45-L53