Adaptive compression method for various columns

Let's create a task for an adaptive compression method for various columns.
Originally posted by @sundy-li in https://github.com/datafuselabs/databend/issues/5174#issuecomment-1120441514
To start further discussion, here is my rough initial thought: add a byte to each block that identifies its compression algorithm, and try to find the best compression algorithm for that block.

In the worst case, no algorithm can compress the block, and each block then costs at most 1 extra byte.
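To make that concrete, here is a minimal sketch of the tag-byte idea. The names (`CompressionTag`, `write_block`, `read_block`) are hypothetical, not the actual fuse block format:

```rust
// A minimal sketch (hypothetical names): one tag byte prepended to each
// block's payload identifies the compression algorithm used for it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum CompressionTag {
    None = 0,
    Lz4 = 1,
    Snappy = 2,
    Zstd = 3,
}

impl CompressionTag {
    fn from_byte(b: u8) -> Option<Self> {
        match b {
            0 => Some(Self::None),
            1 => Some(Self::Lz4),
            2 => Some(Self::Snappy),
            3 => Some(Self::Zstd),
            _ => None,
        }
    }
}

/// Prepend the tag byte to the (possibly uncompressed) payload.
/// If no codec helps, the tag is `None` and the overhead is exactly one byte.
fn write_block(tag: CompressionTag, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + payload.len());
    out.push(tag as u8);
    out.extend_from_slice(payload);
    out
}

/// Split the tag byte off again on the read path and dispatch on it.
fn read_block(block: &[u8]) -> Option<(CompressionTag, &[u8])> {
    let (&tag, rest) = block.split_first()?;
    Some((CompressionTag::from_byte(tag)?, rest))
}
```

The worst case is exactly the one mentioned above: a block that no codec can shrink pays only the single tag byte.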
How to find the best algorithm: I believe there might be a smarter way, but the only solution I have been able to find is to compress sample data with a set of candidate algorithms and pick the one that compresses best.
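A rough sketch of that trial-compression approach; the `Codec` trait below is a stand-in for whatever wrappers around LZ4/ZSTD/etc. would actually be used, and the sample size is an arbitrary placeholder:

```rust
// Sketch of "compress a sample with a set of algorithms and pick the best".
// `Codec` is a stand-in trait; a real implementation would wrap LZ4/ZSTD/etc.
trait Codec {
    fn compress(&self, data: &[u8]) -> Vec<u8>;
}

/// Compress only a small prefix of the block as a sample and keep the codec
/// with the smallest output. Returns `None` if nothing actually saves space,
/// in which case the block would be stored uncompressed.
fn pick_codec<'a>(candidates: &'a [Box<dyn Codec>], block: &[u8]) -> Option<&'a dyn Codec> {
    const SAMPLE_LEN: usize = 64 * 1024; // arbitrary sample size for the sketch
    let sample = &block[..block.len().min(SAMPLE_LEN)];

    candidates
        .iter()
        .map(|codec| (codec.as_ref(), codec.compress(sample).len()))
        .filter(|&(_, size)| size < sample.len()) // must actually shrink the sample
        .min_by_key(|&(_, size)| size)
        .map(|(codec, _)| codec)
}
```

Sampling keeps the extra CPU cost bounded, at the risk of the sample not being representative of the whole block.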
Yes, I think the best compression algorithm can be learned from the histogram of the data distribution.
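As a toy illustration only of what "learning from the distribution" could mean (the thresholds and codec names below are made up, not tuned), simple statistics such as run length and cardinality could steer the choice:

```rust
// Toy heuristic only: map simple histogram statistics of an integer column
// to a codec preference. The thresholds are illustrative, not tuned.
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
enum CodecHint {
    RunLength,      // long runs of repeated values
    Dictionary,     // few distinct values
    GeneralPurpose, // fall back to LZ4/ZSTD-style compression
}

fn suggest_codec(values: &[i64]) -> CodecHint {
    if values.is_empty() {
        return CodecHint::GeneralPurpose;
    }

    // Count runs of equal adjacent values and build a value histogram.
    let mut runs = 1usize;
    for pair in values.windows(2) {
        if pair[0] != pair[1] {
            runs += 1;
        }
    }
    let mut histogram: HashMap<i64, usize> = HashMap::new();
    for &v in values {
        *histogram.entry(v).or_insert(0) += 1;
    }

    if runs * 10 <= values.len() {
        // Average run length >= 10: run-length encoding should do well.
        CodecHint::RunLength
    } else if histogram.len() * 20 <= values.len() {
        // Fewer than 5% distinct values: dictionary encoding looks attractive.
        CodecHint::Dictionary
    } else {
        CodecHint::GeneralPurpose
    }
}
```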
Let's split this task into two stages:

- Users can manually specify the compression method in SQL
- A background process trains and learns the best compression method
> Users can manually specify the compression method in SQL

I don't think this is a good idea for Databend; zero manual configuration is our design goal.
> A background process trains and learns the best compression method

Great.
The point to note is that Databend is a cloud data warehouse: storage (like S3) is effectively unlimited and cheap. The user's main concern is probably not the data size but the performance.
> Users can manually specify the compression method in SQL
Maybe we should also take the encoding (or the specialized codecs ClickHouse has) into account as well, so that we can define a column like this:

`col DateTime Codec(RLE, LZ4) ...`

Currently, the column encoding is PLAIN: https://github.com/datafuselabs/databend/blob/eb9f43c9925c1f0c90e34fb91db02c253552c9f1/query/src/storages/fuse/io/write/block_writer.rs#L81-L92

(The comments in that code are outdated; parquet2 is being actively improved.)
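Purely as an illustration of how the encoding choice could sit next to the compression choice (the type and field names below are hypothetical, not Databend's actual types), a per-column codec could be described as an encoding step plus a compression step, with a type-driven default that the background training task could later override:

```rust
// Illustrative only: a per-column codec described as an encoding step plus a
// compression step, mirroring the `Codec(RLE, LZ4)` style above. The names
// here are hypothetical, not Databend's actual types.
#[derive(Clone, Copy, Debug)]
enum Encoding {
    Plain,
    Rle,
    DeltaBinaryPacked,
    Dictionary,
}

#[derive(Clone, Copy, Debug)]
enum Compression {
    None,
    Lz4,
    Zstd,
}

#[derive(Clone, Copy, Debug)]
struct ColumnCodec {
    encoding: Encoding,
    compression: Compression,
}

/// A simple type-driven default, standing in for what the background
/// training task could later override per column.
fn default_codec_for(type_name: &str) -> ColumnCodec {
    match type_name {
        // Timestamps are often near-monotonic, so delta encoding helps.
        "DateTime" | "Timestamp" => ColumnCodec {
            encoding: Encoding::DeltaBinaryPacked,
            compression: Compression::Lz4,
        },
        // Low-cardinality strings benefit from dictionary encoding.
        "String" => ColumnCodec {
            encoding: Encoding::Dictionary,
            compression: Compression::Zstd,
        },
        _ => ColumnCodec {
            encoding: Encoding::Plain,
            compression: Compression::Lz4,
        },
    }
}
```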