Adaptive compression method for various columns

Let's create a task for an adaptive compression method for various columns.
Originally posted by @sundy-li in https://github.com/datafuselabs/databend/issues/5174#issuecomment-1120441514
To start further discussion, here is my rough initial thought: add a byte to each block that identifies its compression algorithm, and try to find the best compression algorithm for that block.

In the worst case, no algorithm can compress the block, and each block then costs at most 1 extra byte.
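To make that concrete, here is a minimal sketch of the tag-byte idea. The names (`CompressionTag`, `write_block`, `read_block`) are hypothetical, not the actual fuse block format:

```rust
// A minimal sketch (hypothetical names): one tag byte prepended to each
// block's payload identifies the compression algorithm used for it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum CompressionTag {
    None = 0,
    Lz4 = 1,
    Snappy = 2,
    Zstd = 3,
}

impl CompressionTag {
    fn from_byte(b: u8) -> Option<Self> {
        match b {
            0 => Some(Self::None),
            1 => Some(Self::Lz4),
            2 => Some(Self::Snappy),
            3 => Some(Self::Zstd),
            _ => None,
        }
    }
}

/// Prepend the tag byte to the (possibly uncompressed) payload.
/// If no codec helps, the tag is `None` and the overhead is exactly one byte.
fn write_block(tag: CompressionTag, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + payload.len());
    out.push(tag as u8);
    out.extend_from_slice(payload);
    out
}

/// Split the tag byte off again on the read path and dispatch on it.
fn read_block(block: &[u8]) -> Option<(CompressionTag, &[u8])> {
    let (&tag, rest) = block.split_first()?;
    Some((CompressionTag::from_byte(tag)?, rest))
}
```

The worst case is exactly the one mentioned above: a block that no codec can shrink pays only the single tag byte.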
How to find the best algorithm: I believe there might be a smarter way, but the only solution I have been able to find is to compress sample data with a set of candidate algorithms and pick the one that compresses best.
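A rough sketch of that trial-compression approach; the `Codec` trait below is a stand-in for whatever wrappers around LZ4/ZSTD/etc. would actually be used, and the sample size is an arbitrary placeholder:

```rust
// Sketch of "compress a sample with a set of algorithms and pick the best".
// `Codec` is a stand-in trait; a real implementation would wrap LZ4/ZSTD/etc.
trait Codec {
    fn compress(&self, data: &[u8]) -> Vec<u8>;
}

/// Compress only a small prefix of the block as a sample and keep the codec
/// with the smallest output. Returns `None` if nothing actually saves space,
/// in which case the block would be stored uncompressed.
fn pick_codec<'a>(candidates: &'a [Box<dyn Codec>], block: &[u8]) -> Option<&'a dyn Codec> {
    const SAMPLE_LEN: usize = 64 * 1024; // arbitrary sample size for the sketch
    let sample = &block[..block.len().min(SAMPLE_LEN)];

    candidates
        .iter()
        .map(|codec| (codec.as_ref(), codec.compress(sample).len()))
        .filter(|&(_, size)| size < sample.len()) // must actually shrink the sample
        .min_by_key(|&(_, size)| size)
        .map(|(codec, _)| codec)
}
```

Sampling keeps the extra CPU cost bounded, at the risk of the sample not being representative of the whole block.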
Yes, I think the best compression algorithm can be learned from the histogram of the data distribution.
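As a toy illustration only of what "learning from the distribution" could mean (the thresholds and codec names below are made up, not tuned), simple statistics such as run length and cardinality could steer the choice:

```rust
// Toy heuristic only: map simple histogram statistics of an integer column
// to a codec preference. The thresholds are illustrative, not tuned.
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
enum CodecHint {
    RunLength,      // long runs of repeated values
    Dictionary,     // few distinct values
    GeneralPurpose, // fall back to LZ4/ZSTD-style compression
}

fn suggest_codec(values: &[i64]) -> CodecHint {
    if values.is_empty() {
        return CodecHint::GeneralPurpose;
    }

    // Count runs of equal adjacent values and build a value histogram.
    let mut runs = 1usize;
    for pair in values.windows(2) {
        if pair[0] != pair[1] {
            runs += 1;
        }
    }
    let mut histogram: HashMap<i64, usize> = HashMap::new();
    for &v in values {
        *histogram.entry(v).or_insert(0) += 1;
    }

    if runs * 10 <= values.len() {
        // Average run length >= 10: run-length encoding should do well.
        CodecHint::RunLength
    } else if histogram.len() * 20 <= values.len() {
        // Fewer than 5% distinct values: dictionary encoding looks attractive.
        CodecHint::Dictionary
    } else {
        CodecHint::GeneralPurpose
    }
}
```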
Let's split this task into two stages:

- Users can manually specify the compression method in SQL
- A background process trains and learns the best compression method
> Users can manually specify the compression method in SQL

I don't think this is a good idea for Databend; zero manual configuration is our design goal.
> A background process trains and learns the best compression method

Great.
The point to note is that Databend is a cloud data warehouse: storage (like S3) is effectively unlimited and cheap. The user's main concern is probably not the data size but the performance.
> Users can manually specify the compression method in SQL
Maybe we should also take the encoding (or the specialized codecs ClickHouse has) into account as well, so that we can define a column like this:

`col DateTime Codec(RLE, LZ4) ...`

Currently, the column encoding is PLAIN: https://github.com/datafuselabs/databend/blob/eb9f43c9925c1f0c90e34fb91db02c253552c9f1/query/src/storages/fuse/io/write/block_writer.rs#L81-L92

(The comments in that code are outdated; parquet2 is being actively improved.)
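Purely as an illustration of how the encoding choice could sit next to the compression choice (the type and field names below are hypothetical, not Databend's actual types), a per-column codec could be described as an encoding step plus a compression step, with a type-driven default that the background training task could later override:

```rust
// Illustrative only: a per-column codec described as an encoding step plus a
// compression step, mirroring the `Codec(RLE, LZ4)` style above. The names
// here are hypothetical, not Databend's actual types.
#[derive(Clone, Copy, Debug)]
enum Encoding {
    Plain,
    Rle,
    DeltaBinaryPacked,
    Dictionary,
}

#[derive(Clone, Copy, Debug)]
enum Compression {
    None,
    Lz4,
    Zstd,
}

#[derive(Clone, Copy, Debug)]
struct ColumnCodec {
    encoding: Encoding,
    compression: Compression,
}

/// A simple type-driven default, standing in for what the background
/// training task could later override per column.
fn default_codec_for(type_name: &str) -> ColumnCodec {
    match type_name {
        // Timestamps are often near-monotonic, so delta encoding helps.
        "DateTime" | "Timestamp" => ColumnCodec {
            encoding: Encoding::DeltaBinaryPacked,
            compression: Compression::Lz4,
        },
        // Low-cardinality strings benefit from dictionary encoding.
        "String" => ColumnCodec {
            encoding: Encoding::Dictionary,
            compression: Compression::Zstd,
        },
        _ => ColumnCodec {
            encoding: Encoding::Plain,
            compression: Compression::Lz4,
        },
    }
}
```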