performance: compression for short strings
Summary

From my test, hits Q22 is slow and reads more data than Snowflake. SQL:
SELECT SearchPhrase, MIN(URL), COUNT(*) AS c FROM hits WHERE URL LIKE '%google%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
Databend:
Scan: 2.6G

Snowflake:
Scan: 1.6G

References:
- For short URL strings, DuckDB has made some improvements: Lightweight Compression in DuckDB.
- smaz is worth a try: https://docs.rs/smaz/latest/smaz/ (a Rust usage sketch follows the examples below). Example compression ratios for smaz:
This is a small string compressed by 50%
foobar compressed by 34%
the end compressed by 58%
not-a-g00d-Exampl333 enlarged by 15%
Smaz is a simple compression library compressed by 39%
Nothing is more difficult, and therefore more precious, than to be able to decide compressed by 49%
this is an example of what works very well with smaz compressed by 49%
1000 numbers 2000 will 10 20 30 compress very little compressed by 10%
and now a few italian sentences: compressed by 41%
Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura compressed by 33%
Mi illumino di immenso compressed by 37%
L'autore di questa libreria vive in Sicilia compressed by 28%
try it against urls compressed by 37%
http://google.com compressed by 59%
http://programming.reddit.com compressed by 52%
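As a quick experiment, here is a minimal Rust sketch of measuring smaz on short URL-like strings. It assumes the smaz crate exposes compress and decompress roughly as documented on docs.rs; treat the exact signatures as an assumption and check the crate docs first.

```rust
// Minimal sketch: measure smaz's compression ratio on short strings such as URLs.
// Assumption: the `smaz` crate exposes `compress(&[u8]) -> Vec<u8>` and
// `decompress(&[u8]) -> Result<Vec<u8>, _>`, as described on docs.rs/smaz.
fn main() {
    let samples = [
        "http://google.com",
        "http://programming.reddit.com",
        "this is an example of what works very well with smaz",
    ];

    for s in samples {
        let compressed = smaz::compress(s.as_bytes());
        // Round-trip to confirm the codec is lossless for this input.
        let restored = smaz::decompress(&compressed).expect("decompress failed");
        assert_eq!(restored, s.as_bytes());

        let saved = 100.0 - 100.0 * compressed.len() as f64 / s.len() as f64;
        println!("{:>2} -> {:>2} bytes ({:.0}% smaller): {}", s.len(), compressed.len(), saved, s);
    }
}
```

Any real gain on the hits URL/SearchPhrase columns would of course have to be measured against the existing lz4/zstd block compression, not in isolation.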
Hi BohuTANG, I'd like to try to optimize this issue.
I deployed a standalone Databend, created the hits table, and loaded the data with copy into hits from 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz' FILE_FORMAT=(type='TSV' compression=AUTO);
But I got an error. How can I solve this problem?

@ethzx Try loading the data using streaming load; I did not test COPY INTO from GZ-compressed files.
Also, the original dataset is quite large; you can follow this guide: https://databend.rs/doc/use-cases/analyze-hits-dataset-with-databend
BTW, I do not think it's an easy task.
@ethzx
This error is due to your network and the hits dataset server.
You can try the following:
- Aliyun ECS (ap-south region)
- AWS EC2
Hello, is this still a problem after btrblocks?
We don't have any improvement in the parquet format.
Hi, I found that Parquet also supports lightweight codecs (like RLE, delta, ...), but Databend only uses Encoding::PLAIN in blocks_to_parquet because of performance issues (https://github.com/datafuselabs/databend/pull/9412).
I'm not sure if the same performance issue exists with the native format. If no problems are found in the native format, maybe we should also use btrblocks in the parquet format to choose a more suitable codec for different data distributions?
Hi.
> Databend only uses Encoding::PLAIN in blocks_to_parquet because of performance issues (https://github.com/datafuselabs/databend/pull/9412)
Yes, we found that plain encoding is enough on S3 because we already have a general-purpose compressor (lz4, zstd); plain encoding saves the extra CPU cost.
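To make the trade-off concrete, here is a hedged sketch of the two configurations being discussed, using the arrow-rs parquet crate purely for illustration (Databend's blocks_to_parquet has its own writer, so the exact API here is an assumption): plain encoding plus a general-purpose page compressor, versus letting the writer apply dictionary/lightweight encodings.

```rust
// Illustrative only: contrasts "PLAIN encoding + zstd compression" with
// "dictionary/lightweight encodings + zstd". Uses the arrow-rs `parquet`
// crate for the sketch; Databend's writer may configure this differently.
use parquet::basic::{Compression, Encoding, ZstdLevel};
use parquet::file::properties::WriterProperties;

fn main() {
    // What the thread describes: skip per-column lightweight encodings and
    // rely on a general-purpose compressor, saving encode/decode CPU.
    let plain_plus_zstd = WriterProperties::builder()
        .set_dictionary_enabled(false)
        .set_encoding(Encoding::PLAIN)
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(3).unwrap()))
        .build();

    // The alternative being asked about: let the writer pick dictionary (and
    // other lightweight) encodings, trading CPU for a smaller scan size.
    let lightweight_plus_zstd = WriterProperties::builder()
        .set_dictionary_enabled(true)
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(3).unwrap()))
        .build();

    // Either value would be passed to a writer, e.g. parquet::arrow::ArrowWriter.
    let _ = (plain_plus_zstd, lightweight_plus_zstd);
}
```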
> I'm not sure if the same performance issue exists with the native format?
The native format is experimental, so we can keep improving it. Currently, the native format's deserialization is faster than Parquet's. On S3 storage, an encoder with a slightly higher compression ratio does not seem to be the first thing to optimize: S3 has high IO throughput, so reading a bit more data does not matter much, and we already merge many small IOs into one large IO with merge IO.
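The merge-IO point is a general range-coalescing idea; here is a minimal sketch of it (names, types, and the gap threshold are illustrative assumptions, not Databend's actual code): adjacent or nearly adjacent column reads are folded into one larger S3 range request.

```rust
// Hedged sketch of "merge io": coalesce many small column reads into fewer,
// larger S3 range requests when the gap between them is small enough.
#[derive(Debug, Clone, Copy)]
struct Range {
    start: u64,
    end: u64, // exclusive
}

/// Merge ranges whose gap is at most `max_gap` bytes. Reading a few extra
/// bytes is cheap on S3 compared with issuing another request.
fn merge_ranges(mut ranges: Vec<Range>, max_gap: u64) -> Vec<Range> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range> = Vec::new();
    for r in ranges {
        match merged.last_mut() {
            Some(last) if r.start <= last.end + max_gap => {
                last.end = last.end.max(r.end);
            }
            _ => merged.push(r),
        }
    }
    merged
}

fn main() {
    let reads = vec![
        Range { start: 0, end: 4_096 },
        Range { start: 5_000, end: 9_000 },          // small gap: merged
        Range { start: 1_000_000, end: 1_100_000 },  // far away: separate request
    ];
    for r in merge_ranges(reads, 64 * 1024) {
        println!("GET bytes={}-{}", r.start, r.end - 1);
    }
}
```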
The Parquet format has its own standard; we can't modify it with incompatibilities the way btrblocks does, otherwise we would end up with a new format. As the paper describes:
> Update the standard or create a new format? Yet, improving existing widespread formats such as Parquet is more desirable than creating a new data format: For users, there would be no costly data migration, no breaking changes and fast decompression just by updating a library version. Unfortunately, our experiments indicate that low-level improvements are not enough, and integrating larger parts of BtrBlocks (such as new encodings and cascading compression) into Parquet will cause version incompatibilities. Such a "Parquet v3" would not share much with the original besides the name, with no actual benefit to existing users of Parquet. Instead, we have open-sourced BtrBlocks and hope that compatible improvements will find their way into Parquet, while also building a new format based on BtrBlocks that is independent of Parquet.
Thanks for your reply!
So we don't need to pay too much attention to the Parquet format issue and can just wait for the native format to be ready for production, do I understand correctly?