ClickHouse
Add FSST as compression codec
Closes #34246
Changelog category (leave one):
- Improvement
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added FSST (Fast Static Symbol Table) as a new compression codec for string columns.
Documentation entry for user-facing changes
- [ ] Documentation is written (mandatory for new features)
Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/
Modify your CI run:
NOTE: If you merge the PR with modified CI, you MUST KNOW what you are doing.
NOTE: Checked options will be applied if set before the CI RunConfig/PrepareRunConfig step.
Include tests (required builds will be added automatically):
- [ ] Fast test
- [ ] Integration Tests
- [ ] Stateless tests
- [ ] Stateful tests
- [ ] Unit tests
- [ ] Performance tests
- [ ] All with ASAN
- [ ] All with TSAN
- [ ] All with Analyzer
- [ ] Add your option here
Exclude tests:
- [ ] Fast test
- [ ] Integration Tests
- [ ] Stateless tests
- [ ] Stateful tests
- [ ] Performance tests
- [ ] All with ASAN
- [ ] All with TSAN
- [ ] All with MSAN
- [ ] All with UBSAN
- [ ] All with Coverage
- [ ] All with Aarch64
- [ ] Add your option here
Extra options:
- [ ] do not test (only style check)
- [ ] disable merge-commit (no merge from master before tests)
- [ ] disable CI cache (job reuse)
Only specified batches in multi-batch jobs:
- [ ] 1
- [ ] 2
- [ ] 3
- [ ] 4
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.
:white_check_mark: nikitamikhaylov
:white_check_mark: rschu1ze
:x: Ulad
Ulad does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Have you signed the CLA already but the status is still pending? Let us recheck it.
This is an automatic comment. The PR description does not match the template.
Please edit it accordingly.
The error is: More than one changelog category specified: 'New Feature', 'Improvement'
Ref https://github.com/ClickHouse/ClickHouse/issues/34246
This is an automated comment for commit e6a73251ba54b754027c697d5b56d28f7a80b149 with a description of existing statuses. It is updated for the latest CI run.
❌ Click here to open a full report in a separate page
Check name | Description | Status |
---|---|---|
A Sync | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ⏳ pending |
CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ❌ failure |
Mergeable Check | Checks if all other necessary checks are successful | ❌ failure |
Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ❌ failure |
Successful checks
Check name | Description | Status |
---|---|---|
Docs check | Builds and tests the documentation | ✅ success |
PR Check | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
Option 1: Expand the `ICompressionCodec` interface: add an optional Filter parameter to the decompression method, a structure containing, for example, required_substrings. The codec pretends to decompress all data but is allowed to return strings that are not needed as empty. The filter is created in the query interpreter (more precisely, during query analysis) and is then carried through all data reading interfaces down to decompression.
Option 2: If a String (possibly `Nullable`, `Array(String)`) uses the FSST codec, return a `ColumnFSST` instead of a `ColumnString` during deserialization. This is a new type of column that holds the still-compressed bytes from one or more granules for subsequent lazy decompression. Only a few of its methods are implemented (filter, cut); in almost all other cases, it must first be materialized into a full-fledged column.
Without one of these optimizations, the codec does not make much sense: in raw performance and compression ratio, it is strictly worse than LZ4 or ZSTD.
I am curious how the compression ratio and performance compare to ZSTD.
You could load any of these datasets https://clickhouse.com/docs/en/getting-started/example-datasets, re-compress the string columns with different codecs, then do full-column scans over them so they are decompressed.
Hello! Here are the compression ratios on various datasets:
Trips - pickup_ntaname and dropoff_ntaname columns:
- FSST
- STRING(LZ4)
- LowCardinality
- ZSTD
Amazon reviews - review_body column:
- FSST
- STRING(LZ4)
- ZSTD
Reddit comments - body column:
- FSST
- STRING(LZ4)
- ZSTD
To summarize, FSST is significantly inferior to ZSTD in terms of compression ratio, but in some cases (mainly on long string values) it surpasses LZ4. We will add the performance measurement results soon.
Interesting, thanks. The comparison with LZ4 in terms of compression ratio confirms the findings in the FSST paper. And of course, Zstd compresses really well...
I guess the true advantage of FSST as a light-weight compression codec compared to the existing heavy-weight codecs in ClickHouse is its ability for random access. Unfortunately, `ICompressionCodec::doDecompress` decompresses the entire block at once, so the existing API can't leverage random access.
@Pelanglene I am not sure how much more time you can/want to spend on this project, but there are two interesting directions to explore (suggested by @alexey-milovidov):

Option 1: Expand the `ICompressionCodec` interface for random access. There are two sub-cases:

- get: decompress selected rows instead of the whole block. This is useful for projection when most rows were already removed by filters. API-wise, one needs to pass a bit vector into `doDecompress` (or an overload of `doDecompress`) to indicate the rows to decompress. FSST is well suited for that use case.
- filter: i.e. filter pushdown to the codec level. Some codecs (including FSST) allow evaluating predicates on compressed data, i.e. without decompression. As an example, the simplest case would be an equality filter on dictionary/domain-compressed (`LowCardinality` in ClickHouse) columns, where one would translate x = 'abc' into the value id of x and then scan the (compressed) value id vector for matches. The question is how to represent the filter ... I guess an arbitrary operator plus a list of arbitrary operands will do the trick. This is as expressive as anything the user can write in SQL. To take advantage of this, one also needs to create the filter in the query interpreter (more precisely, in query analysis) and carry it through all data reading interfaces down to the codec level.
Option 2: If a String (possibly `Nullable`, `Array(String)`) uses the FSST codec, then during deserialization, return a `ColumnFSST` instead of a `ColumnString`. This is a new type of column that holds the still-compressed bytes from one or more granules for subsequent lazy decompression. Only a few of its methods are implemented (filter, cut); in almost all other cases, it must first be materialized into a full-fledged column.
At the moment, I'm trying to implement approach number 2. The first option seemed almost impossible to me because of the huge number of abstractions the filter would have to be threaded through: streams, compressed buffers, various Select processor components, and so on. The second approach I'm implementing resembles CompressedColumn, which is based on a somewhat similar idea.
Dear @rschu1ze, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.
There are unfortunately too many unknown unknowns, e.g. what the performance (compression/decompression) and size impact of ColumnFSST is, and the implementation is not finished. I still like the direction (FSST as the internal representation for String/FixedString columns, as well as predicate pushdown to the codec level), but this needs more experimentation. I guess we should give it another try next year.