Add FSST as compression codec

Open · Pelanglene opened this issue 10 months ago · 6 comments

Closes #34246

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Added FSST (Fast Static Symbol Table) as a new compression codec for string columns.

Documentation entry for user-facing changes

  • [ ] Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/


Modify your CI run:

NOTE: If you merge the PR with modified CI, you MUST KNOW what you are doing. NOTE: Checked options will be applied only if set before the CI RunConfig/PrepareRunConfig step.

Include tests (required builds will be added automatically):

  • [ ] Fast test
  • [ ] Integration Tests
  • [ ] Stateless tests
  • [ ] Stateful tests
  • [ ] Unit tests
  • [ ] Performance tests
  • [ ] All with ASAN
  • [ ] All with TSAN
  • [ ] All with Analyzer
  • [ ] Add your option here

Exclude tests:

  • [ ] Fast test
  • [ ] Integration Tests
  • [ ] Stateless tests
  • [ ] Stateful tests
  • [ ] Performance tests
  • [ ] All with ASAN
  • [ ] All with TSAN
  • [ ] All with MSAN
  • [ ] All with UBSAN
  • [ ] All with Coverage
  • [ ] All with Aarch64
  • [ ] Add your option here

Extra options:

  • [ ] do not test (only style check)
  • [ ] disable merge-commit (no merge from master before tests)
  • [ ] disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • [ ] 1
  • [ ] 2
  • [ ] 3
  • [ ] 4

Pelanglene avatar Apr 15 '24 21:04 Pelanglene

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ nikitamikhaylov
✅ rschu1ze
❌ Ulad


Ulad does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Apr 15 '24 21:04 CLAassistant

This is an automatic comment. The PR description does not match the template.

Please edit it accordingly.

The error is: More than one changelog category specified: 'New Feature', 'Improvement'

clickhouse-ci[bot] avatar Apr 17 '24 10:04 clickhouse-ci[bot]

Ref https://github.com/ClickHouse/ClickHouse/issues/34246

vdimir avatar Apr 17 '24 10:04 vdimir

This is an automated comment for commit e6a73251ba54b754027c697d5b56d28f7a80b149 with a description of the existing statuses. It is updated for the latest CI run.

❌ Click here to open a full report in a separate page

| Check name | Description | Status |
|---|---|---|
| A Sync | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ⏳ pending |
| CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ❌ failure |
| Mergeable Check | Checks if all other necessary checks are successful | ❌ failure |
| Style check | Runs a set of checks to keep the code style clean. If some of the tests failed, see the related log from the report | ❌ failure |

Successful checks

| Check name | Description | Status |
|---|---|---|
| Docs check | Builds and tests the documentation | ✅ success |
| PR Check | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |

robot-ch-test-poll3 avatar Apr 25 '24 10:04 robot-ch-test-poll3

Option 1: Expand the ICompressionCodec interface: add an optional Filter parameter to the decompression method, a structure containing, for example, required_substrings. The codec pretends to decompress all data but is allowed to return unneeded strings as empty. The filter is created in the query interpreter (more precisely, during query analysis) and is then carried through all data reading interfaces down to decompression.

Option 2: If a String (possibly Nullable, Array of String) uses the FSST codec, during deserialization, return a ColumnFSST instead of ColumnString. This is a new type of column that contains uncompressed bytes from one or more granules for subsequent lazy decompression. Only a few of its methods are implemented (filter, cut), whereas in almost all other cases, it must first be materialized into a full-fledged column.

Without these ways to accelerate queries, the codec does not make much sense: in raw performance and compression ratio, it is strictly worse than LZ4 or ZSTD.
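The idea behind Option 1 can be sketched in a few lines. This is a hypothetical illustration, not ClickHouse's real API: the names `RowFilter` and `decompress_filtered` are made up, and "decompression" here is the identity function, since the point is the shape of the interface (the codec may emit filtered-out rows as empty strings) rather than codec internals.

```python
class RowFilter:
    """Rows the query actually needs; wanted=None means 'all rows'."""
    def __init__(self, wanted=None):
        self.wanted = wanted

    def needs(self, row):
        return self.wanted is None or row in self.wanted


def decompress_filtered(rows, flt):
    """Pretend to decompress every row, but emit unneeded rows as empty.

    A real codec would skip the decoding work for filtered-out rows;
    here decoding is the identity because only the API shape matters."""
    return [r if flt.needs(i) else b"" for i, r in enumerate(rows)]
```

For example, `decompress_filtered([b"a", b"bb", b"ccc"], RowFilter({0, 2}))` yields `[b"a", b"", b"ccc"]`: the result still has one entry per row, so downstream code sees a column of the expected size, but row 1 cost no decoding work.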

alexey-milovidov avatar May 07 '24 19:05 alexey-milovidov

I am curious how the compression ratio and performance compare to ZSTD?

You could load any of these datasets https://clickhouse.com/docs/en/getting-started/example-datasets and re-compress the string columns with different codecs, then do full-column scans over them so they are decompressed.

Hello! Here are the compression ratios on various datasets:

Trips - pickup_ntaname and dropoff_ntaname columns:

  • FSST fsst
  • STRING(LZ4) string_lz
  • LowCardinality trips_low_cardinality
  • ZSTD trips_zstd

Amazon reviews - review_body column:

  • FSST amazon_fsst
  • STRING(LZ4) amazon_string
  • ZSTD amazon_zstd

Reddit comments - body column:

  • FSST redit_fsst
  • STRING(LZ4) redit_string
  • ZSTD redit_zstd

To summarize, FSST is significantly inferior to ZSTD in terms of compression ratio, but in some cases (mainly on large rows) it surpasses LZ4. We will add the performance measurement results soon.
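For readers unfamiliar with the codec being compared here, a toy sketch of the FSST idea may help. A *static* table of frequent substrings (symbols) is built once per block, and each string is then rewritten as 1-byte codes, with an escape byte for literals. Real FSST caps symbols at 8 bytes, limits the table to 255 entries (code 255 is the escape), and builds the table with a much smarter algorithm; this greedy sketch only shows why short repetitive strings compress well and why every string stays independently decodable.

```python
ESCAPE = 255  # code meaning "the next byte is a literal"

def encode(s, symbols):
    """Greedily replace known substrings with 1-byte codes."""
    out = bytearray()
    i = 0
    while i < len(s):
        for code, sym in enumerate(symbols):
            if s.startswith(sym, i):
                out.append(code)          # one byte stands for the whole symbol
                i += len(sym)
                break
        else:
            out += bytes([ESCAPE, s[i]])  # no symbol matched: escape a literal
            i += 1
    return bytes(out)

def decode(enc, symbols):
    """Expand codes back into their symbols; escapes back into literals."""
    out = bytearray()
    i = 0
    while i < len(enc):
        if enc[i] == ESCAPE:
            out.append(enc[i + 1])
            i += 2
        else:
            out += symbols[enc[i]]
            i += 1
    return bytes(out)

symbols = [b"http://", b".com", b"www."]
enc = encode(b"http://www.example.com", symbols)
assert decode(enc, symbols) == b"http://www.example.com"
assert len(enc) < len(b"http://www.example.com")  # 17 bytes vs 22
```

Because decoding is a plain table lookup per byte with no entropy-coded stream state, any individual string can be decoded on its own, which is the random-access property discussed below.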

trololo23 avatar May 10 '24 12:05 trololo23

Interesting, thanks. The comparison with LZ4 in terms of compression rate confirms the findings in the FSST paper. And of course, Zstd compresses really well...

I guess the true advantage of FSST as a light-weight compression codec compared to the existing heavy-weight codecs in ClickHouse is its ability for random access. Unfortunately, ICompressionCodec::doDecompress decompresses the entire block at once, so the existing API can't leverage random access.

@Pelanglene I am not sure how much more time you can/want to spend on this project, but there are two interesting directions to explore (suggested by @alexey-milovidov):

Option 1: Expand the ICompressionCodec interface for random access.

There are two sub-cases:

  • get: decompress selected rows instead of the whole thing. This is useful for projection when most rows have already been removed by filters. API-wise, one needs to pass a bit vector into doDecompress (or an overload of doDecompress) to indicate the rows to decompress. FSST is well suited for that use case.
  • filter: i.e., filter pushdown to the codec level. Some codecs (including FSST) allow evaluating predicates on compressed data, i.e. without decompression. As an example, the simplest case would be an equality filter on dictionary/domain-compressed (LowCardinality in ClickHouse) columns, where one would translate x = 'abc' into the value id of x and then scan the (compressed) value id vector for matches. The question is how to represent the filter ... I guess an arbitrary operator plus a list of arbitrary operand(s) will do the trick. This is as expressive as anything the user can write in SQL. To take advantage of this, one also needs to create the filter in the query interpreter (more precisely, in query analysis) and carry it through all data reading interfaces down to the codec level.
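The filter sub-case described above can be sketched for the dictionary-encoded example. All names here are illustrative stand-ins, not ClickHouse's real API: the predicate x = 'abc' is evaluated entirely on the compressed value-id vector, paying for one string comparison against the dictionary instead of one per row.

```python
def eval_equality_on_ids(dictionary, ids, needle):
    """Return a row mask for `column = needle` using only the id vector."""
    try:
        needle_id = dictionary.index(needle)   # the only string comparison
    except ValueError:
        return [False] * len(ids)              # value absent: nothing matches
    return [vid == needle_id for vid in ids]   # integer compares only

dictionary = ["", "abc", "xyz"]
ids = [1, 2, 1, 0]  # decodes to "abc", "xyz", "abc", ""
assert eval_equality_on_ids(dictionary, ids, "abc") == [True, False, True, False]
```

The same shape generalizes to the "arbitrary operator plus operands" representation mentioned above: the codec receives the operator and the already-translated operands and returns a row mask, never materializing the strings.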

Option 2: If a String (possibly Nullable, Array(String)) uses the FSST codec, then during deserialization, return a ColumnFSST instead of ColumnString. This is a new type of column that contains uncompressed bytes from one or more granules for subsequent lazy decompression. Only a few of its methods are implemented (filter, cut), whereas in almost all other cases, it must first be materialized into a full-fledged column.
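A minimal sketch of the Option 2 shape, with illustrative names (this is not the real ColumnFSST, which does not exist yet): the column keeps rows in their encoded form, filter and cut operate on the still-encoded rows, and decoding happens only in materialize(). A selective query thus pays decoding cost only for the rows that survive filtering.

```python
class ColumnFSST:
    """Lazy column: rows stay FSST-encoded until materialize() is called."""

    def __init__(self, encoded_rows, decode_fn):
        self.rows = encoded_rows  # still compressed
        self.decode = decode_fn   # per-row FSST decoder

    def filter(self, mask):
        # Drops rows without decoding anything.
        kept = [r for r, m in zip(self.rows, mask) if m]
        return ColumnFSST(kept, self.decode)

    def cut(self, start, length):
        # Takes a row range, again without decoding.
        return ColumnFSST(self.rows[start:start + length], self.decode)

    def materialize(self):
        # Everything else (comparisons, hashing, output) happens here,
        # on a full-fledged decoded column.
        return [self.decode(r) for r in self.rows]

# Identity decoder as a stand-in for real FSST decoding:
col = ColumnFSST([b"a", b"b", b"c"], lambda r: r)
assert col.filter([True, False, True]).materialize() == [b"a", b"c"]
```

The design question this sketch exposes is exactly the one raised above: every column method not on the fast path must first force materialization, so the win depends on how often queries stay within filter/cut before touching the strings.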

rschu1ze avatar May 14 '24 10:05 rschu1ze

At the moment, I'm trying to implement option 2. The first option seemed almost impossible to me because of the huge number of abstractions the filter would need to be threaded through, such as streams, compressed buffers, various Select processor components, and so on. The approach I'm implementing is similar to CompressedColumn, which is built on a related idea.

trololo23 avatar May 21 '24 12:05 trololo23

Dear @rschu1ze, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

woolenwolfbot[bot] avatar Jul 02 '24 13:07 woolenwolfbot[bot]

There are unfortunately too many unknown unknowns, e.g., the performance (compression/decompression) and size impact of ColumnFSST, and the implementation is not finished. I still like the direction (FSST as the internal representation for String/FixedString columns, as well as predicate pushdown to the codec level), but this needs more experimentation. I guess we should give it another try next year.

rschu1ze avatar Jul 02 '24 21:07 rschu1ze