
ClickHouse Performance Optimizations by Tencent

Open amosbird opened this issue 8 months ago • 9 comments

This submission builds on top of the latest ClickHouse with a series of performance optimizations, developed with support from Tencent. Each optimization has been carefully validated and is intended to be contributed upstream incrementally through individual PRs—some of which have already been merged.

Benchmark results were generated using artifacts built by the official CI pipeline of #81944, with great help from @nickitat — thank you!

The following optimizations are included:

1. Push TopN threshold to MergeTreeSource

Pushes the TopN threshold into MergeTreeSource to enable early filtering during the read phase. By passing the (N–1)th threshold value from the TopN state, rows below the threshold can be skipped earlier, reducing IO and improving performance.
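The idea can be sketched as follows (a minimal Python model for illustration; the actual implementation lives in ClickHouse's C++ MergeTreeSource, and all names here are illustrative):

```python
import heapq

def topn_scan(rows, n):
    """Sketch: ORDER BY value DESC LIMIT n with the threshold pushed into the scan.

    Keep a min-heap of the current top-n values. Once the heap is full, its
    smallest element acts as the (n-1)th threshold: any scanned row that
    cannot beat it is skipped without touching the heap at all.
    """
    heap = []      # min-heap holding the current top-n values
    skipped = 0    # rows filtered early by the pushed-down threshold
    for value in rows:
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:            # beats the current threshold
            heapq.heapreplace(heap, value)
        else:
            skipped += 1                 # early filtering during the read phase
    return sorted(heap, reverse=True), skipped
```

In the real engine the payoff is larger than the heap bookkeeping suggests, because skipping a row early also avoids decompressing and materializing the other columns for it.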

2. Precompute hashes and prefetch for prealloc variants (previous prealloc optimization)

For ColumnsHashing implementations that support the prealloc strategy:

  • Key hashes are precomputed before any potential serialization.
  • These hashes are used to prefetch hash table cells more efficiently.
  • Serialization is skipped when probing by hash fails (e.g., for group-by overflow rows).
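The probe-by-hash shortcut can be sketched like this (a minimal Python model of the ColumnsHashing idea; the table layout and names are illustrative, and the prefetch step has no Python equivalent):

```python
def probe_only(keys, table):
    """Probe an existing hash table for a batch of keys.

    Hashes are precomputed for the whole batch up front (in C++ these
    hashes also drive prefetching of the target cells). A key whose hash
    is absent from the table is rejected without ever serializing it --
    the case that matters for group-by overflow rows, which must not
    create new entries.
    """
    hashes = [hash(k) for k in keys]   # precomputed before any serialization
    hits = []
    for k, h in zip(keys, hashes):
        if h not in table:             # hash probe failed: skip serialization
            continue
        if k in table[h]:              # full key comparison only on a hash hit
            hits.append(k)
    return hits
```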

Also introduced the optimize_trivial_group_by_limit_query setting, which applies max_rows_to_group_by for trivial GROUP BY LIMIT queries to avoid unnecessary aggregation work.
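The effect of the setting can be modeled simply (an illustrative Python sketch, not the ClickHouse implementation): for a query like `SELECT k FROM t GROUP BY k LIMIT n` with no aggregate functions, any n distinct keys are a valid answer, so the scan can stop as soon as n groups exist.

```python
def trivial_group_by_limit(values, limit):
    """Sketch: SELECT k FROM t GROUP BY k LIMIT n with no aggregates.
    Stop as soon as `limit` distinct groups have been seen, instead of
    aggregating the whole input (the effect of applying
    max_rows_to_group_by internally)."""
    groups = set()
    rows_read = 0
    for v in values:
        rows_read += 1
        groups.add(v)
        if len(groups) >= limit:
            break            # early exit: further rows cannot change the answer
    return sorted(groups), rows_read
```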

3. Extend string hash map with inlined hash

The string hash map is optimized by combining the string length and its hash into a single 8-byte value. Since most string lengths fit within 4 bytes and a CRC32 hash is exactly 4 bytes, combining them:

  • Produces a more compact cell representation.
  • Enables faster string comparison through a single comparison of combined length and hash.
  • Improves performance by reducing memory footprint and branching.
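The packing scheme itself is easy to illustrate (a Python sketch; ClickHouse's cell layout is C++ and the function names here are made up):

```python
import zlib

def combined_id(s: bytes) -> int:
    """Pack (length, CRC32) into one 64-bit value, assuming the length
    fits in 32 bits. Two strings with different combined ids cannot be
    equal, so one integer compare covers both the length check and the
    hash check."""
    assert len(s) < 2**32
    return (len(s) << 32) | zlib.crc32(s)

def cells_equal(a: bytes, b: bytes) -> bool:
    if combined_id(a) != combined_id(b):   # single compare of length + hash
        return False
    return a == b                          # full byte compare only on id match
```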

4. Optimize index analysis with earlier QCC filtering (#82380)

Refactored the integration of Query Condition Cache (QCC) with index analysis:

  • QCC filtering is now applied before primary key and skip index evaluation, reducing redundant index computations.
  • Index analysis now supports multiple range filters and caches the filtering results back into QCC.

This notably accelerates short queries when index analysis is the dominant cost.
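The reordering amounts to a cache-first lookup (a hypothetical Python sketch of the control flow; the real QCC keys and data structures differ):

```python
def analyze_part(part, condition, qcc, run_index_analysis):
    """Consult the query condition cache before running (costly)
    primary-key / skip-index analysis, and cache the filtering result
    back into the QCC for subsequent queries."""
    key = (part, condition)
    if key in qcc:
        return qcc[key]                   # cached: skip index analysis entirely
    ranges = run_index_analysis(part, condition)
    qcc[key] = ranges                     # cache the result back into QCC
    return ranges
```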

5. Optimize single COUNT() aggregation on NOT NULL columns (#82104)

When an aggregation query only includes a single COUNT() on a NOT NULL column:

  • The aggregation logic is fully inlined during hash table probing.
  • No aggregation state needs to be allocated or maintained.

This reduces memory usage and CPU overhead, significantly speeding up the aggregation.
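Conceptually, the count lives directly in the hash-table cell (an illustrative Python sketch; the real optimization inlines this into the C++ hash-table probe):

```python
def count_group_by(keys):
    """Sketch: SELECT k, count() FROM t GROUP BY k with k NOT NULL.
    No per-group aggregate state is allocated; the increment happens
    inline as part of the probe/insert step."""
    table = {}
    for k in keys:
        table[k] = table.get(k, 0) + 1   # probe and aggregate in one step
    return table
```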

6. Rewrite regular expression functions into simplified forms (#81992)

Primarily targets Q28. Introduced the optimize_rewrite_regexp_functions setting (enabled by default), allowing the optimizer to rewrite certain calls to replaceRegexpAll, replaceRegexpOne, and extract into simpler and faster forms when specific patterns are detected.
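One representative rewrite rule might look like this (a hypothetical Python sketch; the actual rules behind optimize_rewrite_regexp_functions are broader and implemented in the C++ query optimizer):

```python
import re

_META = set(".^$*+?{}[]\\|()")

def rewrite_replace_regexp_all(pattern: str, repl: str):
    """Illustrative rule: when the pattern contains no regex
    metacharacters and the replacement has no backreferences, a
    replaceRegexpAll call can be rewritten to a plain replaceAll
    (substring replacement), avoiding the regex engine entirely."""
    if any(ch in _META for ch in pattern):
        return None                      # not rewritable by this rule
    if re.search(r"\\\d", repl):
        return None                      # backreference: keep the regex form
    return ("replaceAll", pattern, repl)
```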

Additionally:

  • Enabled count_distinct_optimization by default with several related edge cases fixed.

All these optimizations have been tested and validated via the ClickHouse CI pipeline. Although benchmarked on ClickBench, they were made possible thanks to the extensive support and real-world production environment provided by Tencent (TCHouse-C). I'm continuously working on additional improvements, and will persist in contributing until ClickHouse achieves top-tier performance on ClickBench once more :)

amosbird avatar Jun 23 '25 16:06 amosbird

This is excellent, thanks!

This PR against the ClickBench repository is similar in spirit to @kitaisreal's Ursa (i.e. a research fork of ClickHouse). If all PRs are being integrated into the main codebase anyway, perhaps we don't need this PR (or we can keep it open and continuously update it for the time being)?

rschu1ze avatar Jun 23 '25 20:06 rschu1ze

Thanks for the feedback!

I'd actually prefer to have this PR merged into the ClickBench repository for a few reasons:

  • A continuously updated tuned baseline: This fork plays a similar role to clickhouse-tuned — a place where performance-oriented patches can be evaluated holistically. It gives us a stable, visible comparison point against upstream ClickHouse, even before all PRs are merged. This helps track net gains across batches of changes.

  • Surfacing trade-offs and non-merged work: Not all optimizations may land in upstream immediately — some might be blocked due to generality, compatibility, or maintenance concerns. Keeping this variant in the ClickBench repo allows us to observe and quantify the impact of those changes, even if they're not ultimately accepted upstream. It serves as a real-world benchmark of possible trade-offs.

  • Clear visibility of contribution impact: From my contributor's perspective, having this fork merged into ClickBench makes the performance gains more tangible and attributable. It is highly motivating and helps justify the continued investment.

This PR against the ClickBench repository is similar in spirit to @kitaisreal's Ursa (i.e. a research fork of ClickHouse).

Interesting — the string layout modification mentioned there is also implemented in ByConity (as BigString). I’ve encountered a similar need when working on the projection index feature (row-level index), where faster row seeking on string columns is critical. I’ll look into whether we can achieve this in a backward-compatible way.

amosbird avatar Jun 24 '25 01:06 amosbird

As long as the results are reproducible, let's merge.

alexey-milovidov avatar Jun 25 '25 02:06 alexey-milovidov

I’ll look into whether we can achieve this in a backward-compatible way.

Yes, it is entirely possible.

  • name the new String data type String_v2, rename the old one to String_v1;
  • introduce a table-level setting to interpret the String name as either String_v1 or String_v2;
  • creation of a table with String will rewrite it to either String_v1 or String_v2;
  • loading and ATTACHing a table with String will interpret it as String_v1;
  • the native protocol will serialize String_v1 and String_v2 as the old String with conversion.

alexey-milovidov avatar Jun 25 '25 02:06 alexey-milovidov

which applies max_rows_to_group_by for trivial GROUP BY LIMIT queries to avoid unnecessary aggregation work

Thanks, I wanted it for a long time!

alexey-milovidov avatar Jun 25 '25 02:06 alexey-milovidov

Yes, it is entirely possible.

Hmm, I was actually thinking of a different strategy: keep using the same type, but recognize the underlying streams — and if there's a separate size stream, apply a new serde logic accordingly. This behavior would only apply to MergeTree's wide format, which I believe should be sufficient.
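The separate-stream layout being discussed might be sketched like this (an illustrative Python model under the assumption of fixed-width sizes; the actual MergeTree serialization uses its own stream naming and encodings):

```python
def serialize_split(strings):
    """Sketch of a separate-size-stream layout for a String column:
    lengths go to one stream and raw bytes to another, instead of the
    classic interleaved [length][bytes] encoding. Reading only the size
    stream (e.g. for length(s)) then never touches the data stream."""
    sizes = bytearray()
    data = bytearray()
    for s in strings:
        sizes += len(s).to_bytes(4, "little")   # fixed-width for simplicity
        data += s
    return bytes(sizes), bytes(data)

def read_sizes(size_stream):
    """Decode the size stream alone, without the data stream."""
    return [int.from_bytes(size_stream[i:i + 4], "little")
            for i in range(0, len(size_stream), 4)]
```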

amosbird avatar Jun 25 '25 02:06 amosbird

Maybe we can try. Although, having to look up an additional file looks hacky. Another solution is to approach it as different column representations, like ColumnSparse (how do we store that a certain column is in the Sparse format?).

alexey-milovidov avatar Jun 25 '25 02:06 alexey-milovidov

Maybe we can try. Although, having to look up an additional file looks hacky. Another solution is to approach it as different column representations, like ColumnSparse (how do we store that a certain column is in the Sparse format?).

Sure, a different serde in serialization.json is definitely better than looking for files.

amosbird avatar Jun 25 '25 03:06 amosbird

As long as the results are reproducible, let's merge.

I've just merged an additional optimization from my team that addresses the Q23 issue. With this fix, the results should now be fully reproducible without any manual post-processing.

I've updated benchmark.sh to use the following binary:

https://clickhouse-builds.s3.amazonaws.com/PRs/81944/cda07f8aca770d97ea149eec6b477dcfd59d134e/build_amd_release/clickhouse-common-static-25.7.1.1-amd64.tgz

@rschu1ze @nickitat Could you help re-run the benchmarks and update the results on both c6a.metal and c6a.4xlarge, as those were mentioned in the Firebolt PR as the most commonly used environments?

Thanks a lot!

amosbird avatar Jun 25 '25 06:06 amosbird

I've implemented the separate size stream string format with .size subcolumn support and corresponding optimizations. However, the performance improvement is only significant in Q27, while other queries show negligible gains or even slight regressions.

Given this, it's not suitable to enable by default, so I’ll leave it disabled for now.

P.S. The results have been updated. @alexey-milovidov — if everything looks good, would you mind helping to merge this PR? Much appreciated!

amosbird avatar Jun 29 '25 18:06 amosbird

@rschu1ze, thanks a lot! Just wondering — the entry appears on the leaderboard, but it only shows the name without rendering the actual benchmark results. Do you know what might be causing this?


amosbird avatar Jun 29 '25 20:06 amosbird

It's because we added the tuned field recently. Here is the fix: https://github.com/ClickHouse/ClickBench/pull/421. cc @rschu1ze

amosbird avatar Jun 30 '25 02:06 amosbird

Is there a plan to introduce a PR for "Push TopN threshold to MergeTreeSource"? There are some issues (https://github.com/ClickHouse/ClickHouse/issues/65990, https://github.com/ClickHouse/ClickHouse/issues/85081, and https://github.com/ClickHouse/ClickHouse/issues/75098, which I reported) that I imagine would benefit significantly from that optimization. Is that correct?

EmeraldShift avatar Aug 28 '25 21:08 EmeraldShift

Is there a plan to introduce a PR ...

There is definitely a plan, but since I already have several PRs pending, it will take some time to prepare and land it.

amosbird avatar Aug 29 '25 01:08 amosbird

@amosbird, the build file used in this entry has disappeared. Please update.

alexey-milovidov avatar Sep 07 '25 20:09 alexey-milovidov

@amosbird, the build file used in this entry has disappeared. Please update.

Affected binary: https://clickhouse-builds.s3.amazonaws.com/PRs/81944/cda07f8aca770d97ea149eec6b477dcfd59d134e/build_amd_release/clickhouse-common-static-25.7.1.1-amd64.tgz

From PR message:

Benchmark results were generated using artifacts built by the official CI pipeline of https://github.com/ClickHouse/ClickHouse/pull/81944, with great help from @nickitat — thank you!

@amosbird Simply rebase https://github.com/ClickHouse/ClickHouse/pull/81944?

rschu1ze avatar Sep 08 '25 11:09 rschu1ze

Sure. I'll rebase it today.

amosbird avatar Sep 08 '25 13:09 amosbird