dbt-clickhouse icon indicating copy to clipboard operation
dbt-clickhouse copied to clipboard

Could we specify column type, aggregation function and compression on materialized view?

Open hueiyuan opened this issue 1 year ago • 4 comments

Describe the bug

As topic, Could we specify column type and compression on materialized view?

Following the below example table:

CREATE TABLE IF NOT EXISTS db.mv_table ON CLUSTER '{cluster}'
(
    dt_day DateTime CODEC(Delta, ZSTD(1)),
    team_id LowCardinality(String) CODEC(ZSTD(1)),
    "le_accumulate" SimpleAggregateFunction(sum, Float64),
.......
)ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/clusters/{cluster}/databases/{database}/tables/{table}/{shard}', '{replica}')
PARTITION BY (toYYYYMMDD(dt_day))

How do we create materialized_view and specify column type as like LowCardinality(String) or compression as CODEC(ZSTD(1)) as the same time? Additionally, how to we specify SimpleAggregateFunction(sum, Float64) during create materialized_view

If have any ideas can solve this, I will be appreciated it.

hueiyuan avatar Jun 28 '24 06:06 hueiyuan

We also need this feature. Functions that do not require a state should either be a SimpleAggregateFunction by default, such as when summing a column, or we should be able to define the type explicitly. I've tried creating a table like this:

{{
    config(
        materialized="materialized_view",
        engine="AggregatingMergeTree",
        order_by=["campaign_id"],
    )
}}

select campaign_id as campaign_id, countState() as response_count
from {{ ref("stg_rapid_service__responses") }}
where campaign_id is not null
group by campaign_id

This creates the response_count table as an AggregateFunction typed column. If we change the countState to count, then the column type is a simple 'UInt8, but never the SimpleAggregateFunction` that we need it to be.

We also tried setting the type explicitly in the YAML file as such:

version: 2

models:
  - name: int_responses_aggregated_to_campaigns
    config:
      contract:
        enforced: true
    columns:
      - name: campaign_id
        data_type: String
      - name: response_count
        data_type: SimpleAggregateFunction(sum, UInt64)

But then we received the following issue:

column_name definition_type contract_type mismatch_reason
response_count AggregateFunction(sum, UInt8) SimpleAggregateFunction(sum, UInt8) data type mismatch

I'd be happy to contribute. However, since I've never worked on this repo before, I would appreciate some implementation hints. @BentsiLeviav , could you point me in the right direction?

jorgeparavicini avatar Sep 17 '25 16:09 jorgeparavicini

Hi @jorgeparavicini

Thanks for considering a contribution!

In general:

Tables without contracts are created using the syntax:

AS (
    {query}
)

When there is a contract, the column definitions in the CREATE TABLE statement are taken directly from the contract.

Here is the function responsible: https://github.com/ClickHouse/dbt-clickhouse/blob/ed4e0b6c05d0f88116572d457b1e386427b7f7a6/dbt/adapters/clickhouse/impl.py#L562-L575

As of today, some more advanced column configurations (like codec and ttl, per the docs) are only supported via contracts. Your attempt to extend that same pattern for other complex types that are not easily inferred (e.g., SimpleAggregateFunction, LowCardinality, custom CODECs) - is right. This would give users full flexibility while keeping simple/primitive types inferred automatically.

Per the error you encountered, there are adjustments you need to implement here:

https://github.com/ClickHouse/dbt-clickhouse/blob/ed4e0b6c05d0f88116572d457b1e386427b7f7a6/dbt/include/clickhouse/macros/column_spec_ddl.sql#L1-L39

Maybe save an internal map for "special" types that can't be compared based on pure string only.

BentsiLeviav avatar Sep 18 '25 09:09 BentsiLeviav

I was actually able to get my problem to work using the -SimpleState combinator, was unaware of this. However, I will still create a PR for the LowCardinality support

jorgeparavicini avatar Sep 18 '25 10:09 jorgeparavicini

@hueiyuan @jorgeparavicini I'm reopening this issue as I think I closed it too early. This issue cover two issues:

  • Defining the CODEC
  • Defining the correct type dealing with complex types like LowCardinality or SimpleAggregateFunction

After checking it deeply, we checked that this could be already accomplished with the current version of dbt-clickhouse by using the CAST() function and for the codec, just adding the codec in the contract. For example:

sql_for_validations.sql

{{
    config(
        materialized="materialized_view",
        engine="AggregatingMergeTree",
        order_by=["event_type"],
    )
}}

select
  CAST(event_type, 'LowCardinality(String)') as event_type,
  CAST(countState(), 'AggregateFunction(count, UInt32)') as response_count,
  CAST(maxSimpleState(event_type), 'SimpleAggregateFunction(max, LowCardinality(String))') as max_event_type
from {{ ref('user_events') }}
group by event_type

And the contract inside schema.yml

  - name: sql_for_validations
    config:
      contract:
        enforced: true
    columns:
      - name: event_type
        data_type: LowCardinality(String)
        codec: ZSTD(1)
      - name: response_count
        data_type: AggregateFunction(count, UInt32)
      - name: max_event_type
        data_type: SimpleAggregateFunction(max, LowCardinality(String))

Would this solution work for both of you? I'd like to get more information from you before making additional changes. In the meantime, since we seem to have found a way forward, I've reverted https://github.com/ClickHouse/dbt-clickhouse/pull/522. That PR helped with validation but would have broken functionality for users who are already using the CAST() function. We apologize for the revert, @jorgeparavicini. We appreciate your PR, but we want to ensure this problem can be solved using this approach.

koletzilla avatar Oct 08 '25 13:10 koletzilla