Could we specify column type, aggregation function and compression on materialized view?
Describe the bug
As topic, Could we specify column type and compression on materialized view?
Following the below example table:
CREATE TABLE IF NOT EXISTS db.mv_table ON CLUSTER '{cluster}'
(
dt_day DateTime CODEC(Delta, ZSTD(1)),
team_id LowCardinality(String) CODEC(ZSTD(1)),
"le_accumulate" SimpleAggregateFunction(sum, Float64),
.......
)ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/clusters/{cluster}/databases/{database}/tables/{table}/{shard}', '{replica}')
PARTITION BY (toYYYYMMDD(dt_day))
How do we create materialized_view and specify column type as like LowCardinality(String) or compression as CODEC(ZSTD(1)) as the same time? Additionally, how to we specify SimpleAggregateFunction(sum, Float64) during create materialized_view
If have any ideas can solve this, I will be appreciated it.
We also need this feature. Functions that do not require a state should either be a SimpleAggregateFunction by default, such as when summing a column, or we should be able to define the type explicitly. I've tried creating a table like this:
{{
config(
materialized="materialized_view",
engine="AggregatingMergeTree",
order_by=["campaign_id"],
)
}}
select campaign_id as campaign_id, countState() as response_count
from {{ ref("stg_rapid_service__responses") }}
where campaign_id is not null
group by campaign_id
This creates the response_count table as an AggregateFunction typed column. If we change the countState to count, then the column type is a simple 'UInt8, but never the SimpleAggregateFunction` that we need it to be.
We also tried setting the type explicitly in the YAML file as such:
version: 2
models:
- name: int_responses_aggregated_to_campaigns
config:
contract:
enforced: true
columns:
- name: campaign_id
data_type: String
- name: response_count
data_type: SimpleAggregateFunction(sum, UInt64)
But then we received the following issue:
| column_name | definition_type | contract_type | mismatch_reason |
|---|---|---|---|
| response_count | AggregateFunction(sum, UInt8) | SimpleAggregateFunction(sum, UInt8) | data type mismatch |
I'd be happy to contribute. However, since I've never worked on this repo before, I would appreciate some implementation hints. @BentsiLeviav , could you point me in the right direction?
Hi @jorgeparavicini
Thanks for considering a contribution!
In general:
Tables without contracts are created using the syntax:
AS (
{query}
)
When there is a contract, the column definitions in the CREATE TABLE statement are taken directly from the contract.
Here is the function responsible: https://github.com/ClickHouse/dbt-clickhouse/blob/ed4e0b6c05d0f88116572d457b1e386427b7f7a6/dbt/adapters/clickhouse/impl.py#L562-L575
As of today, some more advanced column configurations (like codec and ttl, per the docs) are only supported via contracts. Your attempt to extend that same pattern for other complex types that are not easily inferred (e.g., SimpleAggregateFunction, LowCardinality, custom CODECs) - is right. This would give users full flexibility while keeping simple/primitive types inferred automatically.
Per the error you encountered, there are adjustments you need to implement here:
https://github.com/ClickHouse/dbt-clickhouse/blob/ed4e0b6c05d0f88116572d457b1e386427b7f7a6/dbt/include/clickhouse/macros/column_spec_ddl.sql#L1-L39
Maybe save an internal map for "special" types that can't be compared based on pure string only.
I was actually able to get my problem to work using the -SimpleState combinator, was unaware of this. However, I will still create a PR for the LowCardinality support
@hueiyuan @jorgeparavicini I'm reopening this issue as I think I closed it too early. This issue cover two issues:
- Defining the CODEC
- Defining the correct type dealing with complex types like
LowCardinalityorSimpleAggregateFunction
After checking it deeply, we checked that this could be already accomplished with the current version of dbt-clickhouse by using the CAST() function and for the codec, just adding the codec in the contract. For example:
sql_for_validations.sql
{{
config(
materialized="materialized_view",
engine="AggregatingMergeTree",
order_by=["event_type"],
)
}}
select
CAST(event_type, 'LowCardinality(String)') as event_type,
CAST(countState(), 'AggregateFunction(count, UInt32)') as response_count,
CAST(maxSimpleState(event_type), 'SimpleAggregateFunction(max, LowCardinality(String))') as max_event_type
from {{ ref('user_events') }}
group by event_type
And the contract inside schema.yml
- name: sql_for_validations
config:
contract:
enforced: true
columns:
- name: event_type
data_type: LowCardinality(String)
codec: ZSTD(1)
- name: response_count
data_type: AggregateFunction(count, UInt32)
- name: max_event_type
data_type: SimpleAggregateFunction(max, LowCardinality(String))
Would this solution work for both of you? I'd like to get more information from you before making additional changes. In the meantime, since we seem to have found a way forward, I've reverted https://github.com/ClickHouse/dbt-clickhouse/pull/522. That PR helped with validation but would have broken functionality for users who are already using the CAST() function. We apologize for the revert, @jorgeparavicini. We appreciate your PR, but we want to ensure this problem can be solved using this approach.