dbt-clickhouse
dbt-clickhouse copied to clipboard
Standard `table` materialization does not work with ReplicateMergeTree engine
Not sure if this is related to #14 since I didn't explore the attached PR, but I wanted to call out this problem specifically.
It is my understanding that the way dbt-clickhouse handles table materializations is to:
- Rename table (if exists) to
{{ this }}__dbt_tmp
or something similar (I don't have the logs in front of me) - Create a new table with name
{{ this }}
- Drop temp table
With the Replication tables, this results in an error because there already exists a table with the replica path of /clickhouse/tables/{shard}/{database}/{table_name}
.
I am currently getting around this using an incremental materialization with a config like:
{{ config(
materialized="incremental",
engine="ReplicatedMergeTree('/clickhouse/tables/{database}/{table}', '{replica}')",
partition_by=[...],
order_by=[...],
unique_key=[...],
pre_hook="truncate table if exists {{ this }} on cluster '...'",
inserts_only=true
) }}
I'm not sure how we could handle a simple table materialization using the current rename-create-drop strategy since I don't believe you can update the replica path after creation and I don't believe there is a random value substitution provided which could potentially allow a replica path like /clickhouse/tables/{shard}/{database}/{table_name}__{random_uuid}
Some of this may be solved by using the Replicated database engine. See https://github.com/ClickHouse/ClickHouse/issues/31661. I'll be doing some testing with Replicated databases over the next couple of weeks.
Tests work with with Replicated database engine so explicit support for ReplicatedMergeTree engines will probably not be a priority.
For anyone that stumbles across this issue and is still using ReplicatedMergeTree
engine like us, here is a better workaround:
Project config
# dbt_project.yml
vars:
# Creates a unique ID for this dbt run using the run's timestamp
run_id: "{{ run_started_at.strftime('%Y%m%d%H%M%S%f') }}"
# Use this engine template for all models materialized as `table` that use `ReplicatedMergeTree` engine
# Zookeeper path will use default layout but the table name is suffixed with the run's ID
engine_template: >
ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/{table}_{{ var("run_id") }}', '{replica}')
Model config
-- my_table.sql
{{ config(
materialized="table",
engine=var("engine_template"),
partition_by=[...],
order_by=[...],
unique_key=[...]
) }}
SELECT ...
FROM ...
@mharrisb1 Is run_id scheme somehow differs from using standard {uuid}
clickhouse have?
i.e.
engine = "ReplicatedReplacingMergeTree('/clickhouse/tables/{layer}-{shard}/{database}/{table}/{uuid}', '{replica}')"
@tema-popov I wasn't aware of that template. That would be a much better solution than mine. I will test it out and update this issue if it works. Thanks!
@mharrisb1 Do you know if this is fixed by the latest update + the {uuid} macro?