dbt-clickhouse icon indicating copy to clipboard operation
dbt-clickhouse copied to clipboard

Standard `table` materialization does not work with ReplicateMergeTree engine

Open mharrisb1 opened this issue 2 years ago • 3 comments

Not sure if this is related to #14 since I didn't explore the attached PR, but I wanted to call out this problem specifically.

It is my understanding that the way dbt-clickhouse handles table materializations is to:

  1. Rename table (if exists) to {{ this }}__dbt_tmp or something similar (I don't have the logs in front of me)
  2. Create a new table with name {{ this }}
  3. Drop temp table

With the Replication tables, this results in an error because there already exists a table with the replica path of /clickhouse/tables/{shard}/{database}/{table_name}.

I am currently getting around this using an incremental materialization with a config like:

{{ config(
    materialized="incremental",
    engine="ReplicatedMergeTree('/clickhouse/tables/{database}/{table}', '{replica}')",
    partition_by=[...],
    order_by=[...],
    unique_key=[...],
    pre_hook="truncate table if exists {{ this }} on cluster '...'",
    inserts_only=true
) }}

I'm not sure how we could handle a simple table materialization using the current rename-create-drop strategy since I don't believe you can update the replica path after creation and I don't believe there is a random value substitution provided which could potentially allow a replica path like /clickhouse/tables/{shard}/{database}/{table_name}__{random_uuid}

mharrisb1 avatar Sep 08 '22 16:09 mharrisb1

Some of this may be solved by using the Replicated database engine. See https://github.com/ClickHouse/ClickHouse/issues/31661. I'll be doing some testing with Replicated databases over the next couple of weeks.

genzgd avatar Sep 08 '22 20:09 genzgd

Tests work with with Replicated database engine so explicit support for ReplicatedMergeTree engines will probably not be a priority.

genzgd avatar Sep 19 '22 14:09 genzgd

For anyone that stumbles across this issue and is still using ReplicatedMergeTree engine like us, here is a better workaround:

Project config

# dbt_project.yml

vars:
  # Creates a unique ID for this dbt run using the run's timestamp
  run_id: "{{ run_started_at.strftime('%Y%m%d%H%M%S%f') }}"

  # Use this engine template for all models materialized as `table` that use `ReplicatedMergeTree` engine
  # Zookeeper path will use default layout but the table name is suffixed with the run's ID
  engine_template: >
    ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/{table}_{{ var("run_id") }}', '{replica}')

Model config

-- my_table.sql

{{ config(
    materialized="table",
    engine=var("engine_template"),
    partition_by=[...],
    order_by=[...],
    unique_key=[...]
) }}

SELECT ...
FROM ...

mharrisb1 avatar Oct 12 '22 17:10 mharrisb1

@mharrisb1 Is run_id scheme somehow differs from using standard {uuid} clickhouse have? i.e. engine = "ReplicatedReplacingMergeTree('/clickhouse/tables/{layer}-{shard}/{database}/{table}/{uuid}', '{replica}')"

tema-popov avatar Nov 29 '22 11:11 tema-popov

@tema-popov I wasn't aware of that template. That would be a much better solution than mine. I will test it out and update this issue if it works. Thanks!

mharrisb1 avatar Dec 04 '22 01:12 mharrisb1

@mharrisb1 Do you know if this is fixed by the latest update + the {uuid} macro?

genzgd avatar Jan 08 '23 22:01 genzgd