dbt-clickhouse icon indicating copy to clipboard operation
dbt-clickhouse copied to clipboard

Support insert_overwrite Materialization Strategy

Open genzgd opened this issue 2 years ago • 2 comments

We may want to consider also implementing an insert+overwrite incremental materialization strategy for models with many ClickHouse partitions. This would involve dropping partitions that have new or updated data and doing a complete replacement of such partitions (possibly from a temporary table.)

To make this performant we could have to analyze the number and size of the partition affected by the update and weigh delete+insert cost against the cost of simply dropping partitions and replacing the data.

The other consideration might be how strategies interact with ClickHouse transactions when and if production transactional support is available.

genzgd avatar Jan 08 '23 22:01 genzgd

Sounds similar to what we have now internally

We introduced extra option in model config

{{
  config (
      ...
      partition_by = 'toYYYYMM(date)',
      partition_strategy = 'append', // or can be 'rewrite'
  )
}}
  • create tmp table
  • insert new data in tmp table
  • get all partitions in tmp table
  • if partition_strategy = 'append' => ATTACH PARTITION ... FROM tmp
  • if partition_strategy = 'rewrite' => REPLACE PARTITION ... FROM tmp
  • drop tmp

You just need to be careful when doing replace, to not lose anything :) For example if you have monthly partitioning you need to make sure that you update whole month, not only last X days

simpl1g avatar Jan 27 '23 14:01 simpl1g

Hi @simpl1g thank you for sharing, your approach sounds the best balance of fault-tolerance and efficiency. Could you please share an implementation with the community? I will be glad to review the contribution and give my approval to speed up the release.

Without this feature ClickHouse DBT plugin looks quite unusable for production data pipelines due to complex recovery from failures. I am looking for options to integrate DBT with Airflow to manage ClickHouse partitions based on Airflow DAG's execution date (e.g. partition tables daily and then update them via daily Airflow DAG using insert-replace strategy).

bryzgaloff avatar Sep 28 '23 14:09 bryzgaloff