Support insert_overwrite Materialization Strategy
We may also want to implement an insert+overwrite incremental materialization strategy for models with many ClickHouse partitions. This would involve dropping the partitions that have new or updated data and completely replacing those partitions (possibly from a temporary table).
To make this performant we would have to analyze the number and size of the partitions affected by the update and weigh the cost of delete+insert against the cost of simply dropping partitions and replacing the data.
Another consideration is how these strategies would interact with ClickHouse transactions, if and when production transactional support becomes available.
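A rough sketch of the idea in plain ClickHouse SQL (the table names, the staging source, and the partition value 202405 are all hypothetical):

```sql
-- Drop the affected partition, then re-insert the full data for it.
-- Assumes my_model is partitioned by toYYYYMM(date).
ALTER TABLE my_model DROP PARTITION 202405;

INSERT INTO my_model
SELECT *
FROM my_model_staging
WHERE toYYYYMM(date) = 202405;
```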
Sounds similar to what we have internally now.
We introduced an extra option in the model config:
```
{{
  config(
    ...
    partition_by = 'toYYYYMM(date)',
    partition_strategy = 'append'  # or 'rewrite'
  )
}}
```
- create a tmp table
- insert the new data into the tmp table
- get all partitions present in the tmp table
- if partition_strategy = 'append' => ATTACH PARTITION ... FROM tmp
- if partition_strategy = 'rewrite' => REPLACE PARTITION ... FROM tmp
- drop the tmp table (see the SQL sketch below)
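A minimal SQL sketch of those steps, assuming hypothetical table names and a single affected partition 202405:

```sql
-- Create the tmp table with the target's schema and insert the new data.
CREATE TABLE my_model__tmp AS my_model;
INSERT INTO my_model__tmp SELECT * FROM my_model_staging;

-- Get all partitions present in the tmp table.
SELECT DISTINCT partition
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'my_model__tmp'
  AND active;

-- Then, for each partition found, depending on partition_strategy:
ALTER TABLE my_model ATTACH PARTITION 202405 FROM my_model__tmp;  -- 'append'
ALTER TABLE my_model REPLACE PARTITION 202405 FROM my_model__tmp; -- 'rewrite'

-- Drop the tmp table.
DROP TABLE my_model__tmp;
```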
You just need to be careful when doing a replace, so you don't lose anything :) For example, if you have monthly partitioning, you need to make sure that you rewrite the whole month, not only the last X days.
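To illustrate the pitfall with hypothetical dates:

```sql
-- The tmp table holds only the last 3 days of May ...
INSERT INTO my_model__tmp
SELECT * FROM my_model_staging
WHERE date >= toDate('2024-05-29');

-- ... but REPLACE PARTITION swaps out the entire month:
ALTER TABLE my_model REPLACE PARTITION 202405 FROM my_model__tmp;
-- Rows for 2024-05-01 .. 2024-05-28 are now gone from my_model.
```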
Hi @simpl1g, thank you for sharing. Your approach sounds like the best balance of fault tolerance and efficiency. Could you please share an implementation with the community? I would be glad to review the contribution and give my approval to speed up the release.
Without this feature, the ClickHouse dbt plugin looks quite unusable for production data pipelines because recovery from failures is complex. I am looking for options to integrate dbt with Airflow so that ClickHouse partitions are managed based on the Airflow DAG's execution date (e.g. partition tables daily and then update them via a daily Airflow DAG using an insert-replace strategy).
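For what it's worth, here is a hypothetical model showing how that could look with the partition_strategy option discussed above; the run_date var name and the source are made up, and the DAG would pass the date with something like dbt run --select events_daily --vars '{run_date: 2024-05-29}':

```sql
-- models/events_daily.sql (hypothetical)
{{
  config(
    materialized = 'incremental',
    partition_by = 'toDate(event_date)',
    partition_strategy = 'rewrite'
  )
}}

-- Rebuild exactly one daily partition per DAG run.
select *
from {{ source('raw', 'events') }}
where event_date = toDate('{{ var("run_date") }}')
```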