dbt-bigquery
[Bug] dbt clone throws an error when partition or cluster has changed
Is this a new bug in dbt-bigquery?
- [X] I believe this is a new bug in dbt-bigquery
- [X] I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
Say I have a model that is partitioned at day granularity in production. I update the model to partition at month granularity and successfully build it in staging. In my production step, I then clone the model to production using dbt clone. BigQuery throws:
Cannot replace a table with a different partitioning spec. Instead, DROP the table,
and then recreate it. New partitioning spec is interval(type:month,field:timestamp)
and existing spec is interval(type:day,field:timestamp)
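For context, my understanding (an assumption; I have not traced the dbt-bigquery clone macros) is that a full-refresh clone ends up issuing something like the statement below, and BigQuery refuses to REPLACE a table whose partitioning spec differs from the existing one. Project and dataset names are placeholders:

-- illustrative only; placeholder project / dataset names
create or replace table `my-project.prod.sample_model`
clone `my-project.staging.sample_model`;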
Expected Behavior
When there's a partition or cluster change, we should drop the existing model first and then clone the model with the new partitioning / clustering.
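In plain BigQuery SQL, the behavior I'd expect is roughly the following (placeholder project / dataset names): drop first so the new spec can be applied, then clone.

-- drop the existing table so the new partitioning / clustering spec can be applied
drop table if exists `my-project.prod.sample_model`;

-- recreate it as a clone of the staging relation
create table `my-project.prod.sample_model`
clone `my-project.staging.sample_model`;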
Steps To Reproduce
Create a simple model:

-- sample_model.sql
{{
  config(
    cluster_by = ['col_1', 'col_2'],
    partition_by = {
      'field': 'timestamp',
      'data_type': 'timestamp',
      'granularity': 'day'
    },
  )
}}

select
  'col_1' as col_1,
  'col_2' as col_2,
  CURRENT_TIMESTAMP() as timestamp

Run dbt run -s sample_model -t staging (or against whatever dev / staging environment you have), then run dbt clone -s sample_model -t prod --full-refresh so the same model exists in prod.
Now change the clustering and partitioning in the model:

-- sample_model.sql
{{
  config(
    cluster_by = ['col_1'],
    partition_by = {
      'field': 'timestamp',
      'data_type': 'timestamp',
      'granularity': 'month'
    },
  )
}}

select
  'col_1' as col_1,
  'col_2' as col_2,
  CURRENT_TIMESTAMP() as timestamp

Run dbt run -s sample_model -t staging (or against whatever dev / staging environment you have) again, then run dbt clone -s sample_model -t prod so the model is cloned into prod. You should see an error thrown by BigQuery telling you to drop the model first.
Relevant log output
No response
Environment
- OS: Mac
- Python: 3.8.12
- dbt-core: 1.6.6
- dbt-bigquery: 1.6.7
Additional Context
Similar to the behavior when building incremental models with a partition / clustering change: https://github.com/dbt-labs/dbt-core/blob/40839c79fcf1854942f8db44d1a2648bffe0a2ab/plugins/bigquery/dbt/include/bigquery/macros/materializations/incremental.sql#L23-L29
https://github.com/dbt-labs/dbt-core/issues/2000
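To illustrate, here is a rough sketch of what an analogous guard could look like for clone on BigQuery. The macro name and structure are hypothetical (this is not the actual dbt-bigquery implementation), but should_full_refresh(), adapter.get_relation(), and adapter.drop_relation() are existing dbt helpers:

{# hypothetical sketch: drop the target first under --full-refresh so BigQuery
   does not reject a REPLACE with a different partitioning / clustering spec #}
{% macro bigquery__drop_and_clone(target_relation, state_relation) %}
  {% set existing = adapter.get_relation(
        database=target_relation.database,
        schema=target_relation.schema,
        identifier=target_relation.identifier) %}
  {% if existing is not none and should_full_refresh() %}
    {% do adapter.drop_relation(existing) %}
  {% endif %}
  create table {{ target_relation }}
  clone {{ state_relation }}
{% endmacro %}

In a real materialization this SQL would be executed via a statement() call; the sketch only shows where the drop guard would sit.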
Thanks for opening @SPTKL --- I'm glad you're finding the clone command useful.
My concern here is predominantly user-experience-related, namely: when is --full-refresh required, and when isn't it? Allow me to explain why I picked that as the framing question.
You propose the following as expected behavior:
When there's a partition or cluster change, we should drop the existing model first and then clone the model with the new partitioning / clustering.
I agree, but with an important caveat: only if the --full-refresh flag (or perhaps a new flag) is passed.
My rationale is that partitioning can have a large cost impact, especially on large tables. If a user accidentally clones a dev version of a model with new partitioning to prod and it errors because the partitioning doesn't match -- this is a good thing!
If you look at the corresponding incremental code you shared, the full DROP in lieu of REPLACE will only happen in the context of --full-refresh. In some ways, the --full-refresh flag is a synonym for "I don't care about what's already in the destination; delete it if you need to!"
What do you think? Let me know if I'm off base here.
@dataders thanks for your comment! Yes, that totally makes sense: the drop should only happen when --full-refresh is passed. It also seems like without --full-refresh, clone throws "Relation xxxxx already exists" instead of copying anything.
Thank you for the conversation here! I think we will need more information to formulate the right solution at this time. @dataders brings up a very good point: the cost of repartitioning a model because of an accidental rebuild is not a great experience. I'm also wary of how to ensure a seamless user experience with --full-refresh. Because of this, I am marking this as something we will not be tackling on our roadmap.