dbt-clickhouse icon indicating copy to clipboard operation
dbt-clickhouse copied to clipboard

Resolve Change Relation Type error between Distributed and non-distributed Materializations

Open gfunc opened this issue 2 years ago • 4 comments

Summary

resolve #205 Changing relation type from non-distributed materialization to distributed materialization or otherwise has confusing error messages and full-refresh is not helpful.

Changes

  1. Added a macro validate_relation_existence to:
    1. raise an error when existing relation's on cluster status are not as expected (e.g. relation first created using table materialization then changed to distributed_table)
    2. drop existing relation when a full-refresh flag is provided
    3. drop existing relation when relation type is changed (view to table or otherwise) since view materialization is defaulted to be created on cluster
  2. Added a test in test_changing_relation_type.py
  3. Debug macro clickhouse__list_relations_without_caching, if the cluster has only one node, set is_on_cluster to true.

Checklist

Delete items not relevant to your PR:

  • [X] Unit and integration tests covering the common scenarios were added
  • [X] A human-readable description of the changes was provided to include in CHANGELOG

gfunc avatar Nov 10 '23 12:11 gfunc

Thanks for the investigation here @gfunc. I think we should limit the effect to distributed models only, since I'm uncomfortable adding a lot of code (and extra SQL queries) to validate the the table exists in the non-distributed case. In theory the state should be totally controlled by dbt. If we take the validate piece out of the main table materialization does a full refresh still create a "clean" state for the non-distributed case?

genzgd avatar Nov 23 '23 22:11 genzgd

Hi @genzgd, Thanks for your comment. I think the answer is no. To my understanding, the table materialization (not incremental) now is not affected by the full-refresh flag much except for grants. I will validate that shortly. Also maybe some extra SQLs to handle unexpected existing relations are reasonable for a dbt controlled env? I added only one extra SQL to handle full-refresh, existing relations are loaded from the cache, not affecting performance.

gfunc avatar Nov 29 '23 03:11 gfunc

Hi @gfunc,

Can we as a start, mitigate the changes only for distributed cases as @genzgd suggested? (and add a feature flag /config for the validation part?)

BentsiLeviav avatar Jun 09 '24 12:06 BentsiLeviav

Hi @BentsiLeviav, It could be managed. I could drop those changes in table and incremental materialization. But IMO the main problem is that we should make full-refresh effective in situations where relation materialization is changed from table to distributed_table and vice-versa. But the handling of the full-refresh flag is located in materializations, Do you have any suggestions?

gfunc avatar Jun 11 '24 07:06 gfunc