dbt-clickhouse icon indicating copy to clipboard operation
dbt-clickhouse copied to clipboard

Table with Null engine not created on cluster

Open benjamin-awd opened this issue 1 year ago • 4 comments

Describe the bug

Tables with null engines are not created using the ON CLUSTER clause

Steps to reproduce

  1. Attempt to create a relation with a null engine e.g.
{{
    config(
        engine="Null",
        materialized="materialized_view",
        schema="default",
        cluster="default"
    )
}}

select * from {{ ref("stg_testing") }}
  1. Materialized view + target table is only created on a single server, and not across the entire cluster

Expected behaviour

There should be an option to allow a null engine table to be created on cluster

Code examples, such as models or profile settings

Also tried defining the cluster via profiles.yml, but this makes no difference

benjamin-awd avatar Jul 26 '24 10:07 benjamin-awd

I figure the quick fix for this is to update the logic in get_on_cluster():

    @classmethod
    def get_on_cluster(
        cls: Type[Self], cluster: str = '', materialized: str = '', engine: str = ''
    ) -> bool:
        if cluster.strip():
            return (
                materialized in ('view', 'dictionary')
                or 'distributed' in materialized
                or 'Replicated' in engine
+               or engine in ('Null')
            )

        else:
            return False

benjamin-awd avatar Jul 26 '24 10:07 benjamin-awd

Another relevant issue: https://github.com/ClickHouse/dbt-clickhouse/issues/237

benjamin-awd avatar Jul 26 '24 10:07 benjamin-awd

The current on_cluster macro is really overloaded. It is a conditional macro that checks for things other than cluster.

For example, I wanted to use it in a post-hook for configuring Row Policies but basically couldn't.

Perhaps instead of adding more and more cases to the if statement, we can refactor it to work by default when called, and fix the problems from its misuse in other macros?

Or we should leave the current one as is since it is used everywhere, but make a real_on_cluster_macro that does not do any checking of the relation, just returns the cluster correctly formatted as 'ON CLUSTER <cluster>' if cluster exists... :D . The current one can use the real_on_cluster_macro under the hood. It should really be called conditional_on_cluster_macro since it conditionally returns on_cluster if some conditions not specified in the README are met.

This is a sincere suggestion, what do you think @benjamin-awd? Then you could use it with any type of Engine etc. and the conditional one can be used by those implementations that need more logic such as distributed/Replicated materializations..

emirkmo avatar Jul 26 '24 14:07 emirkmo

@emirkmo I ended up creating a force_on_cluster config argument that should add the ON CLUSTER clause regardless of engine or materialization type in https://github.com/ClickHouse/dbt-clickhouse/pull/328

Perhaps instead of adding more and more cases to the if statement, we can refactor it to work by default when called, and fix the problems from its misuse in other macros?

This is a good point -- I prefer that option (something like this) that adds the ON CLUSTER clause to everything by default if the cluster argument is defined but I'm not sure what consequences that would have.

As a user, when I specify the cluster argument in the config, I'd expect my relations to be created on the cluster. If a user wants to only create a model on a single connected shard, then they should remove the cluster argument from their model (or profiles).

benjamin-awd avatar Jul 28 '24 05:07 benjamin-awd