# Allow for Explicit Schema Definitions (DDLs) when Creating a Table / Model

### Describe the feature
For a given dbt model, we would like the ability to specify the column names, types, descriptions, and nullability explicitly. We are using BigQuery, and this is possible within a CTAS query.

We want this feature mainly so that we can prevent creating any tables in which a column that should not be null contains null values. We do not want to do a check afterwards with `dbt test`, as downstream datasets/reports can potentially be using "corrupted" data.
For example:

```sql
CREATE OR REPLACE TABLE
  `dbt_examples`.`my_first_dbt_model`
-- table / model DDL starts here
(
  x INT64 NOT NULL,
  y INT64
)
-- table / model DDL ends here
OPTIONS() AS (
  SELECT
    1,
    NULL
);
```
Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_examples

### Describe alternatives you've considered
Adding a nullability check as a model test, so that `dbt test` can do this check, but there are significant downsides to doing it this way:

- In BigQuery, these types of checks are expensive, so doing this check on every column of a table after creation can get costly, especially at scale when you have n (tables) * m (columns).
- We do not want to expose datasets to downstream models / the reporting layer if a null value appears where it's not supposed to. We consider this "corrupted" data, and it would produce incorrect results downstream.
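For reference, the conventional approach being ruled out here is a `not_null` test declared in `schema.yml` (a sketch, using the example model above):

```yaml
# schema.yml -- the post-creation check the author wants to avoid
version: 2
models:
  - name: my_first_dbt_model
    columns:
      - name: x
        tests:
          - not_null
```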
### Additional context
We mainly work with BigQuery and know that this is possible there; we are not sure about other data warehouses.

We think it would be beneficial overall to have the check done at creation time rather than after creation. It would also give users more context in the warehouse when they look at the table schema, e.g. what is this column? Is this column nullable?
### Who will this benefit?
Anyone using BigQuery; it would reduce their warehousing costs significantly.
Thanks for opening this thoughtful issue @ericxiao251 -- see also a previous discussion over here: https://github.com/fishtown-analytics/dbt/issues/1438
I don't have any problem at all with this on BigQuery and Snowflake. I don't think this should be the default, though. We also have an issue for adding a "strict" config to schema.yml files: #1570. Maybe there's some sensible interop between these two features? Maybe something like:

```yaml
# schema.yml
models:
  - name: my_model
    ddl_specifies_schema: true
    columns: [ ... ]
```
I think `ddl_specifies_schema` is a terrible name, but I don't have any good ideas at present. What kind of interface are you imagining for specifying this config?
We're still very new to dbt, but my initial thought was to enable this option in a new version, `version: 3`, of `schema.yml`. Your suggestion, though, provides a simple and easy-to-use solution!

Looking at https://github.com/fishtown-analytics/dbt/issues/1570, we would probably enforce this if `ddl_specifies_schema: true`?
Hi! Is there any development on this? Thanks!
This would be a positive development for our team. Some positives:

- Easier to migrate from legacy systems where data models are maintained in `create table` statements (or sqlalchemy models or equivalent).
- A single source of truth for a final table schema, which is nice encapsulation in cases where many tables are being union-ed together into a final materialization; otherwise we'd have to build a casting/type-enforcement model, which becomes a nuisance once an alternative like table definitions is available.

We could also take advantage of some features offered by table definitions:

- Faster failure on schema violations: rather than having to rely on an "is not null" `dbt test`, specifying that a column is `not nullable` in a table definition fails earlier, which can be desirable.
- Specifying default values.
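To illustrate those last two points, here is the sort of table definition we have in mind (a sketch in BigQuery syntax; the table and column names are made up):

```sql
-- Sketch: nullability and defaults enforced at creation time,
-- rather than checked after the fact with `dbt test`.
CREATE OR REPLACE TABLE `dbt_examples`.`orders` (
  id INT64 NOT NULL,              -- inserting a NULL here fails the build immediately
  status STRING DEFAULT 'pending' -- default value supplied by the table definition
);
```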
Now, as an alternative for those looking to utilize `create table` statements in their current dbt project, we came up with a hack:

- Specify your model as a `view`.
- Add a post-hook, e.g. `insert_into_table(this)`, that does a `create table like ...` from some predefined but empty table object with your desired table definition, and inserts from the view into that table, doing the name swapping/gymnastics to ensure the final table has the right name (see the wiring sketch after the pseudocode below).
- Alternatively, the post-hook can specify your `create table` statement and use that to create the table and insert the view results into it; as in the previous solution, name swapping is required to ensure the right outcome.
some pseudocode:
```sql
{% macro insert_into_predefined_model(this, stage_prefix) %}
    {# Pseudocode: temp_target, additional_fields, and expected_fields are assumed
       to be defined elsewhere; 'predefined' is a source holding empty table shells
       with the desired table definitions. #}
    {% set entity = source('predefined', this.name | replace(stage_prefix, '')) %}
    {% set sql %}
        create or replace transient table {{ temp_target }} as ( select * from {{ entity }} limit 0 );
        {% for col in additional_fields %}
            {% if col.name not in expected_fields %}
                {{ exceptions.raise_compiler_error("When compiling " ~ this ~ "... found an unexpected field: " ~ col) }}
            {% endif %}
            alter table {{ temp_target }} add column {{ col.name }} {{ col.data_type }};
        {% endfor %}
        insert into {{ temp_target }} ({{ dbt_utils.star(from=this) }}) select * from {{ this }};
        drop view if exists {{ this }};
        alter table {{ temp_target }} rename to {{ this }}
    {% endset %}
    {{ return(sql) }}
{% endmacro %}
```
or
```sql
{% macro insert_into_with_my_table_defn(this) %}
    {% set create_table_sql %}
        CREATE OR REPLACE TABLE {{ this.schema }}.blah (
            a number(38,0),
            b number(38,0)
            ...
        )
    {% endset %}
    {% do run_query(create_table_sql) %}
    {% set insert_stmt %}
        insert into {{ this.schema }}.BLAH ({{ dbt_utils.star(from=this) }}) select * from {{ this }};
        drop table if exists {{ this }};
        alter table {{ this.schema }}.BLAH rename to {{ this }};
    {% endset %}
    {% do run_query(insert_stmt) %}
{% endmacro %}
```
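To make the hack concrete, either macro would be attached as a post-hook on the view model, roughly like this (a sketch; the model body and names are illustrative):

```sql
-- models/my_model.sql (illustrative)
{{ config(
    materialized='view',
    post_hook="{{ insert_into_with_my_table_defn(this) }}"
) }}

select 1 as a, 2 as b
```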
I built a working example in BigQuery of what @smomen suggested, so let me share it.
- `schema.yml`

```yaml
version: 2
models:
  - name: my_model
    columns:
      - name: x
        description: "x column"
        meta:
          type: "INTEGER"
          mode: "REQUIRED"
      - name: y
        description: "y column"
        meta:
          type: "INTEGER"
```
- `my_model.sql`

```sql
{{ config(
    materialized='table',
    post_hook="{{ create_table_with_table_schema() }}"
) }}

SELECT 1 AS x, 2 AS y
```
- `create_table_with_table_schema.sql`

```sql
{% macro create_table_with_table_schema() %}
    {% set col_obj = model.columns %}
    {% set table_name = model.name %}
    {% set tmp_table_name = "{}_tmp".format(table_name) %}
    {%- set sql %}
        create or replace table {{ this.schema }}.{{ tmp_table_name }} (
        {%- for col_name in col_obj %}
            {%- set type = col_obj[col_name].meta.type %}
            {%- set mode = col_obj[col_name].meta.mode %}
            {%- set description = col_obj[col_name].description %}
            {%- set col_definition %}
                {{ col_name }} {{ type }} {{ "not null" if mode == "REQUIRED" }} {{ 'options(description="{}")'.format(description) if description }}
            {%- endset %}
            {%- if loop.last %}
                {{ col_definition }}
            {%- else %}
                {{ col_definition }},
            {%- endif %}
        {%- endfor %}
        );
        insert into {{ this.schema }}.{{ tmp_table_name }} ({{ dbt_utils.star(from=this) }}) select * from {{ this }};
        drop table if exists {{ this }};
        alter table {{ this.schema }}.{{ tmp_table_name }} rename to {{ table_name }};
    {% endset %}
    {% do run_query(sql) %}
{% endmacro %}
```
- The column type and mode are specified in the `meta` property.
- Since the second `...` in `alter table ... rename to ...;` only accepts a bare `table` name (not `project_id.dataset.table`) in BigQuery, I changed `alter table {{ temp_target }} rename to {{ this }}` to `alter table {{ this.schema }}.{{ tmp_table_name }} rename to {{ table_name }};`.
Friends, it's worth keeping an eye on https://github.com/dbt-labs/dbt-core/issues/6079, which might mostly satisfy this issue. You'll get not-null enforcement at execution time. @mmiyahara, what do you think?
@smomen Thank you for letting me know. I agree that https://github.com/dbt-labs/dbt-core/issues/6079 might solve this issue. I'll keep an eye on it.
Is this issue solved by https://github.com/dbt-labs/dbt-core/pull/6271? Model contracts are a new feature that dbt Labs has been working on.
Short answer: yes!
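For anyone finding this thread later: with model contracts, the original request looks roughly like this (a sketch of the contract syntax introduced around dbt 1.5; details may differ by version and adapter):

```yaml
# schema.yml -- a model contract enforcing types and nullability at build time
models:
  - name: my_first_dbt_model
    config:
      contract:
        enforced: true
    columns:
      - name: x
        data_type: int64
        constraints:
          - type: not_null
      - name: y
        data_type: int64
```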
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.