# Allow for Explicit Schema Definitions (DDLs) when Creating a Table / Model

### Describe the feature
For a given dbt model, we would like the ability to specify the column names, types, descriptions, and nullability explicitly. We are using BigQuery, and this is possible within a CTAS query.

We want this feature mainly so that we can prevent creating any tables in which a column that should not be null contains null values. We do not want to do a check afterwards with `dbt test`, as downstream datasets/reports can potentially be using "corrupted" data.
For example:

```sql
CREATE OR REPLACE TABLE
  `dbt_examples`.`my_first_dbt_model`
-- table / model DDL starts here
(
  x INT64 NOT NULL,
  y INT64
)
-- table / model DDL ends here
OPTIONS() AS (
  SELECT
    1,
    NULL
);
```
Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_examples

### Describe alternatives you've considered
Adding a nullability check as a model test, so that `dbt test` can do this check, but there are significant downsides to doing it this way:

- In BigQuery, these types of checks are expensive, so doing this check on every column of a table after creation can get costly, especially at scale when you have n (tables) * m (columns).
- We do not want to expose datasets to downstream models / the reporting layer if a null value appears where it's not supposed to. We consider this "corrupted" data, and it would produce incorrect results downstream.
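For reference, the conventional approach being ruled out here is a `not_null` test declared in `schema.yml` (a sketch, using the example model above):

```yaml
# schema.yml -- the post-creation check the author wants to avoid
version: 2
models:
  - name: my_first_dbt_model
    columns:
      - name: x
        tests:
          - not_null
```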
### Additional context
We mainly work with BigQuery and know that this is possible there; we are not sure about other data warehouses.

We think it would be beneficial overall to have the check done at creation time rather than after creation. It would also give users more context in the warehouse when they look at the table schema, e.g. what is this column? Is this column nullable?
### Who will this benefit?
Anyone using BigQuery; it would reduce their warehousing costs significantly.
Thanks for opening this thoughtful issue @ericxiao251 -- see also a previous discussion over here: https://github.com/fishtown-analytics/dbt/issues/1438
I don't have any problem at all with this on BigQuery and Snowflake. I don't think this should be the default, though. We also have an issue for adding a "strict" config to schema.yml files: #1570. Maybe there's some sensible interop between these two features? Maybe something like:

```yaml
# schema.yml
models:
  - name: my_model
    ddl_specifies_schema: true
    columns: [ ... ]
```
I think `ddl_specifies_schema` is a terrible name, but I don't have any good ideas at present. What kind of interface are you imagining for specifying this config?
We're still very new to dbt, but my initial thought was to enable this option in a new version, `version: 3`, of `schema.yml`. Your suggestion, though, provides a simple and easy-to-use solution!

Looking at https://github.com/fishtown-analytics/dbt/issues/1570, we would probably enforce this if `ddl_specifies_schema: true`?
Hi! Is there any development on this? Thanks!
This would be a positive development for our team. Some positives:

- Easier to migrate from legacy systems where data models are maintained in `create table` statements (or sqlalchemy models or equivalent).
- A single source of truth for a final table schema, which is nice encapsulation in cases where many tables are being union-ed together into a final materialization; otherwise we'd have to build a casting/type-enforcement model, which becomes a nuisance once an alternative like table definitions is available.

We could also take advantage of some features offered by table definitions:

- Faster failure on schema violations: rather than having to rely on an "is not null" `dbt test`, specifying that a column is `not nullable` in a table definition fails earlier, which can be desirable.
- Specifying default values.
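To illustrate those last two points, here is the sort of table definition we have in mind (a sketch in BigQuery syntax; the table and column names are made up):

```sql
-- Sketch: nullability and defaults enforced at creation time,
-- rather than checked after the fact with `dbt test`.
CREATE OR REPLACE TABLE `dbt_examples`.`orders` (
  id INT64 NOT NULL,              -- inserting a NULL here fails the build immediately
  status STRING DEFAULT 'pending' -- default value supplied by the table definition
);
```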
Now, as an alternative for those looking to utilize `create table` statements in their current dbt project, we came up with a hack:

- Specify your model as a `view`.
- Add a post-hook, e.g. `insert_into_table(this)`, that does a `create table like ...` from some predefined but empty table object with your desired table definition, and inserts from the view into that table, doing the name swapping/gymnastics to ensure the final table has the right name (see the wiring sketch after the pseudocode below).
- Alternatively, the post-hook can specify your `create table` statement and use that to create the table and insert the view results into it; as in the previous solution, name swapping is required to ensure the right outcome.
some pseudocode:
```sql
{% macro insert_into_predefined_model(this, stage_prefix) %}
    {# Pseudocode: temp_target, additional_fields, and expected_fields are assumed
       to be defined elsewhere; 'predefined' is a source holding empty table shells
       with the desired table definitions. #}
    {% set entity = source('predefined', this.name | replace(stage_prefix, '')) %}
    {% set sql %}
        create or replace transient table {{ temp_target }} as ( select * from {{ entity }} limit 0 );
        {% for col in additional_fields %}
            {% if col.name not in expected_fields %}
                {{ exceptions.raise_compiler_error("When compiling " ~ this ~ "... found an unexpected field: " ~ col) }}
            {% endif %}
            alter table {{ temp_target }} add column {{ col.name }} {{ col.data_type }};
        {% endfor %}
        insert into {{ temp_target }} ({{ dbt_utils.star(from=this) }}) select * from {{ this }};
        drop view if exists {{ this }};
        alter table {{ temp_target }} rename to {{ this }}
    {% endset %}
    {{ return(sql) }}
{% endmacro %}
```
or
```sql
{% macro insert_into_with_my_table_defn(this) %}
    {% set create_table_sql %}
        CREATE OR REPLACE TABLE {{ this.schema }}.blah (
            a number(38,0),
            b number(38,0)
            ...
        )
    {% endset %}
    {% do run_query(create_table_sql) %}
    {% set insert_stmt %}
        insert into {{ this.schema }}.BLAH ({{ dbt_utils.star(from=this) }}) select * from {{ this }};
        drop table if exists {{ this }};
        alter table {{ this.schema }}.BLAH rename to {{ this }};
    {% endset %}
    {% do run_query(insert_stmt) %}
{% endmacro %}
```
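To make the hack concrete, either macro would be attached as a post-hook on the view model, roughly like this (a sketch; the model body and names are illustrative):

```sql
-- models/my_model.sql (illustrative)
{{ config(
    materialized='view',
    post_hook="{{ insert_into_with_my_table_defn(this) }}"
) }}

select 1 as a, 2 as b
```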
I built a working example in BigQuery of what @smomen suggested, so let me share it.
- `schema.yml`

```yaml
version: 2
models:
  - name: my_model
    columns:
      - name: x
        description: "x column"
        meta:
          type: "INTEGER"
          mode: "REQUIRED"
      - name: y
        description: "y column"
        meta:
          type: "INTEGER"
```
- `my_model.sql`

```sql
{{ config(
    materialized='table',
    post_hook="{{ create_table_with_table_schema() }}"
) }}

SELECT 1 AS x, 2 AS y
```
- `create_table_with_table_schema.sql`

```sql
{% macro create_table_with_table_schema() %}
    {% set col_obj = model.columns %}
    {% set table_name = model.name %}
    {% set tmp_table_name = "{}_tmp".format(table_name) %}
    {%- set sql %}
        create or replace table {{ this.schema }}.{{ tmp_table_name }} (
        {%- for col_name in col_obj %}
            {%- set type = col_obj[col_name].meta.type %}
            {%- set mode = col_obj[col_name].meta.mode %}
            {%- set description = col_obj[col_name].description %}
            {%- set col_definition %}
                {{ col_name }} {{ type }} {{ "not null" if mode == "REQUIRED" }} {{ 'options(description="{}")'.format(description) if description }}
            {%- endset %}
            {%- if loop.last %}
                {{ col_definition }}
            {%- else %}
                {{ col_definition }},
            {%- endif %}
        {%- endfor %}
        );
        insert into {{ this.schema }}.{{ tmp_table_name }} ({{ dbt_utils.star(from=this) }}) select * from {{ this }};
        drop table if exists {{ this }};
        alter table {{ this.schema }}.{{ tmp_table_name }} rename to {{ table_name }};
    {% endset %}
    {% do run_query(sql) %}
{% endmacro %}
```
- The column type and mode are specified in the `meta` property.
- Since the second `...` in `alter table ... rename to ...;` only accepts a bare `table` name (not `project_id.dataset.table`) in BigQuery, I changed `alter table {{ temp_target }} rename to {{ this }}` to `alter table {{ this.schema }}.{{ tmp_table_name }} rename to {{ table_name }};`.
Friends, it's worth keeping an eye on https://github.com/dbt-labs/dbt-core/issues/6079, which might mostly satisfy this issue. You'll get not-null enforcement at execution time. @mmiyahara, what do you think?
@smomen Thank you for letting me know. I agree that https://github.com/dbt-labs/dbt-core/issues/6079 might solve this issue. I'll keep an eye on it.
Is this issue solved by https://github.com/dbt-labs/dbt-core/pull/6271? Model contracts are a new feature that dbt Labs has been working on.
Short answer: yes!
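For anyone finding this thread later: with model contracts, the original request looks roughly like this (a sketch of the contract syntax introduced around dbt 1.5; details may differ by version and adapter):

```yaml
# schema.yml -- a model contract enforcing types and nullability at build time
models:
  - name: my_first_dbt_model
    config:
      contract:
        enforced: true
    columns:
      - name: x
        data_type: int64
        constraints:
          - type: not_null
      - name: y
        data_type: int64
```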
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.