dbt-databricks
Support new `constraints` spec, and specifying constraints in `create table as`
(Happy to split this into multiple issues — keeping it together here for now, to keep the conversation centralized)
Describe the feature
First, we are changing the spec for constraints to make them a first-class property/configuration, and the implementation in dbt-databricks should be updated to match (rather than the current meta-based approach).
Changes to dbt-core:
- https://github.com/dbt-labs/dbt-core/issues/7066
- https://github.com/dbt-labs/dbt-core/issues/7067
Associated changes to dbt-spark:
- https://github.com/dbt-labs/dbt-spark/issues/655
- https://github.com/dbt-labs/dbt-spark/issues/656
- https://github.com/dbt-labs/dbt-spark/issues/657
Second: this is a feature request for Databricks, more so than for dbt-databricks! Namely, support for specifying constraints within a `create table ... as` statement:
```sql
create table <table> (<column_name> <data_type> not null) as (
  select <query>
);
```
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html
> This optional clause populates the table using the data from `query`. When you specify a `query` you must not also specify a `column_specification`. The table schema will be derived from the query.
Context & alternatives
Databricks supports many types of constraints, and it can enforce two: not null constraints on columns in a table, and check constraints of boolean expressions on one or more columns in a table.
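For reference, the two enforced constraint types look roughly like this. This is a sketch with hypothetical table and column names, not taken from the issue itself:

```sql
-- NOT NULL: declared as part of the column definition
create table orders (
  order_id bigint not null,
  amount decimal(10, 2)
);

-- CHECK: a boolean expression over one or more columns, added via ALTER TABLE
alter table orders add constraint amount_is_positive check (amount > 0);
```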
However, Databricks does not support including a column spec / constraint within a CTA statement (`create or replace table <tablename> as <sql>`). This puts us in a tricky situation when trying to atomically replace a table and enforce constraints on the data being added in. These are the options, as we understand them:

1. `create or replace` the table with the constraints, then `insert` the data returned by the query, with the constraints enforced on any data being added in. Because Databricks does not support transactions, the table will appear empty to downstream queriers while the second statement (`insert`) is running.
2. Create a new table with the constraints, in a separate location, then `insert` the data returned by the query, with the constraints enforced. If all constraints pass, move the new table to the preexisting table's location, requiring a "deep" clone that can be very slow (effectively copying the full dataset from one location to another).
3. Atomically replace the table via `create or replace table ... as`, ensuring zero downtime and no need to move data, then apply the constraints after the fact via `alter table` statements. Unfortunately, this means that the constraints aren't actually enforced until after the fact (no better than existing `dbt test`), and that the model's table can therefore include data that violates its contracted expectations.
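For illustration, option 3 looks roughly like the following. This is a sketch with hypothetical table names, not the exact DDL that dbt generates:

```sql
-- 1. Atomically replace the table: zero downtime, but no constraints yet
create or replace table my_model as
select order_id, amount from staging_orders;

-- 2. Apply the constraints after the fact; by this point the table may
--    already contain rows that violate them
alter table my_model alter column order_id set not null;
alter table my_model add constraint amount_is_positive check (amount > 0);
```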
Is that understanding correct? For now, we have opted for option 3 in dbt-spark (https://github.com/dbt-labs/dbt-spark/pull/574), as it is closest to the existing pattern for atomically replacing models and testing them after the fact. (It's also the existing implementation in dbt-databricks, which allows defining constraints in the `meta` dictionary.) The ideal would be the ability to include constraints within a CTA (`create or replace table ... as`).
Who will this benefit?
- Users of "model contracts" in dbt v1.5+ (beta documentation: https://docs.getdbt.com/docs/collaborate/publish/model-contracts)
Are you interested in contributing this feature?
For the first bit, the code changes should look a lot like the changes that we'll be making in dbt-spark.
For the second, I believe this would require a change to Apache Spark or Databricks, so I don't think I'd know how :)