dbt-databricks
Support new `constraints` spec, and specifying constraints in `create table as`
(Happy to split this into multiple issues — keeping it together here for now, to keep the conversation centralized)
Describe the feature
First, we are changing the spec for constraints to make them a first-class property/configuration, and the implementation in dbt-databricks should be updated to match (rather than the current meta-based approach).
Changes to dbt-core:
- https://github.com/dbt-labs/dbt-core/issues/7066
- https://github.com/dbt-labs/dbt-core/issues/7067
Associated changes to dbt-spark:
- https://github.com/dbt-labs/dbt-spark/issues/655
- https://github.com/dbt-labs/dbt-spark/issues/656
- https://github.com/dbt-labs/dbt-spark/issues/657
Second: this is a feature request for Databricks, more so than for dbt-databricks! Namely, support for specifying constraints within a `create table ... as` statement:
```sql
create table <table> (<column_name> <data_type> not null) as (
  select <query>
);
```
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html
> This optional clause populates the table using the data from `query`. When you specify a `query` you must not also specify a `column_specification`. The table schema will be derived from the query.
Context & alternatives
Databricks supports many types of constraints, and it can enforce two: not null constraints on columns in a table, and check constraints of boolean expressions on one or more columns in a table.
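For reference, the two enforced constraint types look roughly like this. This is a sketch with hypothetical table and column names, not taken from the issue itself:

```sql
-- NOT NULL: declared as part of the column definition
create table orders (
  order_id bigint not null,
  amount decimal(10, 2)
);

-- CHECK: a boolean expression over one or more columns, added via ALTER TABLE
alter table orders add constraint amount_is_positive check (amount > 0);
```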
However, Databricks does not support including a column spec / constraint within a CTA statement (`create or replace table <tablename> as <sql>`). This puts us in a tricky situation when trying to atomically replace a table and enforce constraints on the data being added in. These are the options, as we understand them:

1. `create or replace` the table with the constraints, then `insert` the data returned by the query, with the constraints enforced on any data being added in. Because Databricks does not support transactions, the table will appear empty to downstream queriers while the second statement (`insert`) is running.
2. Create a new table with the constraints, in a separate location, then `insert` the data returned by the query, with the constraints enforced. If all constraints pass, move the new table to the preexisting table's location, requiring a "deep" clone that can be very slow (effectively copying the full dataset from one location to another).
3. Atomically replace the table via `create or replace table ... as`, ensuring zero downtime and no need to move data, then apply the constraints after the fact via `alter table` statements. Unfortunately, this means that the constraints aren't actually enforced until after the fact (no better than existing `dbt test`), and that the model's table can therefore include data that violates its contracted expectations.
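For illustration, option 3 looks roughly like the following. This is a sketch with hypothetical table names, not the exact DDL that dbt generates:

```sql
-- 1. Atomically replace the table: zero downtime, but no constraints yet
create or replace table my_model as
select order_id, amount from staging_orders;

-- 2. Apply the constraints after the fact; by this point the table may
--    already contain rows that violate them
alter table my_model alter column order_id set not null;
alter table my_model add constraint amount_is_positive check (amount > 0);
```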
Is that understanding correct? For now, we have opted for option 3 in dbt-spark (https://github.com/dbt-labs/dbt-spark/pull/574), as it is closest to the existing pattern for atomically replacing models and testing them after the fact. (It's also the existing implementation in dbt-databricks, which allows defining constraints in the `meta` dictionary.) The ideal would be the ability to include constraints within a CTA (`create or replace table ... as`).
Who will this benefit?
- Users of "model contracts" in dbt v1.5+ (beta documentation: https://docs.getdbt.com/docs/collaborate/publish/model-contracts)
Are you interested in contributing this feature?
For the first bit, the code changes should look a lot like the changes that we'll be making in dbt-spark.
For the second, I believe this would require a change to Apache Spark or Databricks, so I don't think I'd know how :)