dbt-databricks
Add support for `INSERT ... BY NAME`
Describe the feature
Starting with Databricks Runtime 13.3, the `BY NAME` clause can be used instead of an explicit column list in an `INSERT` statement (documentation).
The feature would add support for this clause as part of the `insert_overwrite` strategy.
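For example, the behavior could be opted into per model via a config flag. A minimal sketch (the `insert_by_name` flag and the source path are hypothetical, shown only to illustrate the idea):

```sql
{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by=["creation_date"],
        insert_by_name=true  -- hypothetical flag, name TBD
    )
}}
SELECT *
FROM READ_FILES('/Volumes/my_catalog/my_schema/my_volume/*/data.csv', format => 'csv')
```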
Additional context
This feature would be very handy for ingesting files from cloud storage.
Let's say that we have this model that inserts new partitions every day:
```sql
{{ config(materialized="incremental", incremental_strategy="insert_overwrite", partition_by=["creation_date"]) }}
SELECT *
FROM READ_FILES('/Volumes/my_catalog/my_schema/my_volume/{{ var("partition") }}/data.csv', format => 'csv')
```
The first time (`partition=2023/09/06`), the ingested data looks like this:

```
creation_date,a,b
2023-09-06,1,1
2023-09-06,2,2
```
The table produced has the following schema:

```
creation_date: date, a: int, b: int
```
The second time (`partition=2023/09/07`), the ingested data looks like this:

```
creation_date,a
2023-09-07,3
2023-09-07,4
```
The column `b` is missing. At this point, the current adapter would fail because it would execute this command:

```sql
INSERT OVERWRITE TABLE my_catalog.my_schema.my_table
PARTITION (creation_date)
SELECT
    creation_date,
    a,
    b
FROM my_table__dbt_tmp -- a view corresponding to the compiled model
```

but `b` doesn't exist in the view.
This would not be the case if we instead executed this:

```sql
INSERT OVERWRITE TABLE my_catalog.my_schema.my_table
PARTITION (creation_date) BY NAME -- BY NAME this time
SELECT *
FROM my_table__dbt_tmp
```

The value for `b` would simply be `NULL`.
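As a standalone illustration of the clause's semantics on DBR 13.3+ (the table and values below are made up for this sketch): columns are matched by name rather than by position, so target columns that are absent from the query resolve to `NULL`.

```sql
CREATE TABLE my_table (creation_date DATE, a INT, b INT);

-- b is not selected; BY NAME matches creation_date and a by name
-- and fills b with NULL instead of failing on a column-count mismatch.
INSERT INTO my_table BY NAME
SELECT DATE'2023-09-07' AS creation_date, 3 AS a;
```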
Who will this benefit?
People who want to ingest "bronze" data (with schema evolution in mind) even when some columns are absent.
Related issues
#334