
Table materialization (V2) leads to unavailability of data

Open Jeremynadal33 opened this issue 5 months ago • 10 comments

Describe the bug

TL;DR: stop doing create or replace followed by insert when the table already exists (with valid data).

Two problems with the V2 "table" materialization:

  • The table materialization runs two steps (create or replace + insert), which can leave the data unavailable in between; see the simplified SQL sketch after this list.
  • Constraints can now be applied beforehand (thanks a lot!), but if new data arriving afterwards violates those constraints, the table has already been recreated, so all existing valid data is gone.
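
A simplified sketch of what those two steps boil down to, as I understand the V2 materialization (table and column taken from the reproduction below):

-- step 1: the table is recreated empty, so existing rows disappear immediately
create or replace table my_catalog.my_schema.toto (name string not null);

-- step 2: the data comes back in a separate command; if it fails
-- (e.g. on a not_null violation), the table stays empty
insert into my_catalog.my_schema.toto
select 'toto' as name;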

The problem is that Delta does not handle transactions across multiple commands... One way to overcome the second problem would be to add a clause that puts the data back using RESTORE, but that still leaves the first problem.
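
For reference, the kind of clause I have in mind relies on Delta time travel, something like this (version number purely illustrative):

-- put the previous valid data back after a failed insert
restore table my_catalog.my_schema.toto to version as of 1;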

Then I thought of using the insert_overwrite incremental strategy, which replaces all data in a single transaction and therefore does not remove existing valid data in a constraint-violation scenario. But the strategy isn't complete yet (not possible to add columns, if I read correctly) and is unavailable on SQL warehouses (though I don't understand why, because this would be a valid case for completely overwriting the table; maybe it should only have been a warning in the logs?).
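
For context, this is roughly the configuration I tried (on a cluster, since the strategy is rejected on SQL warehouses):

{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
  )
}}

select 'toto' as name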

I think using insert_overwrite (without a partition predicate) in the table materialization when the relation already exists in the database could be an option, but it may lead to complications (what if a column type changes?). Happy to discuss!
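
Concretely, when the relation already exists, the materialization could issue a single atomic statement instead of the two steps above (sketch, assuming the target schema is unchanged):

-- one Delta transaction: if a constraint is violated the statement fails
-- and the previous data stays in place and readable
insert overwrite my_catalog.my_schema.toto
select 'toto' as name;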

Steps To Reproduce

Create a model toto.sql:

{{
  config(
    materialized = 'table',
    )
}}

{% if load_relation(this) %} {# meaning it exists in the database #}
  select null as name
  union all
  select 'toto' as name
{% else %} {# meaning it does not exist in the database #}
  select 'toto' as name
{% endif %}

And its corresponding toto.yml

models:
  - name: toto
    config:
      contract:
        enforced: true
    columns:
      - name: name
        data_type: string
        constraints:
          - type: not_null

Then run:

  • dbt run -s toto --> succeeds, and select * from my_catalog.my_schema.toto returns 1 row
  • dbt run -s toto --> fails, and select * from my_catalog.my_schema.toto returns 0 rows

Expected behavior

I want my valid data to be preserved and zero downtime for my table.

System information

The output of dbt --version:

Core:
  - installed: 1.10.2
  - latest:    1.10.5 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.10.4 - Up to date!
  - spark:      1.9.2  - Update available!

  At least one plugin is out of date with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: Mac OS

The output of python --version: Python 3.13.5

Additional context

Thanks everyone for all the good work!! 🙏 I would be glad to contribute once we decide how to handle this problem! Don't hesitate to reach out; we can even schedule a call if you have time for it.

Jeremynadal33 avatar Jul 22 '25 16:07 Jeremynadal33

@benc-db if I'm understanding correctly, this is the type of problem we want to address with multi-statement transactions, right?

ericj-db avatar Jul 22 '25 22:07 ericj-db

Yes, implementing transaction support will ensure the commit is atomic; until the new version of the table is available, all readers will see the old version of the table (assuming a reasonable implementation of transactions).
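
Purely for illustration (the actual Databricks syntax is not final), the idea is that both statements would run inside one transaction:

-- hypothetical sketch only
begin transaction;
  create or replace table my_catalog.my_schema.toto (name string not null);
  insert into my_catalog.my_schema.toto select 'toto' as name;
commit;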

benc-db avatar Jul 22 '25 22:07 benc-db

Thanks for the quick reply! Just to make things clear for me: are you talking about the coordinated commits feature coming in Delta 4.0?

I feel like this is a massive change, unless Databricks intends to simplify things? I haven't seen any announcements so far. In the meantime, I still think this is a bug; what do you think about my proposals for a fix? I can easily try to implement the solution I have in mind if you think it is a good idea.

On the other hand, I hadn't noticed it before, but using the use_safer_relation_operations = True config can do the job, since the "alters" are almost instantaneous 🤔 Any idea why this defaults to False? However, I feel like it slows things down quite a bit!
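
For anyone else landing here, this is roughly how I set it, at the model level next to the materialization (same pattern as the reproduction model above):

{{
  config(
    materialized = 'table',
    use_safer_relation_operations = true,
  )
}}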

Thanks for your work and your replies 🙏

Jeremynadal33 avatar Jul 23 '25 14:07 Jeremynadal33

Yes, use_safer_relation_operations seeks to minimize downtime. The benefit of us being able to use transactions is getting that zero downtime while keeping Unity Catalog history intact. I think the Databricks implementation is probably being built on coordinated commits, so yes, the implementation will be simplified before we adopt it: https://www.databricks.com/dataaisummit/session/multi-format-multi-table-multi-statement-transactions-unity-catalog

benc-db avatar Jul 25 '25 19:07 benc-db

Alright, thanks a lot for the video!! I cannot wait to see it fully integrated into Databricks!!! I just realized that the use_safer_relation_operations flag makes us lose table history...

In the meantime, what do you think about one of the following:

  • Rework the table and incremental (when full-refreshing) strategies so that they do an insert overwrite instead of either safe_relation_replace or create_table_at when the table already exists. We could then remove use_safer_relation_operations, and the overwrite is an atomic operation that preserves existing constraints! However, we would need to handle all schema changes at each run (see the rough sketch after this list).
  • Way easier (but it still doesn't handle the full-refresh problem for incremental models): allow the insert_overwrite strategy on SQL warehouses and add a clear warning so users understand what it does (also in the documentation, I guess).
  • Do nothing for now and wait for MSTs to reach GA (I can write my own materialization in the meantime).
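
To make the first option concrete, here is a rough sketch of the branching I have in mind (pseudocode, not the actual dbt-databricks macros; existing_relation / target_relation / compiled_sql are placeholders):

{# pseudocode sketch of the table materialization, not the real adapter code #}
{% if existing_relation is not none %}
  {# relation exists: a single atomic overwrite keeps constraints and history,
     and readers see the old data until the statement commits #}
  insert overwrite {{ target_relation }}
  {{ compiled_sql }}
{% else %}
  {# first run: fall back to the existing creation path (simplified here as a plain CTAS) #}
  create or replace table {{ target_relation }} as
  {{ compiled_sql }}
{% endif %}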

Thanks again for your time 🙏

Jeremynadal33 avatar Jul 29 '25 09:07 Jeremynadal33

@benc-db I also hit this issue and solved the downtime using use_safer_relation_operations. But the __dbt_stg tables persist; is that the normal behavior?

alxsbn avatar Oct 01 '25 16:10 alxsbn

@alxsbn that is not expected, can you raise it as a separate issue? I would only expect that to happen if the task fails (in that case the staging table sticks around so that you can investigate the failure).

benc-db avatar Oct 01 '25 16:10 benc-db

btw @benc-db I have two people from Databricks (Ben Mathew and Prakhar Jain) telling me that DDL operations won't be supported in MSTs.

I think we should find another way of dealing with table creation with constraints. use_safer_relation_operations does not work then...

Still available for brainstorming if needed !

Jeremynadal33 avatar Oct 01 '25 16:10 Jeremynadal33

telling me that DDL operations won't be supported in MSTs.

Can you provide more details?

benc-db avatar Oct 01 '25 16:10 benc-db

@benc-db You're right, the staging tables disappear after another run. Still, it's annoying to have them, since we build refined tables in a specific catalog.schema that is consumed by our analytics tool, so from time to time people could see tables like this. It could be interesting to push them to a technical schema, wdyt?

alxsbn avatar Oct 02 '25 08:10 alxsbn