Table materialization (V2) leads to unavailability of data
Describe the bug
TL;DR: stop doing a create or replace followed by an insert when the table already exists (with valid data).
There have been two problems since the V2 materialization for "table":
- The table materialization runs 2 steps (create or replace + insert), which can make the data unavailable between the two steps (sketched below).
- Constraints can now be applied beforehand (thanks a lot!), but if new data arriving afterwards violates those constraints, the table has already been recreated first, hence all existing data is removed.
The underlying problem is that Delta does not handle transactions across multiple commands... One way to overcome the second problem would be to add a clause that puts the data back using RESTORE, but that still leaves the first problem.
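To illustrate, here is a simplified sketch of the two-step sequence and why it loses data. This is not the exact SQL dbt generates; the table name is just the one from my repro below.

```sql
-- Step 1: the existing table, including all of its valid rows, is replaced by
-- an empty table carrying the contract / constraints. From this point on,
-- readers of the table see zero rows.
CREATE OR REPLACE TABLE my_catalog.my_schema.toto (
  name STRING NOT NULL
);

-- Step 2: the model's rows are inserted. If any row violates a constraint
-- (here, a NULL name), this statement fails, but step 1 has already removed
-- the previous data and Delta cannot roll back across the two commands.
INSERT INTO my_catalog.my_schema.toto
SELECT NULL AS name
UNION ALL
SELECT 'toto' AS name;
```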
Then I thought of using the insert_overwrite incremental strategy, which would replace all data in a single transaction and therefore not remove existing valid data in a constraint-violation scenario. But the strategy isn't complete yet (not possible to add columns, if I read correctly) and it is unavailable for SQL warehouses (even though I don't understand why, because this would be a valid case for completely overwriting the table; maybe it should only have been a warning in the logs?).
I think using insert_overwrite (without a partition predicate) in the table materialization when the relation already exists in the database would be an option, but this may lead to complications (what if a column type changes?). Happy to discuss!
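For comparison, a rough sketch of the single-statement alternative I have in mind, assuming the relation already exists with a compatible schema. Because it is a single Delta commit, a constraint violation aborts the whole statement and the previous version of the table stays available:

```sql
-- One atomic operation: readers see either the old rows or the new rows,
-- never an empty table. If the NOT NULL constraint is violated, the commit
-- is aborted and the existing valid rows are kept.
INSERT OVERWRITE TABLE my_catalog.my_schema.toto
SELECT NULL AS name
UNION ALL
SELECT 'toto' AS name;
```

Schema changes (new columns, type changes) would still need to be handled separately, which is the complication mentioned above.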
Steps To Reproduce
Create a model `toto.sql`:

```sql
{{
  config(
    materialized = 'table',
  )
}}

{% if load_relation(this) %} {# meaning it exists in the database #}
select null as name
union all
select 'toto' as name
{% else %} {# meaning it does not exist in the database #}
select 'toto' as name
{% endif %}
```
And its corresponding `toto.yml`:

```yaml
models:
  - name: toto
    config:
      contract:
        enforced: true
    columns:
      - name: name
        data_type: string
        constraints:
          - type: not_null
```
Then run:

1. `dbt run -s toto` → succeeds, and `select * from my_catalog.my_schema.toto` returns 1 row
2. `dbt run -s toto` → fails, and `select * from my_catalog.my_schema.toto` returns 0 rows
Expected behavior
I want my existing valid data not to be removed, and zero downtime for my table.
System information
The output of dbt --version:
```
Core:
  - installed: 1.10.2
  - latest:    1.10.5 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.10.4 - Up to date!
  - spark:      1.9.2  - Update available!

  At least one plugin is out of date with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation
```
The operating system you're using: Mac OS
The output of python --version:
Python 3.13.5
Additional context
Thanks everyone for all the good work!! 🙏 I would be glad to contribute once we decide how to handle this problem! Don't hesitate to reach out; we can even schedule a call if you have time for this.
@benc-db if I'm understanding correctly, this is the type of problem we want to address with multi-statement transactions, right?
Yes, implementing transaction support will ensure the commit is atomic; until the new version of the table is available, all readers will see the old version of the table (assuming a reasonable implementation of transactions).
Thanks for the quick reply! Just to make things clear for me, are you talking about the coordinated commits feature coming in Delta 4.0?
This feels like a massive change, unless Databricks intends to simplify things? I have not seen any announcements so far. I still think this is a bug in the meantime; what do you think about my proposals for a fix? I can easily try to implement the solution I have in mind if you think it could be a good idea.
On the other hand, I had not noticed it before, but using the use_safer_relation_operations = True config can do the job, since the "alters" are almost instantaneous 🤔 Any idea why this defaults to False, by any chance? However, I feel like it slows things down quite a bit!
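For reference, this is how I am setting it; I am assuming it works like any other model-level config (the flag name comes from dbt-databricks, the placement here is my assumption):

```sql
{# Assumption: use_safer_relation_operations can be set in the model's config
   block like any other dbt-databricks model config. #}
{{
  config(
    materialized = 'table',
    use_safer_relation_operations = true,
  )
}}

select 'toto' as name
```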
Thanks for your work and your replies 🙏
Yes, use_safer_relation_operations seeks to minimize down time. The benefit of us being able to use transactions is getting that zero downtime while keeping unity catalog history intact. I think the Databricks implementation is probably being built upon coordinated commits, so yes, the implementation will be simplified before we adopt it: https://www.databricks.com/dataaisummit/session/multi-format-multi-table-multi-statement-transactions-unity-catalog
Alright, thanks a lot for the video!! Cannot wait to see it fully integrated in Databricks!!! I just realized that the use_safer_relation_operations flag makes us lose table history...
In the meantime, what do you think about one of these options:
- Rework the `table` and `incremental` (when full refreshing) strategies so they do an `insert overwrite` instead of either `safe_relation_replace` or `create_table_at` when the table already exists. We could remove `use_safer_relation_operations`, and it is an atomic operation that preserves existing constraints! However, we would need to handle all schema changes at each run.
- Way easier (but we still can't handle the full-refresh problem for incremental models): allow the `insert_overwrite` strategy for SQL warehouses and add a clear warning so users understand what it does (also in the documentation, I guess); see the sketch after this list.
- Do nothing for now and wait for MSTs to reach GA (I can write my own materialization in the meantime).
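To make the second option concrete, here is a minimal sketch of how a model opts into the existing strategy; the `incremental_strategy = 'insert_overwrite'` config already exists for clusters, and the open point is allowing it on SQL warehouses:

```sql
{# insert_overwrite replaces the table's data (or the matching partitions)
   in a single atomic operation; today it is rejected on SQL warehouses. #}
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
  )
}}

select 'toto' as name
```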
Thanks again for your time 🙏
@benc-db I also hit this issue and solved the downtime using use_safer_relation_operations. But the __dbt_stg tables persist; is that the normal behavior?
@alxsbn that is not expected, can you raise as a separate issue? I would only expect that to happen if the task fails (in that case it sticks around so that you can investigate the failure).
btw @benc-db, two people from Databricks (Ben Mathew and Prakhar Jain) told me that DDL operations won't be supported in MSTs.
I think we should find another way of dealing with table creation with constraints. use_safer_relation_operations does not work then...
Still available for brainstorming if needed !
> telling me that DDL operations won't be supported in MSTs.
Can you provide more details?
@benc-db You're right, the staging tables disappear after another run. It's still annoying to have them, since we build refined tables in a specific catalog.schema that is consumed by our analytics tool, so from time to time people can see tables like this. Could it be interesting to push them to a technical schema instead, wdyt?