dbt-databricks
Advanced MERGE incremental settings
Describe the feature
Currently, merge conditions are not supported in the merge incremental strategy.
In Databricks terminology, merge conditions are the matched_condition and not_matched_condition clauses of the MERGE definition.
Taken from the Databricks docs:
MERGE INTO target_table_name [target_alias]
USING source_table_reference [source_alias]
ON merge_condition
{ WHEN MATCHED [ AND matched_condition ] THEN matched_action |
WHEN NOT MATCHED [ AND not_matched_condition ] THEN not_matched_action } [...]
As of now, the full power of the MERGE statement is not supported.
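For illustration, the kind of statement this feature would let dbt generate could look like the sketch below (table names, column names, and conditions are made up for the example; the AND clauses are exactly the matched / not-matched conditions the current merge strategy cannot express):

MERGE INTO analytics.dim_entity AS t
USING staging.dim_entity_updates AS s
ON t.entity_id = s.entity_id
-- update only when the payload actually changed
WHEN MATCHED AND s.properties <> t.properties THEN
  UPDATE SET properties = s.properties, updated_at = s.updated_at
-- insert only records that pass an additional business check
WHEN NOT MATCHED AND s.is_active = true THEN
  INSERT (entity_id, properties, updated_at)
  VALUES (s.entity_id, s.properties, s.updated_at)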
Describe alternatives you've considered
Writing a customized adapter with its own merge macro to perform such operations, and adding the conditions to the model configuration.
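One possible shape for that configuration, purely as a discussion sketch: the matched_condition / not_matched_condition keys below are hypothetical names, not existing dbt-databricks config, while DBT_INTERNAL_SOURCE / DBT_INTERNAL_DEST are the aliases dbt's default merge macro uses for the source and target relations.

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='entity_id',
    matched_condition='DBT_INTERNAL_SOURCE.properties <> DBT_INTERNAL_DEST.properties',
    not_matched_condition='DBT_INTERNAL_SOURCE.is_active = true'
  )
}}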
Additional context
Databricks is more and more often used to support a variety of data operations that previously relied on analytical MPPs: specifically, preparing DWH-like normalized models for long-term entity-oriented storage and building denormalized data marts.
One of the drawbacks of MERGE is that an update is essentially a DELETE + INSERT, so every updated record carries extra cost. It is therefore a widely used technique to minimise such updates with additional conditions that apply only to already observed (matched) records, and thus cannot be expressed as merge predicates (in dbt terminology).
Moreover, there are business cases that require this. For example, I track an object and its properties as an entry of the form <o, P>. Once I have observed a specific instance of such a record (<1, A>), I may want to skip all subsequent updates as long as the properties do not change, and only perform the update when I see, for example, <1, B>.
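In MERGE terms that rule is just a matched condition on the property column, something like (column names assumed):

WHEN MATCHED AND s.P <> t.P THEN UPDATE SET P = s.P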
Who will this benefit?
Anyone who uses MERGE as a strategy for model updates and requires additional business checks on the records before making updates / inserts.
Are you interested in contributing this feature?
Probably, but given the variety of possible solutions I would appreciate discussing the solution design with the maintainers first.
Any updates on this? Seems like an interesting feature.
No updates. Just to note that there is a workaround that currently might work: with dbt you have to manually rebuild the whole update set in your model. That makes the model code unbearably large, but it works.
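Roughly, the workaround is an incremental model that filters out unchanged rows itself, so the matched condition is emulated in the model body before dbt runs its MERGE (names below are illustrative):

{{ config(materialized='incremental', incremental_strategy='merge', unique_key='entity_id') }}

with src as (
    select entity_id, properties, updated_at
    from {{ ref('stg_entities') }}
)

select src.*
from src
{% if is_incremental() %}
left join {{ this }} as tgt
    on src.entity_id = tgt.entity_id
where tgt.entity_id is null                -- brand new records
   or src.properties <> tgt.properties     -- records whose payload actually changed
{% endif %}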
For me this just highlights the cases where the dbt modelling approach falls short. I also think null-aware joins are missing as a feature: MERGE is frequently used for denormalized "cubes" where missing dimensions are allowed to be NULL, so using <=> instead of = in the merge predicates is also required. And I don't know a workaround for this.
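For example, on a mart keyed by nullable dimensions the desired join condition would be something like (column names assumed):

ON t.dim_country <=> s.dim_country
AND t.dim_channel <=> s.dim_channel

whereas the = comparison that dbt generates today never matches rows where a dimension is NULL on both sides.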
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.