dbt-databricks
replace_where fails when jinja statement is used in both predicate and select
Describe the bug
If an incremental model using the replace_where strategy uses a Jinja expression in both the incremental predicate and the select, a Runtime Error is raised when the expression's value changes between two runs of the model, even though nothing has changed in the model itself or in any macros it calls.
We ran into this when we created a macro that returns the previous month and used it in the incremental predicate. When the month changed, we got a Runtime Error when we tried to run the models. It appears the previous month was used in the INSERT part of the generated SQL statement while the current month was used in the SELECT part.
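For illustration, our setup was roughly of the following shape (the previous_month macro, the model file, and the month column below are hypothetical stand-ins for our actual code, which follows the same pattern):
-- macros/previous_month.sql (hypothetical)
{% macro previous_month() %}
    {#- The first day of the current month minus one day falls in the previous month. -#}
    {{ return((run_started_at.replace(day=1) - modules.datetime.timedelta(days=1)).strftime('%Y-%m')) }}
{% endmacro %}

-- models/monthly_model.sql (hypothetical)
{%- set month = previous_month() -%}
{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates="month = '" ~ month ~ "'"
    )
}}
select '{{ month }}' as month, current_timestamp() as loaded_at
As long as previous_month() returns the same value on every run this works; the failure only shows up on the first run after the month rolls over.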
To someone who doesn't know the internals of dbt, this looks like different caching happening for the SELECT part and the INSERT part of the model: when I look at the code in target/compiled, the new value of the Jinja expression is used, but in the INSERT statement in target/run the old value is used.
Steps To Reproduce
Create a model with the following contents; I called it test_model.sql:
{%- set predicate = run_started_at -%}
{% set INCREMENTAL_PREDICATE %}
my_column = '{{ predicate }}'
{%- endset %}
{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=INCREMENTAL_PREDICATE,
        alias='test_model'
    )
}}
select '{{ predicate }}' as my_column
First, run the model once to do the initial load, e.g. dbt run -s test_model. Then run it two more times normally, i.e. without --full-refresh. Observe that the final run fails with a Runtime Error. Looking at the compiled code in target/run/.../test_model.sql, we can see that the timestamp in the INSERT part of the code is the timestamp of the first incremental run, while in target/compiled/.../test_model.sql the timestamp in the SELECT part of the code is the timestamp of the second incremental run.
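To make the mismatch concrete, the tail of the statement in target/run is, schematically, something like the following (a hand-written paraphrase using the timestamps from the log output below, not the exact SQL that dbt generates):
insert into stg_gza.test_model
replace where my_column = '2023-12-19 14:15:36.319930+00:00'  -- timestamp of the first incremental run (stale predicate)
select '2023-12-19 14:16:32.485265+00:00' as my_column        -- timestamp of the second incremental run (fresh select)
The rows produced by the select no longer satisfy the stale replace where predicate, which matches the Runtime Error in the log output below.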
Expected behavior
I would expect the timestamp of the second incremental run to be used in both the INSERT and SELECT parts of the model.
Screenshots and log output
Output from the failed run:
> dbt run -s test_model (dbt-17)
14:16:32 Running with dbt=1.7.4
14:16:33 Registered adapter: databricks=1.7.3
14:16:35 Found 491 models, 911 tests, 443 sources, 0 exposures, 0 metrics, 995 macros, 0 groups, 0 semantic models
14:16:35
14:17:02 Concurrency: 9 threads (target='test01')
14:17:02
14:17:02 1 of 1 START sql incremental model stg_gza.test_model .......................... [RUN]
14:17:04 1 of 1 ERROR creating sql incremental model stg_gza.test_model ................. [ERROR in 2.03s]
14:17:05
14:17:05 Finished running 1 incremental model in 0 hours 0 minutes and 30.80 seconds (30.80s).
14:17:06
14:17:06 Completed with 1 error and 0 warnings:
14:17:06
14:17:06 Runtime Error in model test_model (models/test_model.sql)
Data written out does not match replaceWhere '(my_column = '2023-12-19 14:15:36.319930+00:00')'.
CHECK constraint EXPRESSION(('my_column = 2023-12-19 14:15:36.319930+00:00)) (my_column = '2023-12-19 14:15:36.319930+00:00') violated by row with values:
- my_column : 2023-12-19 14:16:32.485265+00:00
14:17:06
14:17:06 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
System information
The output of dbt --version:
> dbt --version
Core:
- installed: 1.7.4
- latest: 1.7.4 - Up to date!
Plugins:
- databricks: 1.7.3 - Up to date!
- spark: 1.7.1 - Up to date!
The operating system you're using: macOS version 13.6.3, but also reproduced on Windows 11 and on a Linux server (I'm unsure of the exact distribution).
The output of python --version: Python 3.10.13
Thanks for reporting. Does anything jump out if you look at the compiled version of the sql? Jinja is crazy...