
[Bug] Unit tests not working for incremental models

Open CraigWilson-ZOE opened this issue 1 year ago • 9 comments

Is this a new bug in dbt-core?

  • [X] I believe this is a new bug in dbt-core
  • [X] I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When running a unit test for an incremental model, I am getting the following error:

14:41:49    Database Error in unit_test unit_test__int_barcode_scanned_events_incremental_mode (models/silver/intermediate/food_logging/tests.yml)
  Syntax error: Unexpected "." at [18:44]

Expected Behavior

The unit tests to pass

Steps To Reproduce

  1. Environment is:
  • dbt 1.8.6
  • dbt bigquery 1.8.2
  • dbt-adapters 1.5.0
  • dbt-utils 1.3.0
  2. Incremental model configuration:
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        unique_key='event_id',
        partition_by={
            "field": "date_day",
            "data_type": "date"
        }
    )
}}
  3. Unit test config:
unit_tests:
  - name: unit_test__int_barcode_scanned_events_incremental_mode
    description: Unit test for int_barcode_scanned_events in incremental mode
    model: int_barcode_scanned_events
    overrides:
      macros:
        is_incremental: true
    given:
        - input: ref('stg_insights_mixpanel_event')
          format: csv
          fixture: raw_mixpanel_events
        - input: this
          format: csv
          fixture: incremental_int_barcode_scanned_events
    expect:
        format: csv
        fixture: expected__int_barcode_scanned_events

Relevant log output

...
15:06:35  3 of 3 ERROR int_barcode_scanned_events::unit_test__int_barcode_scanned_events_incremental_mode  [ERROR in 1.31s]
...
15:06:58    Database Error in unit_test unit_test__int_barcode_scanned_events_incremental_mode (models/silver/intermediate/food_logging/tests.yml)
  Syntax error: Unexpected "." at [18:44]

Environment

- OS: macos 14.3.1
- Python: Python 3.12.2
- dbt: 1.8.6

Which database adapter are you using with dbt?

bigquery

Additional Context

The SQL that seems to be produced and is causing the error is the following:

/* {"app": "dbt", "dbt_version": "1.8.5", "profile_name": "default", "target_name": "development", "node_id": "unit_test.dbt_bq_elt.int_barcode_scanned_events.unit_test__int_barcode_scanned_events_incremental_mode"} */


      
        
            select distinct
                table_schema,
                table_name,
                
            case table_type
                when 'BASE TABLE' then 'table'
                when 'EXTERNAL TABLE' then 'external'
                when 'MATERIALIZED VIEW' then 'materializedview'
                else lower(table_type)
            end as `table_type`


            from `dbt-development-joinzoe`..INFORMATION_SCHEMA.TABLES
            where lower(table_name) like lower ('')
                and lower(table_name) not like lower ('')

CraigWilson-ZOE avatar Sep 10 '24 15:09 CraigWilson-ZOE

Thanks for reporting this @CraigWilson-ZOE !

If you comment out the unit test completely, does it still give you that error?

The SQL that is creating that syntax error looks like it could be coming from bigquery__get_catalog or bigquery__get_catalog_relations. Did you happen to override either of those macros or any others? If so, which ones?

See below for a simplified version of the situation you described. It worked without error for me, so I'm curious if it works for you or not?

Example

models/int_barcode_scanned_events.sql

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        unique_key='event_id',
        partition_by={
            "field": "date_day",
            "data_type": "date"
        }
    )
}}

select 1 as event_id, cast('2002-02-02' as date) as date_day

models/_unit_tests.yml

unit_tests:
  - name: unit_test__int_barcode_scanned_events_incremental_mode
    description: Unit test for int_barcode_scanned_events in incremental mode
    model: int_barcode_scanned_events
    overrides:
      macros:
        is_incremental: true
    given:
      - input: this
        rows:
          - {event_id: 1, date_day: 2001-01-01}
    expect:
        rows:
          - {event_id: 1, date_day: 2002-02-02}

Commands to run:

dbt build -s int_barcode_scanned_events

Output:

$ dbt build -s int_barcode_scanned_events    
 
13:41:43  Running with dbt=1.8.6
13:42:10  Registered adapter: bigquery=1.8.2
13:42:11  Found 1 model, 473 macros, 1 unit test
13:42:11  
13:42:27  Concurrency: 10 threads (target='blue')
13:42:27  
13:42:27  1 of 2 START unit_test int_barcode_scanned_events::unit_test__int_barcode_scanned_events_incremental_mode  [RUN]
13:42:33  1 of 2 PASS int_barcode_scanned_events::unit_test__int_barcode_scanned_events_incremental_mode  [PASS in 5.48s]
13:42:33  2 of 2 START sql incremental model dbt_dbeatty.int_barcode_scanned_events ...... [RUN]
13:42:40  2 of 2 OK created sql incremental model dbt_dbeatty.int_barcode_scanned_events . [SCRIPT (56.0 Bytes processed) in 7.21s]
13:42:40  
13:42:40  Finished running 1 unit test, 1 incremental model in 0 hours 0 minutes and 29.38 seconds (29.38s).
13:42:40  
13:42:40  Completed successfully
13:42:40  
13:42:40  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

dbeatty10 avatar Sep 11 '24 13:09 dbeatty10

can I pick it ?

tsafacjo avatar Sep 11 '24 18:09 tsafacjo

@tsafacjo We're not yet sure if this is a bug in dbt-core or if the root cause is something else.

If you're looking for a place to contribute to dbt, there's a list of issues we've labeled as "help wanted" here.

dbeatty10 avatar Sep 11 '24 21:09 dbeatty10

ok thanks.

tsafacjo avatar Sep 11 '24 21:09 tsafacjo

Sorry for the late reply, I have been on vacation.

If you comment out the unit test completely, does it still give you that error?

I have commented out the unit test and I do not get the error anymore.

The SQL that is creating that syntax error looks like it could be coming from bigquery__get_catalog or bigquery__get_catalog_relations. Did you happen to override either of those macros or any others? If so, which ones?

No we do not override any of those macros.

See below for a simplified version of the situation you described. It worked without error for me, so I'm curious if it works for you or not?

I have tried with the super simple version and it passed without error.

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        unique_key='event_id',
        partition_by={
            "field": "date_day",
            "data_type": "date"
        }
    )
}}

select 1 as event_id, cast('2002-02-02' as date) as date_day
  - name: unit_test__int_barcode_test_increment
    description: Unit test for int_barcode_test
    model: int_barcode_test
    overrides:
      macros:
        is_incremental: true
    given:
      - input: this
        rows:
          - {event_id: 1, date_day: 2001-01-01}
    expect:
        rows:
          - {event_id: 1, date_day: 2002-02-02}

NOTE: I created a new model so I could run both at the same time. My original model still fails with the same error.

CraigWilson-ZOE avatar Sep 30 '24 08:09 CraigWilson-ZOE

Also worth posting here is my packages.yml. I have removed all unnecessary packages and am down to the bare minimum for the whole project:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.0
  - package: fivetran/mailchimp_source
    version: [">=0.4.0", "<0.6.0"]
  - package: fivetran/mailchimp
    version: [">=0.7.0", "<0.10.0"]
  - package: fivetran/zendesk_source
    version: [">=0.10.0", "<0.14.0"]
  - package: fivetran/zendesk
    version: [">=0.10.0", "<0.14.0"]
  - package: fivetran/google_ads_source
    version: [">=0.9.1", "<0.11.0"]
  - package: fivetran/google_ads
    version: [">=0.9.1", "<0.11.0"]
  - package: fivetran/facebook_ads
    version: [">=0.6.0", "<0.8.0"]
  - package: fivetran/facebook_ads_source
    version: [">=0.6.0", "<0.8.0"]

CraigWilson-ZOE avatar Sep 30 '24 10:09 CraigWilson-ZOE

I have tried with the super simple version and it passed without error.

Okay! The next step is to try to isolate whichever piece(s) of your model that is triggering the syntax error when it is executed via dbt unit tests.

Are you able to tweak that super simple version so that it is closer and closer to your actual model until you find something that breaks the simplified version?

dbeatty10 avatar Sep 30 '24 17:09 dbeatty10

Hi @CraigWilson-ZOE, I understand that you are sharing the compiled SQL rather than the model, correct? Could you please share the model with me? I’d like to see if it’s similar to the other issues I was looking into in this PR: https://github.com/dbt-labs/dbt-core/pull/10849. If the model has any kind of var, env_var, or other things that come from context, this can cause this kind of error during the unit test phase.

devmessias avatar Oct 16 '24 22:10 devmessias

Here is why I'm saying that, @CraigWilson-ZOE: your compiled SQL is wrong due to the double "..", which is what the unit test will produce if you have a model like this

select
    count(*) as result
from {{target.database}}.{{env_var('schema')}}.INFORMATION_SCHEMA.TABLES

Result:

select
    count(*) as result
from dbt.INFORMATION_SCHEMA.TABLES
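The mechanism can be sketched outside of dbt (illustrative only; dbt's real rendering goes through Jinja): when a context value such as an env var is undefined at unit-test compilation time, it renders as an empty string, and the three-part relation path collapses around it.

```python
import os

# Illustrative sketch, not dbt internals: an unset env var renders as an
# empty string, so "database.schema.TABLE" collapses to "database..TABLE".
os.environ.pop("schema", None)  # make sure the var is unset, as in the bug
database = "dbt"
schema = os.environ.get("schema", "")  # empty during unit-test compilation

relation = f"{database}.{schema}.INFORMATION_SCHEMA.TABLES"
print(relation)  # dbt..INFORMATION_SCHEMA.TABLES
```

The resulting `..` is exactly the syntax BigQuery rejects in the reported error.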

devmessias avatar Oct 16 '24 22:10 devmessias

Are there any updates on this? I have the same problem @dbeatty10

kangshung avatar Oct 22 '24 13:10 kangshung

I think it's a good idea if you post the model.sql file you're having issues with, @kangshung

devmessias avatar Oct 22 '24 13:10 devmessias

@devmessias sure. Problem appears when both source and target model have same names, here "example". I reproduced the problem below. This model

{{
    config(
        materialized='incremental',
        incremental_strategy='merge'
    )
}}

-- IMPORTS https://discourse.getdbt.com/t/ctes-are-passthroughs-some-research/155
with cte1 as (
    select * from {{ source('my_dataset', 'example') }}
),

-- BUSINESS LOGIC
cte2 as (
    select
        id,
        TIMESTAMP_MILLIS(ts) as time_stamp,
        val
    from cte1
)

-- FINAL SELECT
select * from cte2


{% if is_incremental() %}
    where time_stamp > (select COALESCE(MAX(time_stamp), '1900-01-01') from {{ this }})
{% endif %}

and unit test like this:

unit_tests:
  - name: incremental_bug_example6
    model: example
    overrides:
      macros:
        is_incremental: true
    given:
      - input: source('my_dataset', 'example')
        rows:
          - { id: 1, ts: 1577836800000, val: 'mytext' }
          - { id: 2, ts: 1577923200000, val: 'mytext' }
      - input: this
        # contents of current model
        rows:
          - { id: 1, time_stamp: '2020-01-01T00:00:00', val: 'mytext' }
    expect:
      # what will be inserted/merged into model
      rows:
        - { id: 2, time_stamp: '2020-01-02T00:00:00', val: 'mytext' }

fails to generate a proper test, because both "example" nodes overwrite each other, and the final test query has only 1 fixture, instead of 2:

-- IMPORTS https://discourse.getdbt.com/t/ctes-are-passthroughs-some-research/155
with  __dbt__cte__example as (

-- Fixture for example
select safe_cast(1 as INT64) as id, safe_cast('''2020-01-01T00:00:00''' as TIMESTAMP) as time_stamp, safe_cast('''mytext''' as STRING) as val
), cte1 as (
    select * from __dbt__cte__example
),

-- BUSINESS LOGIC
cte2 as (
    select
        id,
        TIMESTAMP_MILLIS(ts) as time_stamp,
        val
    from cte1
)

-- FINAL SELECT
select * from cte2



    where time_stamp > (select COALESCE(MAX(time_stamp), '1900-01-01') from __dbt__cte__example)

You can see that __dbt__cte__example already has the time_stamp column, so TIMESTAMP_MILLIS(ts) transformation will always fail.

It was tested on dbt=1.8.8 and bigquery=1.8.3. I would expect it to work properly or at least give any reasonable exception that this is not supported.
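The overwrite described above can be sketched in miniature (hypothetical, not dbt's actual code): if fixtures end up keyed by the bare relation name, the fixture for `this` silently replaces the fixture for the source.

```python
# Hypothetical sketch of the name collision, not dbt's real fixture handling:
# both the source and the target model are named "example", so the second
# fixture assignment silently overwrites the first.
fixtures = {}

# fixture for source('my_dataset', 'example')
fixtures["example"] = [{"id": 1, "ts": 1577836800000, "val": "mytext"}]

# fixture for `this` -- same key, so the source fixture is lost
fixtures["example"] = [{"id": 1, "time_stamp": "2020-01-01T00:00:00", "val": "mytext"}]

print(len(fixtures))  # 1 -- only one __dbt__cte__example gets rendered
```

That matches the compiled SQL above, which contains a single `__dbt__cte__example` (with `time_stamp`, from `this`) instead of two distinct fixtures.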

kangshung avatar Oct 24 '24 11:10 kangshung

Hi @kangshung, I see! So, this seems to be a different issue from the one originally reported here, right? From my testing, it doesn’t appear to be related to the incremental setting itself, but rather the root cause is a name collision problem between the source and target model. I agree this is an issue. Would it make sense to open a separate issue describing this specific problem?

devmessias avatar Oct 25 '24 14:10 devmessias

This is a minimal example to obtain the same issue pointed by @kangshung

#_unit_tests.yaml
version: 2

unit_tests:
  - name: test_2
    model: my_model
    given:
      - input: source('source_name', 'my_model')
        rows:
          - { id: 1, my_value: 111, name: 'mytext' }
          - { id: 2, my_value: 222, name: 'mytext' }
      - input: this
        rows:
          - { id: 2, my_value: 333, name: 'mytext' }
          - { id: 1, my_value: 666, name: 'mytext' }

    expect:
      rows:
        - { id: 1, my_value: 111, name: 'mytext' }

-- my_model.sql
WITH something as (
  select
  *
  from {{ source("source_name", "my_model") }}
)
select
  *
from  something
    where id = (
      select
        id
      from {{this}}
      order by my_value desc
      limit 1
    )

# _source.yaml

version: 2

sources:
  - name: 'source_name'
    schema: 'main'
    tables:
      - name: example
      - name: my_model

Test will pass if we change the source to example instead of my_model

devmessias avatar Oct 25 '24 16:10 devmessias

Ok, no need to create a new issue; your problem has already been described here: https://github.com/dbt-labs/dbt-core/issues/10433

devmessias avatar Oct 25 '24 17:10 devmessias

~Thank you all for your research finding reproduction case for this scenario 🏆~

~Closing as a duplicate of https://github.com/dbt-labs/dbt-core/issues/10433.~

Edit: re-opened in https://github.com/dbt-labs/dbt-core/issues/10686#issuecomment-2438971930

dbeatty10 avatar Oct 25 '24 22:10 dbeatty10

Oops, I closed it too hastily!

@kangshung as @devmessias said, what you reported in https://github.com/dbt-labs/dbt-core/issues/10686#issuecomment-2435041601 is a duplicate of https://github.com/dbt-labs/dbt-core/issues/10433

But we still don't have steps to reproduce for what @CraigWilson-ZOE originally reported in this issue.

@CraigWilson-ZOE does https://github.com/dbt-labs/dbt-core/issues/10686#issuecomment-2418083913 look similar to your model that is having the issues with unit testing?

dbeatty10 avatar Oct 25 '24 22:10 dbeatty10

Hi All,

Sorry I have been away from my computer for a while.

Now that I am back and had some time to dive into this, it seems like a call to an internal dbt-core macro is failing. The SQL that is being executed is:

            select distinct
                table_schema,
                table_name,
                
            case table_type
                when 'BASE TABLE' then 'table'
                when 'EXTERNAL TABLE' then 'external'
                when 'MATERIALIZED VIEW' then 'materializedview'
                else lower(table_type)
            end as `table_type`


            from `development`..INFORMATION_SCHEMA.TABLES
            where lower(table_name) like lower ('')
                and lower(table_name) not like lower ('')

I believe this is coming from get_relations_by_pattern. We use this to generate some SQL for the incremental model, and it seems like the schema argument for this macro isn't getting set when dbt runs a unit test. So it isn't directly related to the unit tests themselves, but is a side effect of the overall environment that dbt sets up during unit test execution.

Is there a reason that the schema wouldn't be set in this instance?

I have checked via logging the output of the target and this objects; the this object is empty for all parts of the model information, whereas the target object does have the information for the most part.

I do think this is a bug and needs to be fixed; otherwise we are saying that some macros cannot be used in an incremental model.

CraigWilson-ZOE avatar Oct 29 '24 13:10 CraigWilson-ZOE

Hi @CraigWilson-ZOE, this seems similar to https://github.com/dbt-labs/dbt-core/issues/10891 and #10139. This is not the raw code for the model, right? Can you share the raw code? As I discuss in https://github.com/dbt-labs/dbt-core/issues/10908 maybe we should allow overriding dynamic variables

devmessias avatar Oct 29 '24 14:10 devmessias

In the meantime, testing a simpler non-incremental model that uses the same macro might help. I tend to believe that your issue isn't related to whether the model is incremental.

devmessias avatar Oct 29 '24 14:10 devmessias

Hi @CraigWilson-ZOE, this seems similar to https://github.com/dbt-labs/dbt-core/issues/10891 and https://github.com/dbt-labs/dbt-core/issues/10139. This is not the raw code for the model, right? Can you share the raw code? As I discuss in https://github.com/dbt-labs/dbt-core/issues/10908 maybe we should allow overriding dynamic variables

Yes it does look the same as https://github.com/dbt-labs/dbt-core/issues/10891 , thanks!

I am not sure about overriding dynamic variables versus just having them work the way many people expect, or providing an alternative.

In the meantime, testing a simpler model that uses the same macro in a non-incremental model might help. I tend to believe that your issue isn't related to the model being incremental or not

I agree that it isn't related to it being incremental or not, it is due to the above issue.

CraigWilson-ZOE avatar Oct 29 '24 15:10 CraigWilson-ZOE

I don't think there is an easy way to do that, as pointed out in https://github.com/dbt-labs/dbt-core/issues/8499. The CTE approach would solve some limitations but would impact performance, and would create some additional problems according to https://github.com/dbt-labs/dbt-core/issues/8499#issuecomment-2213188607. Overriding dynamic variables via the YAML would be easier to implement as a new feature, and we could still keep the performance of unit tests, similar to the mocked behavior in Python's unittest.

But in your case, if you have just one macro call could you try to override the macro?

devmessias avatar Oct 29 '24 16:10 devmessias

@CraigWilson-ZOE could you share your line of code that you are using for dbt_utils.get_relations_by_pattern so we can see the schema argument?

I'm trying to reproduce the same error you are getting, but I'm not able when I use target.schema for the schema within dbt_utils.get_relations_by_pattern.

dbeatty10 avatar Oct 29 '24 22:10 dbeatty10

There are so many issues whose root cause is this limitation, and users are incorrectly associating it with other things. Wouldn't it be helpful to improve the unit testing docs to emphasize this limitation, @dbeatty10?

devmessias avatar Oct 31 '24 13:10 devmessias

@dbeatty10

Here is a breakdown of the SQL that I have and what is causing the issue. This is a change I made so that getting the latest partition data in BigQuery is much more efficient than querying the target table directly. It has saved a lot of processing on our large tables.

Macro to get the latest partition info:

{% macro get_latest_partition_value(table, partition_data_type=none) %}
    {% if not execute %}
        {{ return(modules.datetime.date.fromisoformat('2023-01-01')) }}
    {% endif %}
    {% if not dbt_utils.get_relations_by_pattern(table.schema, table.name) %}
        {{ return(modules.datetime.date.fromisoformat('1900-01-01')) }}
    {% endif %}
    {% set node = graph.nodes.get("model." + project_name + "." + table.name) %}
    {% if node is none %}
        {{ return(none) }}
    {% endif %}
    {% if not partition_data_type %}
        {% set raw_partition_by = config.get('partition_by', none) %}
        {% if not raw_partition_by %}
            {% do exceptions.raise_compiler_error("partition_by must be set in the model config to get latest partition value") %}
        {% endif %}
        {% set partition_by = adapter.parse_partition_by(raw_partition_by) %}
        {% set partition_data_type = partition_by.data_type %}
    {% endif %}

    {%- set query -%}
        SELECT MAX(
        {% if partition_data_type == 'date' %}
            PARSE_DATE("%Y%m%d",partition_id)
        {% elif partition_data_type == 'timestamp'%}
            PARSE_TIMESTAMP("%Y%m%d",partition_id)
        {% elif partition_data_type == 'datetime' %}
            PARSE_DATETIME("%Y%m%d",partition_id)
        {% elif partition_data_type == 'int64' %}
            CAST(partition_id AS INT64)
        {% else %}
            {% do exceptions.raise_compiler_error("partition_by data type must be date, timestamp, datetime or int64") %}
        {% endif %}
        )
        FROM {{table.database}}.{{table.schema}}.INFORMATION_SCHEMA.PARTITIONS
        WHERE 
            table_name = '{{table.name}}' 
            AND partition_id IS NOT NULL 
            AND partition_id != '__NULL__' 
            AND partition_id != '__UNPARTITIONED__'
            AND total_rows > 0
    {% endset %}

    {% set results = run_query(query) %}

    {% set partition_max_value = results.columns[0].values()[0] %}
    {{ return(partition_max_value) }}

{% endmacro %}
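For intuition, BigQuery's `PARSE_DATE("%Y%m%d", partition_id)` over `INFORMATION_SCHEMA.PARTITIONS` behaves like `strptime` over the partition IDs; a quick sketch of picking the latest valid partition, with made-up partition_id values:

```python
from datetime import datetime

# Illustrative only: fabricated partition_id values as they appear in
# INFORMATION_SCHEMA.PARTITIONS for a date-partitioned table.
partition_ids = ["20240110", "20240115", "__NULL__", "__UNPARTITIONED__"]

# Mirror the macro's WHERE clause: drop the sentinel partitions,
# then parse the remaining IDs and take the max, like MAX(PARSE_DATE(...)).
valid = [p for p in partition_ids if p not in ("__NULL__", "__UNPARTITIONED__")]
latest = max(datetime.strptime(p, "%Y%m%d").date() for p in valid)
print(latest)  # 2024-01-15
```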

And it is used like this (example):

{% set partition_date = get_latest_partition_value(this) %}
{{
    config(
        materialized='incremental',
        unique_key='correlation_id',
        partition_by={
            "field": "search_date",
            "data_type": "date"
        },
        cluster_by=["message_type"],
    )
}}

WITH

all_search_logs AS
(
    SELECT DISTINCT
        correlation_id,
        timestamp AS search_time,
        username,
        payload,
        origin_service,
        message_type,
        DATE(timestamp) AS search_date,
        COALESCE(CAST(JSON_EXTRACT_SCALAR(metadata, '$.with_scores') AS BOOL), TRUE) AS with_scores
    FROM
        {{ source('pubsub_dataflow', 'pubsub_to_bigquery_partitioned') }}
    {% if target.name != 'production' %}
        WHERE
            timestamp >= (CAST(DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AS TIMESTAMP))
    {% else %}
        {% if is_incremental() and partition_date is not none %}
            WHERE
                timestamp >= TIMESTAMP('{{ partition_date }}')
        {% endif %}
    {% endif %}
)

SELECT * FROM all_search_logs

The table parameter in the macro during the unit test execution is NULL/NONE whereas in normal running or compilation it is correctly set as the target table.

CraigWilson-ZOE avatar Oct 31 '24 15:10 CraigWilson-ZOE

This will not work with the current limitations of unit tests. However, you can override the macro since it is called only once. In that case, a feature that allows overriding a dynamic variable is not necessary.

devmessias avatar Oct 31 '24 16:10 devmessias

Thanks for providing those details @CraigWilson-ZOE ! I was able to get the same error as you with a simplified version of your scenario. And I was also able to resolve it by providing an override for the get_latest_partition_value macro in the unit test definition.

Explanation

It looks like these lines are the ones that won't work the way you want when you execute them in a unit test:

        FROM {{table.database}}.{{table.schema}}.INFORMATION_SCHEMA.PARTITIONS
        WHERE 
            table_name = '{{table.name}}' 

There's an explanation in https://github.com/dbt-labs/dbt-core/issues/10891#issuecomment-2444663406.

The TLDR is that within the context of unit tests, this is a str rather than a Relation, so things like {{table.schema}} will be empty.

Workaround

Could you try providing an override for your get_latest_partition_value macro like suggested by @devmessias ?

It would look something like this:

  - name: unit_test__int_barcode_scanned_events_incremental_mode
    description: Unit test for int_barcode_scanned_events in incremental mode
    model: int_barcode_scanned_events
    overrides:
      macros:
        is_incremental: true
        get_latest_partition_value: SOMETHING_HERE

dbeatty10 avatar Nov 01 '24 21:11 dbeatty10
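For a concrete shape, here is a hedged sketch of what that override could look like, following the same YAML convention as above. The literal return value is hypothetical; the idea is to pick a fixed date so the unit test never reaches INFORMATION_SCHEMA:

```yaml
# Hypothetical sketch: override the macro with a fixed literal return value
# so the unit test is deterministic. The date below is made up.
unit_tests:
  - name: unit_test__int_barcode_scanned_events_incremental_mode
    model: int_barcode_scanned_events
    overrides:
      macros:
        is_incremental: true
        get_latest_partition_value: "2023-01-01"
```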

Ah that makes a lot of sense, and thanks for the example and explanation. Happy for this to be closed as I think a lot of it is covered in other issues as linked above.

CraigWilson-ZOE avatar Nov 04 '24 08:11 CraigWilson-ZOE

Thanks again for raising this and helping home in on the root causes @CraigWilson-ZOE. Going to close this in favor of https://github.com/dbt-labs/dbt-core/issues/10891 and other relevant issues linked above.

dbeatty10 avatar Nov 04 '24 22:11 dbeatty10