dbt-bigquery
[ADAP-538] [Bug] Date incremental tables scanning more of table than expected
Is this a new bug in dbt-bigquery?
- [X] I believe this is a new bug in dbt-bigquery
- [X] I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
What's the problem?
I think I'm seeing something similar to this previous issue.
I have a table where every day's partition is approximately 2 GB. Config:
{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "event_date_dt",
            "data_type": "date",
        },
        cluster_by=["event_name"],
    )
}}
I'm doing incremental builds on this table in order to reduce the size of runs. However, when the incremental build runs, it appears to scan more of the table than I'd expect.
Here's the whole run (with the delete tmp step stripped so I could inspect the tmp table).
The initial incremental build picks up 3 days here which is approximately 6GB (as expected).
It then picks the 3 days to be replaced:
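(The partition-selection step generated by the insert_overwrite strategy looks roughly like the sketch below; the exact SQL varies by dbt-bigquery version, and the table and column names are taken from the config above.)

-- sketch: collect the distinct partition values present in the freshly built tmp table
declare dbt_partitions_for_replacement array<date>;
set (dbt_partitions_for_replacement) = (
    select as struct
        array_agg(distinct date(event_date_dt))
    from `project_id`.`analytics_models`.`ga4_stg_events__dbt_tmp`
);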
The merge query, however, then processes 15 GB. The 3 days it's replacing are approximately 6 GB, so how is this scaling to 15 GB?
merge into `project_id`.`analytics_models`.`ga4_stg_events` as DBT_INTERNAL_DEST
using (
select
* from `project_id`.`analytics_models`.`ga4_stg_events__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on FALSE
when not matched by source
and date(DBT_INTERNAL_DEST.event_date_dt) in unnest(dbt_partitions_for_replacement)
then delete
when not matched then insert
(`event_date_dt`, `event_timestamp`, another 30 columns truncated for brevity)
values
(`event_date_dt`, `event_timestamp`, another 30 columns truncated for brevity)
The previous issue suggested the problem was dbt wrapping the partition column in date():
and date(DBT_INTERNAL_DEST.event_date_dt) in unnest(dbt_partitions_for_replacement)
However, in this case removing the date() wrapper does not change anything.
I think a temporary fix is to turn on copy_partitions, because the first part of the query is approximately 6 GB and I believe the copy-partitions step is free. (Although if you use copy_partitions, dbt doesn't record query size for the first steps, so I'm validating by running the queries manually in BigQuery and measuring.)
But is this a bug? It doesn't feel like intended behaviour.
partition_by={
    "field": "event_date_dt",
    "data_type": "date",
    "copy_partitions": true
},
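(For the manual validation mentioned above, one option is to read the bytes each statement actually billed back out of INFORMATION_SCHEMA; a minimal sketch, assuming your jobs run in the region-us location:)

-- bytes processed/billed per recent job; adjust the region to your dataset's location
select
    job_id,
    creation_time,
    statement_type,
    total_bytes_processed,
    total_bytes_billed
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where creation_time > timestamp_sub(current_timestamp(), interval 1 day)
    and statement_type in ('MERGE', 'CREATE_TABLE_AS_SELECT')
order by creation_time desc;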
Expected Behavior
I would expect only 3 partitions to be scanned, which would cost only ~6 GB.
Steps To Reproduce
- Generate a table partitioned by date
- Run an incremental build
Relevant log output
No response
Environment
1. dbt-core: 1.5
2. dbt-bigquery: 1.5
3. OS: Windows 10
4. Python: 3.8.6
Additional Context
No response
Thanks for raising this, @dom-devel!
Do you know of any way, other than bytes processed, to inspect how many partitions are pruned vs. how many are scanned in BigQuery?
If there were some method with deterministic results for getting the number of skipped partitions vs. scanned ones, that would make it easier to create functional tests for your report (and others).
Hmm no unfortunately not :(
When we're building models we always do a full-refresh and then a non-full refresh as part of the building process.
At that point we tend to catch it if the wrong number of partitions is being scanned, but it's definitely not a great process, as these issues often only get spotted on the really big tables, when someone notices we're re-running a 40 GB query on an incremental run.
The merge query scans both the source table and the target table. I think the total size of both tables is probably 15 GB.
I'm also running into this problem of BigQuery running a full scan of the destination table during incremental updates, which is costly when the table is large. I have confirmed that this part of the merge into statement:
and datetime_trunc(DBT_INTERNAL_DEST.dt_column, day) in unnest(dbt_partitions_for_replacement)
has no effect on the amount of data processed by BQ -- BQ says "This query will process X when run", where X equals the current size of the destination table, whether or not the above line is commented out.
My destination table has the following config:
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            "field": "dt_column",
            "data_type": "datetime",
            "granularity": "day"
        },
    )
}}
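(One rough way to isolate the destination-side scan for a config like this is to paste variants of the delete-side predicate into the BigQuery console and compare the "This query will process ..." estimate; the table name below is a placeholder:)

-- 1) the wrapped predicate, mirroring what the generated merge uses
select count(*)
from `project_id`.`dataset`.`my_incremental_model`
where datetime_trunc(dt_column, day) in (datetime '2024-01-01');

-- 2) a plain range filter on the partition column, which BigQuery can always prune on
select count(*)
from `project_id`.`dataset`.`my_incremental_model`
where dt_column >= datetime '2024-01-01'
  and dt_column < datetime '2024-01-02';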
Any update on the issue?
@ajrheaume Your problem is probably related to #717; this issue isn't related.
This problem occurs when data_type is datetime.
I have submitted a pull request for the issue, but it hasn't been reviewed yet: https://github.com/dbt-labs/dbt-bigquery/pull/993
@tnk-ysk Hi, I have a question.
Where did you find the info below in #993?
| data_type | granularity | transform | description |
|---|---|---|---|
| datetime | month | datetime_trunc(column, month) | full scan |
| datetime | day | datetime_trunc(column, day) | full scan |
| datetime | hour | datetime_trunc(column, hour) | full scan |
| datetime | month | date_trunc(column, month) | valid partition filter |
| datetime | day | date_trunc(column, day) | valid partition filter |
| datetime | hour | date_trunc(column, hour) | valid partition filter, but not supported |
I think partition pruning works well when the partition column is datetime with day granularity, but I don't understand how pruning partitions with date_trunc works.
Can I find it in the official documentation?
@jx2lee
> Can I find it in the official documentation?
I haven't found any official documentation, but I asked Google's BigQuery team directly to confirm the supported behavior. Their answer was that they only support date_trunc, and that we should submit a feature request if we want anything else.
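(For anyone hitting this on a day-partitioned datetime column, here is a minimal sketch of the two predicate shapes from the table above; the table and column names are placeholders, and the full-scan vs. pruned behavior is per Google's answer rather than anything officially documented:)

declare parts array<datetime> default [datetime '2024-01-01'];

-- datetime_trunc on the partition column: the estimator reports a full scan
select count(*)
from `project_id`.`dataset`.`events`
where datetime_trunc(dt_column, day) in unnest(parts);

-- date_trunc on the same column: recognized as a partition filter
select count(*)
from `project_id`.`dataset`.`events`
where date_trunc(dt_column, day) in unnest(parts);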
@tnk-ysk Thanks for the reply. I'll take your feedback into consideration! 🙏