dbt-bigquery
[ADAP-538] [Bug] Date incremental tables scanning more of table than expected
Is this a new bug in dbt-bigquery?
- [X] I believe this is a new bug in dbt-bigquery
- [X] I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
What's the problem?
I think I'm seeing something similar to this previous issue.
I have a table where every day's partition is approximately 2 GB. Config:
{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "event_date_dt",
            "data_type": "date",
        },
        cluster_by=["event_name"],
    )
}}
I'm doing incremental builds on this table in order to reduce the size of runs. However, when the incremental build runs, it appears to scan more of the table than I'd expect.
Here's the whole run (with the delete tmp step stripped so I could inspect the tmp table).
The initial incremental build picks up 3 days here which is approximately 6GB (as expected).
It then picks the 3 days to be replaced:
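(The partition-selection step generated by the insert_overwrite strategy looks roughly like the sketch below; the exact SQL varies by dbt-bigquery version, and the table and column names are taken from the config above.)

-- sketch: collect the distinct partition values present in the freshly built tmp table
declare dbt_partitions_for_replacement array<date>;
set (dbt_partitions_for_replacement) = (
    select as struct
        array_agg(distinct date(event_date_dt))
    from `project_id`.`analytics_models`.`ga4_stg_events__dbt_tmp`
);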
The merge query, however, then processes 15 GB. The 3 days it's replacing are approximately 6 GB, so how is this scaling to 15 GB?
merge into `project_id`.`analytics_models`.`ga4_stg_events` as DBT_INTERNAL_DEST
using (
select
* from `project_id`.`analytics_models`.`ga4_stg_events__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on FALSE
when not matched by source
and date(DBT_INTERNAL_DEST.event_date_dt) in unnest(dbt_partitions_for_replacement)
then delete
when not matched then insert
(`event_date_dt`, `event_timestamp`, another 30 columns truncated for brevity)
values
(`event_date_dt`, `event_timestamp`, another 30 columns truncated for brevity)
The previous issue suggested the problem was dbt wrapping the partition column in date():
and date(DBT_INTERNAL_DEST.event_date_dt) in unnest(dbt_partitions_for_replacement)
However, in this case removing the date() wrapper does not change anything.
I think a temporary fix is to turn on copy_partitions, because the first part of the query is approximately 6 GB and I believe the copy-partitions step is free. (Although if you use copy_partitions, dbt doesn't record query size for the first steps, so I'm validating by running the queries manually in BigQuery and measuring.)
But is this a bug? It doesn't feel like intended behaviour.
partition_by={
    "field": "event_date_dt",
    "data_type": "date",
    "copy_partitions": true
},
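(For the manual validation mentioned above, one option is to read the bytes each statement actually billed back out of INFORMATION_SCHEMA; a minimal sketch, assuming your jobs run in the region-us location:)

-- bytes processed/billed per recent job; adjust the region to your dataset's location
select
    job_id,
    creation_time,
    statement_type,
    total_bytes_processed,
    total_bytes_billed
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where creation_time > timestamp_sub(current_timestamp(), interval 1 day)
    and statement_type in ('MERGE', 'CREATE_TABLE_AS_SELECT')
order by creation_time desc;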
Expected Behavior
I would expect only 3 partitions to be scanned, which would cost only ~6 GB.
Steps To Reproduce
- Generate a table partitioned by date
- Run an incremental build
Relevant log output
No response
Environment
1. dbt-core: 1.5
2. dbt-bigquery: 1.5
3. OS: Windows 10
4. Python: 3.8.6
Additional Context
No response
Thanks for raising this, @dom-devel!
Do you know of any way, other than bytes processed, to inspect how many partitions are pruned vs. how many are scanned in BigQuery?
If there were some method with deterministic results for getting the number of skipped partitions vs. scanned ones, that would make it easier to create functional tests for your report (and others).
Hmm no unfortunately not :(
When we're building models we always do a full-refresh and then a non-full refresh as part of the building process.
At that point we tend to catch it if the wrong number of partitions is being scanned, but it's definitely not a great process, as these issues often only get spotted on the really big tables, when someone notices we're re-running a 40 GB query on an incremental run.
The merge query scans both the source table and the target table. I think the total size of both tables is probably 15 GB.
I'm also running into this problem of BigQuery running a full scan of the destination table during incremental updates, which is costly when the table is large. I have confirmed that this part of the merge into statement:
and datetime_trunc(DBT_INTERNAL_DEST.dt_column, day) in unnest(dbt_partitions_for_replacement)
has no effect on the amount of data processed by BQ -- BQ says "This query will process X when run", where X equals the current size of the destination table, whether or not the above line is commented out.
My destination table has the following config:
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            "field": "dt_column",
            "data_type": "datetime",
            "granularity": "day"
        },
    )
}}
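(One rough way to isolate the destination-side scan for a config like this is to paste variants of the delete-side predicate into the BigQuery console and compare the "This query will process ..." estimate; the table name below is a placeholder:)

-- 1) the wrapped predicate, mirroring what the generated merge uses
select count(*)
from `project_id`.`dataset`.`my_incremental_model`
where datetime_trunc(dt_column, day) in (datetime '2024-01-01');

-- 2) a plain range filter on the partition column, which BigQuery can always prune on
select count(*)
from `project_id`.`dataset`.`my_incremental_model`
where dt_column >= datetime '2024-01-01'
  and dt_column < datetime '2024-01-02';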
Any update on the issue?
@ajrheaume Your problem is probably related to #717; this issue isn't related.
This problem occurs when data_type is datetime.
I have submitted a pull request for the issue, but it hasn't been reviewed yet: https://github.com/dbt-labs/dbt-bigquery/pull/993
@tnk-ysk Hi, I have a question.
Where did you find the info below in #993?
| data_type | granularity | transform | description |
|---|---|---|---|
| datetime | month | datetime_trunc(column, month) | full scan |
| datetime | day | datetime_trunc(column, day) | full scan |
| datetime | hour | datetime_trunc(column, hour) | full scan |
| datetime | month | date_trunc(column, month) | valid partition filter |
| datetime | day | date_trunc(column, day) | valid partition filter |
| datetime | hour | date_trunc(column, hour) | valid partition filter, but not supported |
I think partition pruning works well when the partition column is datetime with day granularity, but I don't understand how pruning partitions with date_trunc works.
Can I find it in the official documentation?
@jx2lee
> Can I find it in the official documentation?
I haven't found any official documentation, but I asked Google's BigQuery team directly to confirm the supported behavior. Their answer was that they only support date_trunc, and that we should submit a feature request if we want anything else.
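(For anyone hitting this on a day-partitioned datetime column, here is a minimal sketch of the two predicate shapes from the table above; the table and column names are placeholders, and the full-scan vs. pruned behavior is per Google's answer rather than anything officially documented:)

declare parts array<datetime> default [datetime '2024-01-01'];

-- datetime_trunc on the partition column: the estimator reports a full scan
select count(*)
from `project_id`.`dataset`.`events`
where datetime_trunc(dt_column, day) in unnest(parts);

-- date_trunc on the same column: recognized as a partition filter
select count(*)
from `project_id`.`dataset`.`events`
where date_trunc(dt_column, day) in unnest(parts);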
@tnk-ysk Thanks for the reply. I'll take your feedback into consideration! 🙏