
[ADAP-624] [CT-2694] [Bug] First SQL Query to Create Incremental Model Processing Does not Show Up in dbt Logs

Open codigo-ergo-sum opened this issue 2 years ago • 6 comments

Is this a new bug in dbt-core?

  • [X] I believe this is a new bug in dbt-core
  • [X] I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When using the merge strategy for incremental models, dbt runs two queries on a non-full-refresh run:

  1. A query to create a temp table, named something like my_model_name__dbt_tmp, containing the rows selected by the incremental filter
  2. A MERGE statement to take the rows in the temporary table and merge them into the destination table.
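For concreteness, the two statements dbt issues on BigQuery look roughly like the sketch below. The project, dataset, table, and filter names are illustrative placeholders, not taken from an actual dbt log, and the exact SQL dbt generates differs in detail:

```sql
-- 1. Build the temp table containing only the rows selected by the
--    incremental filter. This is the scan whose bytes processed do NOT
--    appear in the dbt log line.
CREATE OR REPLACE TABLE `my_project.my_dataset.my_model_name__dbt_tmp`
AS
SELECT *
FROM `my_project.my_dataset.my_upstream_table`
WHERE event_date > (
  SELECT MAX(event_date) FROM `my_project.my_dataset.my_model_name`
);

-- 2. Merge the temp rows into the destination table. Only this
--    statement's bytes processed show up in the dbt log line.
MERGE INTO `my_project.my_dataset.my_model_name` AS dest
USING `my_project.my_dataset.my_model_name__dbt_tmp` AS src
ON dest.id = src.id
WHEN MATCHED THEN
  UPDATE SET event_date = src.event_date
WHEN NOT MATCHED THEN
  INSERT ROW;
```

If the upstream table is unpartitioned, step 1 scans it in full on every incremental run, which is exactly the hidden cost described below.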

The issue is that only the second query shows up in the log output, not the first. This is at least the case on BigQuery. What I mean is that I see this in the dbt logs:

21:36:51  1 of 1 START sql incremental model josh_sandbox.daily_moving_forecast .......... [RUN]
21:37:38  1 of 1 OK created sql incremental model josh_sandbox.daily_moving_forecast ..... [MERGE (70.1m rows, 29.0 GiB processed) in 46.99s]

But that 29.0 GiB processed reflects ONLY the second query, the MERGE statement. It doesn't include the additional data scanned by the first statement. This is a problem because under BigQuery's on-demand pricing model you're charged by data scanned, and we hadn't realized we had forgotten to partition an upstream table that this incremental model queries. The query creating the temp table was scanning 70 GB without partitioning; after partitioning the upstream table it dropped to 10 GB. We couldn't see the extra cost of that additional query anywhere in the dbt logs.

I checked the BigQuery query logs, and the start and end timestamps in the dbt log do in fact span the runtime of both queries, which ran sequentially one after the other.
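One way to confirm this from BigQuery's side is to query the jobs metadata views. The sketch below is illustrative only: the region qualifier, time window, and LIKE filter are assumptions you would adapt to your own project, and the model name is the one from the log output above:

```sql
-- List both statements of the incremental run, with bytes processed
-- per statement. total_bytes_processed is what drives on-demand cost.
SELECT
  job_id,
  creation_time,
  statement_type,          -- e.g. CREATE_TABLE_AS_SELECT vs. MERGE
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND query LIKE '%daily_moving_forecast%'
ORDER BY creation_time;
```

Summing total_bytes_processed across both rows gives the true all-up scan cost of the model, as opposed to the MERGE-only figure in the dbt log.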

Expected Behavior

I would expect to see two separate lines in the dbt log, the first reflecting the query creating the temp table with the desired rows for the incremental, and the second reflecting the MERGE statement.

I'm reporting this as a general issue rather than a dbt-bigquery adapter issue because I think displaying two separate lines would be cleaner in the logs in general, and other adapters also support the merge strategy. It would also make it easier, in the dbt-bigquery case, to display the data-scanned cost of the first of the two queries.

Steps To Reproduce

  1. Do an incremental (non-full-refresh) run of a dbt model on BigQuery
  2. Notice that the dbt log shows only one line for the model, reflecting the MERGE statement (even though its start and end timestamps span both queries), and the reported bytes processed covers only the MERGE statement.

Relevant log output

See above

Environment

- OS: 13.3.1 (a) 
- Python: Python 3.9.6
- dbt: 1.5.1

Which database adapter are you using with dbt?

bigquery

Additional Context

No response

codigo-ergo-sum avatar Jun 13 '23 21:06 codigo-ergo-sum

Thank you as always for an insightful write-up @codigo-ergo-sum 🤩

I understand your rationale for opening this as a general issue in dbt-core rather than a dbt-bigquery adapter issue, but I'm going to transfer it there anyway for a few reasons:

  1. The issue you are experiencing only applies to dbt-bigquery currently
  2. Given 1), we can do a better job of assessing and tracking it within dbt-bigquery
  3. We can always transfer it back to dbt-core later if needed 😎

From there, either I or someone else like @dataders will give this a closer read and respond.

dbeatty10 avatar Jun 15 '23 20:06 dbeatty10

Got it, makes sense, thank you :) @dbeatty10 .

codigo-ergo-sum avatar Jun 15 '23 21:06 codigo-ergo-sum

Thanks again for opening this issue @codigo-ergo-sum 🧠

You're right!

I can definitely see practical value of what you are describing -- it would be nice to see (a proxy of) the all-up costs of running a particular model.

The fact that intermediate queries don't roll up affects row counts on all adapters. Bytes processed exists only in dbt-bigquery (to the best of my knowledge), and, as you described, it also doesn't roll up. Although understandably inconvenient, this is known and expected behavior at this time, so I'm going to re-label this as a feature request rather than a bug.

Refinement needed

I don't see an elegant, focused way to achieve the outcome you're proposing: rolling up sums of row counts, bytes processed, etc. I'm going to label this for further refinement so we can get thoughts from our Product Managers (PMs).

Further ideas from members of the community are welcome as well! For example, here is one proof-of-concept (POC) for emitting an artifact with the relevant atomic metrics.

Workaround

In the meantime, our advice would be to calculate full query costs via a separate mechanism (potentially something like the leaner_query package, which I haven't personally tried yet).

dbeatty10 avatar Aug 10 '23 13:08 dbeatty10

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] avatar Feb 07 '24 01:02 github-actions[bot]

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

github-actions[bot] avatar Feb 14 '24 01:02 github-actions[bot]

Any updates on this? I'm very interested in this feature as well; can we reopen this ticket?

SPTKL avatar Apr 10 '24 18:04 SPTKL

Closing as a duplicate of https://github.com/dbt-labs/dbt-bigquery/issues/602 so we can consolidate the discussion there.

dbeatty10 avatar May 09 '24 01:05 dbeatty10