dbt-bigquery
[CT-3486] [Feature] Include JSON-formatted query comments for all queries within incremental models
Is this a new bug in dbt-core?
- [X] I believe this is a new bug in dbt-core
- [X] I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
I've noticed recently that the MERGE statement for incremental models in BigQuery doesn't get the JSON query-comment decorator prepended the way other dbt queries do. For example, here's an incremental merge statement from yesterday:
merge into `onprem-analytics`.`analytics`.`fact_daily_moving_forecast` as DBT_INTERNAL_DEST
using (
select
* from `onprem-analytics`.`analytics`.`fact_daily_moving_forecast__dbt_tmp`) as DBT_INTERNAL_SOURCE
on FALSE
when not matched by source
and date(DBT_INTERNAL_DEST.forecast_dt) in (
current_date, date_sub(current_date, interval 1 day), date_sub(current_date, interval 2 day), date_sub(current_date, interval 3 day), date_sub(current_date, interval 4 day), date_sub(current_date, interval 5 day), date_sub(current_date, interval 6 day), date_sub(current_date, interval 7 day), date_sub(current_date, interval 8 day), date_sub(current_date, interval 9 day), date_sub(current_date, interval 10 day), date_sub(current_date, interval 11 day), date_sub(current_date, interval 12 day), date_sub(current_date, interval 13 day), date_sub(current_date, interval 14 day), date_sub(current_date, interval 15 day), date_sub(current_date, interval 16 day), date_sub(current_date, interval 17 day), date_sub(current_date, interval 18 day), date_sub(current_date, interval 19 day), date_sub(current_date, interval 20 day), date_sub(current_date, interval 21 day), date_sub(current_date, interval 22 day), date_sub(current_date, interval 23 day), date_sub(current_date, interval 24 day), date_sub(current_date, interval 25 day), date_sub(current_date, interval 26 day), date_sub(current_date, interval 27 day), date_sub(current_date, interval 28 day), date_sub(current_date, interval 29 day), date_sub(current_date, interval 30 day), date_sub(current_date, interval 31 day), date_sub(current_date, interval 32 day), date_sub(current_date, interval 33 day), date_sub(current_date, interval 34 day), date_sub(current_date, interval 35 day), date_sub(current_date, interval 36 day), date_sub(current_date, interval 37 day), date_sub(current_date, interval 38 day), date_sub(current_date, interval 39 day), date_sub(current_date, interval 40 day), date_sub(current_date, interval 41 day), date_sub(current_date, interval 42 day), date_sub(current_date, interval 43 day), date_sub(current_date, interval 44 day), date_sub(current_date, interval 45 day), date_sub(current_date, interval 46 day), date_sub(current_date, interval 47 day), date_sub(current_date, interval 48 day), date_sub(current_date, interval 49 day), date_sub(current_date, interval 50 day), date_sub(current_date, interval 51 day), date_sub(current_date, interval 52 day), date_sub(current_date, interval 53 day), date_sub(current_date, interval 54 day), date_sub(current_date, interval 55 day), date_sub(current_date, interval 56 day), date_sub(current_date, interval 57 day), date_sub(current_date, interval 58 day), date_sub(current_date, interval 59 day), date_sub(current_date, interval 60 day), date_sub(current_date, interval 61 day), date_sub(current_date, interval 62 day), date_sub(current_date, interval 63 day), date_sub(current_date, interval 64 day), date_sub(current_date, interval 65 day), date_sub(current_date, interval 66 day), date_sub(current_date, interval 67 day), date_sub(current_date, interval 68 day), date_sub(current_date, interval 69 day), date_sub(current_date, interval 70 day), date_sub(current_date, interval 71 day), date_sub(current_date, interval 72 day), date_sub(current_date, interval 73 day), date_sub(current_date, interval 74 day), date_sub(current_date, interval 75 day), date_sub(current_date, interval 76 day), date_sub(current_date, interval 77 day), date_sub(current_date, interval 78 day), date_sub(current_date, interval 79 day), date_sub(current_date, interval 80 day), date_sub(current_date, interval 81 day), date_sub(current_date, interval 82 day), date_sub(current_date, interval 83 day), date_sub(current_date, interval 84 day), date_sub(current_date, interval 85 day), date_sub(current_date, interval 86 day), date_sub(current_date, 
interval 87 day), date_sub(current_date, interval 88 day), date_sub(current_date, interval 89 day), date_sub(current_date, interval 90 day), date_sub(current_date, interval 91 day), date_sub(current_date, interval 92 day), date_sub(current_date, interval 93 day), date_sub(current_date, interval 94 day), date_sub(current_date, interval 95 day), date_sub(current_date, interval 96 day), date_sub(current_date, interval 97 day), date_sub(current_date, interval 98 day), date_sub(current_date, interval 99 day), date_sub(current_date, interval 100 day), date_sub(current_date, interval 101 day), date_sub(current_date, interval 102 day), date_sub(current_date, interval 103 day), date_sub(current_date, interval 104 day), date_sub(current_date, interval 105 day), date_sub(current_date, interval 106 day), date_sub(current_date, interval 107 day), date_sub(current_date, interval 108 day), date_sub(current_date, interval 109 day), date_sub(current_date, interval 110 day), date_sub(current_date, interval 111 day), date_sub(current_date, interval 112 day), date_sub(current_date, interval 113 day), date_sub(current_date, interval 114 day), date_sub(current_date, interval 115 day), date_sub(current_date, interval 116 day), date_sub(current_date, interval 117 day), date_sub(current_date, interval 118 day), date_sub(current_date, interval 119 day), date_sub(current_date, interval 120 day)
)
then delete
when not matched then insert
(`fact_daily_moving_forecast_key`, `opportunity_title_key`, `opportunity_title_current_key`, `forecast_dt`, `work_dt`, `dim_person_key`, `dim_person_current_key`, `dim_client_account_current_key`, `dim_project_key`, `dim_department_key`, `dim_project_current_key`, `dim_financial_account_key`, `probability_pct`, `probability_of_title_pct`, `expected_full_hours_per_day`, `expected_bill_hours_per_day`, `expected_non_billable_hours_per_day`, `expected_absence_hours_per_day`, `unpaid_absence_hours_per_day`, `standard_monthly_cost`, `unweighted_standard_cost_per_day`, `weighted_allocation_cost_pct_per_day`, `adjusted_weighted_allocation_cost_pct_per_day`, `adjusted_weighted_allocation_standard_cost_per_day`, `expected_internal_bill_hours_per_day`, `expected_bill_hours_per_title_99_pct_plus`, `expected_internal_bill_hours_per_title_99_pct_plus`, `expected_bill_hours_per_title_80_pct_plus`, `expected_internal_bill_hours_per_title_80_pct_plus`, `expected_bill_hours_per_title_50_pct_plus`, `expected_internal_bill_hours_per_title_50_pct_plus`, `expected_bill_hours_per_title_1_pct_plus`, `expected_internal_bill_hours_per_title_1_pct_plus`, `expected_bill_hours_less_expected_absence_hours_per_day`, `expected_internal_bill_hours_less_expected_absence_hours_per_day`, `hours_logged`, `billable_hours_logged`, `non_allocated_hours_logged`, `internal_hours_logged`, `qb_last_updated_ts`, `target_rate`, `account_specific_practice_hourly_rate`, `account_specific_general_hourly_rate`, `generic_practice_hourly_rate`, `generic_general_hourly_rate`, `applicable_account_hourly_rate`, `applicable_generic_hourly_rate`, `all_expected_fte_days`, `billable_expected_fte_days`, `expected_bill_amount_per_day`, `expected_bill_amount_per_hour`, `expected_billable_hours_per_day`, `expected_vs_actual_hours_variance`, `expected_bill_hours_less_expected_absence_hours_per_title_99_pct_plus`, `expected_internal_bill_hours_less_expected_absence_hours_per_title_99_pct_plus`, `expected_bill_hours_less_expected_absence_hours_per_title_80_pct_plus`, `expected_internal_bill_hours_less_expected_absence_hours_per_title_80_pct_plus`, `expected_bill_hours_less_expected_absence_hours_per_title_50_pct_plus`, `expected_internal_bill_hours_less_expected_absence_hours_per_title_50_pct_plus`, `expected_bill_hours_less_expected_absence_hours_per_title_1_pct_plus`, `expected_internal_bill_hours_less_expected_absence_hours_per_title_1_pct_plus`, `account_specific_hypothetical_forecasted_revenue_per_day`, `generic_hypothetical_forecasted_revenue_per_day`, `account_specific_rate_leakage_per_day`, `generic_rate_leakage_per_day`)
values
(`fact_daily_moving_forecast_key`, `opportunity_title_key`, `opportunity_title_current_key`, `forecast_dt`, `work_dt`, `dim_person_key`, `dim_person_current_key`, `dim_client_account_current_key`, `dim_project_key`, `dim_department_key`, `dim_project_current_key`, `dim_financial_account_key`, `probability_pct`, `probability_of_title_pct`, `expected_full_hours_per_day`, `expected_bill_hours_per_day`, `expected_non_billable_hours_per_day`, `expected_absence_hours_per_day`, `unpaid_absence_hours_per_day`, `standard_monthly_cost`, `unweighted_standard_cost_per_day`, `weighted_allocation_cost_pct_per_day`, `adjusted_weighted_allocation_cost_pct_per_day`, `adjusted_weighted_allocation_standard_cost_per_day`, `expected_internal_bill_hours_per_day`, `expected_bill_hours_per_title_99_pct_plus`, `expected_internal_bill_hours_per_title_99_pct_plus`, `expected_bill_hours_per_title_80_pct_plus`, `expected_internal_bill_hours_per_title_80_pct_plus`, `expected_bill_hours_per_title_50_pct_plus`, `expected_internal_bill_hours_per_title_50_pct_plus`, `expected_bill_hours_per_title_1_pct_plus`, `expected_internal_bill_hours_per_title_1_pct_plus`, `expected_bill_hours_less_expected_absence_hours_per_day`, `expected_internal_bill_hours_less_expected_absence_hours_per_day`, `hours_logged`, `billable_hours_logged`, `non_allocated_hours_logged`, `internal_hours_logged`, `qb_last_updated_ts`, `target_rate`, `account_specific_practice_hourly_rate`, `account_specific_general_hourly_rate`, `generic_practice_hourly_rate`, `generic_general_hourly_rate`, `applicable_account_hourly_rate`, `applicable_generic_hourly_rate`, `all_expected_fte_days`, `billable_expected_fte_days`, `expected_bill_amount_per_day`, `expected_bill_amount_per_hour`, `expected_billable_hours_per_day`, `expected_vs_actual_hours_variance`, `expected_bill_hours_less_expected_absence_hours_per_title_99_pct_plus`, `expected_internal_bill_hours_less_expected_absence_hours_per_title_99_pct_plus`, `expected_bill_hours_less_expected_absence_hours_per_title_80_pct_plus`, `expected_internal_bill_hours_less_expected_absence_hours_per_title_80_pct_plus`, `expected_bill_hours_less_expected_absence_hours_per_title_50_pct_plus`, `expected_internal_bill_hours_less_expected_absence_hours_per_title_50_pct_plus`, `expected_bill_hours_less_expected_absence_hours_per_title_1_pct_plus`, `expected_internal_bill_hours_less_expected_absence_hours_per_title_1_pct_plus`, `account_specific_hypothetical_forecasted_revenue_per_day`, `generic_hypothetical_forecasted_revenue_per_day`, `account_specific_rate_leakage_per_day`, `generic_rate_leakage_per_day`)
No JSON decorator is prefixed. This is annoying because our diagnostic queries, which parse BigQuery job logs, don't properly categorize these queries. Would it be possible to add the JSON decorator?
I had thought this behavior was controlled by dbt-core rather than the individual adapters, so I'm logging the issue in dbt-core, but if I'm wrong please feel free to move it over to dbt-bigquery.
Expected Behavior
The JSON decorator would show up at the beginning of merge statements like it does for other dbt-generated queries, e.g.: /* {"app": "dbt", "dbt_version": "1.6.9", "profile_name": "user", "target_name": "prod-cloud", "node_id": "model.op_analytics.bb_tbl_custombandtitleinformation_filter"} */
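To illustrate (the node_id value below is just a placeholder, not our actual node), the merge statement above would ideally start out like this:
/* {"app": "dbt", "dbt_version": "1.6.9", "profile_name": "user", "target_name": "prod-cloud", "node_id": "model.op_analytics.fact_daily_moving_forecast"} */
merge into `onprem-analytics`.`analytics`.`fact_daily_moving_forecast` as DBT_INTERNAL_DEST
using (
    select * from `onprem-analytics`.`analytics`.`fact_daily_moving_forecast__dbt_tmp`) as DBT_INTERNAL_SOURCE
on FALSE
-- ...rest of the merge statement as shown above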
Steps To Reproduce
- Set up a dbt model with a materialization of incremental. If it helps, here's our configuration for the model:
  {{ config(
      partition_by={ "field": "forecast_dt", "data_type": "date", "granularity": "day" },
      cluster_by = "dim_person_current_key",
      materialized = 'incremental',
      incremental_strategy = 'insert_overwrite',
      partitions = partitions_to_replace,
      on_schema_change = 'fail'
  ) }}
- Run the model initially, then run it again so it executes an incremental run
- Look for the MERGE statement lacking the dbt JSON decorator
Relevant log output
No response
Environment
- OS: dbt cloud and also locally on Mac OS X
- Python: Python 3.9.6
- dbt: 1.6.9
Which database adapter are you using with dbt?
bigquery
Additional Context
No response
Hey @codigo-ergo-sum !
We hear you about not being able to parse the query comments out of the BigQuery job labels when there are queries within the incremental materialization steps that don't include them.
I'm going to re-categorize this as a feature request and will assess this further with our friend @graciegoheen.
Thanks @dbeatty10 much appreciated!
@codigo-ergo-sum Happy new year!
It's surprising to me that you're not seeing the query comment prepended before the merge statement, although I would expect that if dbt is submitting a BQ "script" (multi-statement query), the query comment will be included once at the very beginning.
I tried reproducing, and I do see the query comment being prepended as I would expect. Here's my simplified version of the incremental model in your example:
-- models/my_bq_model.sql
{%- set partitions_to_replace = ["current_date"] -%}
{%- for i in range(1, 121) -%}
{%- set partition_str = "date_sub(current_date, interval " ~ i ~ " day)" -%}
{%- do partitions_to_replace.append(partition_str) -%}
{%- endfor -%}
{{ config(
partition_by={ "field": "forecast_dt", "data_type": "date", "granularity": "day" },
cluster_by = "dim_person_current_key",
materialized = 'incremental',
incremental_strategy = 'insert_overwrite',
partitions = partitions_to_replace,
on_schema_change='fail' )
}}
select 1 as dim_person_current_key, current_date as forecast_dt
In dbt's logs (logs/dbt.log):
12:49:28.229015 [debug] [Thread-1 (]: On model.my_dbt_project.my_bq_model: /* {"app": "dbt", "dbt_version": "1.6.9", "profile_name": "sandbox-bigquery", "target_name": "dev", "node_id": "model.my_dbt_project.my_bq_model"} */
create or replace table `dbt-test-env`.`dbt_jcohen`.`my_bq_model__dbt_tmp`
partition by forecast_dt
cluster by dim_person_current_key
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour)
)
as (
select 1 as dim_person_current_key, current_date as forecast_dt
);
12:49:28.616472 [debug] [Thread-1 (]: BigQuery adapter: https://console.cloud.google.com/bigquery?project=dbt-test-env&j=bq:US:9a42bba8-f66f-454f-8ab6-fc2d839e61ec&page=queryresults
12:49:31.685346 [debug] [Thread-1 (]:
In `dbt-test-env`.`dbt_jcohen`.`my_bq_model`:
Schema changed: False
Source columns not in target: []
Target columns not in source: []
New column types: []
12:49:31.707019 [debug] [Thread-1 (]: Writing runtime sql for node "model.my_dbt_project.my_bq_model"
12:49:31.709715 [debug] [Thread-1 (]: On model.my_dbt_project.my_bq_model: /* {"app": "dbt", "dbt_version": "1.6.9", "profile_name": "sandbox-bigquery", "target_name": "dev", "node_id": "model.my_dbt_project.my_bq_model"} */
-- 1. run the merge statement
merge into `dbt-test-env`.`dbt_jcohen`.`my_bq_model` as DBT_INTERNAL_DEST
using (
select
* from `dbt-test-env`.`dbt_jcohen`.`my_bq_model__dbt_tmp`) as DBT_INTERNAL_SOURCE
on FALSE
when not matched by source
and date(DBT_INTERNAL_DEST.forecast_dt) in (
current_date, date_sub(current_date, interval 1 day), date_sub(current_date, interval 2 day), date_sub(current_date, interval 3 day), date_sub(current_date, interval 4 day), date_sub(current_date, interval 5 day), date_sub(current_date, interval 6 day), date_sub(current_date, interval 7 day), date_sub(current_date, interval 8 day), date_sub(current_date, interval 9 day), date_sub(current_date, interval 10 day), date_sub(current_date, interval 11 day), date_sub(current_date, interval 12 day), date_sub(current_date, interval 13 day), date_sub(current_date, interval 14 day), date_sub(current_date, interval 15 day), date_sub(current_date, interval 16 day), date_sub(current_date, interval 17 day), date_sub(current_date, interval 18 day), date_sub(current_date, interval 19 day), date_sub(current_date, interval 20 day), date_sub(current_date, interval 21 day), date_sub(current_date, interval 22 day), date_sub(current_date, interval 23 day), date_sub(current_date, interval 24 day), date_sub(current_date, interval 25 day), date_sub(current_date, interval 26 day), date_sub(current_date, interval 27 day), date_sub(current_date, interval 28 day), date_sub(current_date, interval 29 day), date_sub(current_date, interval 30 day), date_sub(current_date, interval 31 day), date_sub(current_date, interval 32 day), date_sub(current_date, interval 33 day), date_sub(current_date, interval 34 day), date_sub(current_date, interval 35 day), date_sub(current_date, interval 36 day), date_sub(current_date, interval 37 day), date_sub(current_date, interval 38 day), date_sub(current_date, interval 39 day), date_sub(current_date, interval 40 day), date_sub(current_date, interval 41 day), date_sub(current_date, interval 42 day), date_sub(current_date, interval 43 day), date_sub(current_date, interval 44 day), date_sub(current_date, interval 45 day), date_sub(current_date, interval 46 day), date_sub(current_date, interval 47 day), date_sub(current_date, interval 48 day), date_sub(current_date, interval 49 day), date_sub(current_date, interval 50 day), date_sub(current_date, interval 51 day), date_sub(current_date, interval 52 day), date_sub(current_date, interval 53 day), date_sub(current_date, interval 54 day), date_sub(current_date, interval 55 day), date_sub(current_date, interval 56 day), date_sub(current_date, interval 57 day), date_sub(current_date, interval 58 day), date_sub(current_date, interval 59 day), date_sub(current_date, interval 60 day), date_sub(current_date, interval 61 day), date_sub(current_date, interval 62 day), date_sub(current_date, interval 63 day), date_sub(current_date, interval 64 day), date_sub(current_date, interval 65 day), date_sub(current_date, interval 66 day), date_sub(current_date, interval 67 day), date_sub(current_date, interval 68 day), date_sub(current_date, interval 69 day), date_sub(current_date, interval 70 day), date_sub(current_date, interval 71 day), date_sub(current_date, interval 72 day), date_sub(current_date, interval 73 day), date_sub(current_date, interval 74 day), date_sub(current_date, interval 75 day), date_sub(current_date, interval 76 day), date_sub(current_date, interval 77 day), date_sub(current_date, interval 78 day), date_sub(current_date, interval 79 day), date_sub(current_date, interval 80 day), date_sub(current_date, interval 81 day), date_sub(current_date, interval 82 day), date_sub(current_date, interval 83 day), date_sub(current_date, interval 84 day), date_sub(current_date, interval 85 day), date_sub(current_date, interval 86 day), date_sub(current_date, 
interval 87 day), date_sub(current_date, interval 88 day), date_sub(current_date, interval 89 day), date_sub(current_date, interval 90 day), date_sub(current_date, interval 91 day), date_sub(current_date, interval 92 day), date_sub(current_date, interval 93 day), date_sub(current_date, interval 94 day), date_sub(current_date, interval 95 day), date_sub(current_date, interval 96 day), date_sub(current_date, interval 97 day), date_sub(current_date, interval 98 day), date_sub(current_date, interval 99 day), date_sub(current_date, interval 100 day), date_sub(current_date, interval 101 day), date_sub(current_date, interval 102 day), date_sub(current_date, interval 103 day), date_sub(current_date, interval 104 day), date_sub(current_date, interval 105 day), date_sub(current_date, interval 106 day), date_sub(current_date, interval 107 day), date_sub(current_date, interval 108 day), date_sub(current_date, interval 109 day), date_sub(current_date, interval 110 day), date_sub(current_date, interval 111 day), date_sub(current_date, interval 112 day), date_sub(current_date, interval 113 day), date_sub(current_date, interval 114 day), date_sub(current_date, interval 115 day), date_sub(current_date, interval 116 day), date_sub(current_date, interval 117 day), date_sub(current_date, interval 118 day), date_sub(current_date, interval 119 day), date_sub(current_date, interval 120 day)
)
then delete
when not matched then insert
(`dim_person_current_key`, `forecast_dt`)
values
(`dim_person_current_key`, `forecast_dt`)
;-- 2. clean up the temp table
drop table if exists `dbt-test-env`.`dbt_jcohen`.`my_bq_model__dbt_tmp`;
12:49:32.017791 [debug] [Thread-1 (]: BigQuery adapter: https://console.cloud.google.com/bigquery?project=dbt-test-env&j=bq:US:01171033-01b7-45fc-bc01-e20745dd6a2c&page=queryresults
And in BigQuery's query history:
Hi @jtcohen6 Happy New Year to you as well, and thanks for the as-always in-depth response!
So I did some digging into the BigQuery INFORMATION_SCHEMA and job history. It turns out that when you run a script, BigQuery creates one record in the JOBS table for the script as a whole, and then a separate record for each step of the script that is run. When you look at the JOBS table, the JSON query comment does indeed appear in the query text of the record representing the run of the script as a whole (which has a statement_type of SCRIPT).
However, the JSON query comment doesn't appear in the records BigQuery creates for each individual step of the script, which is why it didn't show up in the original MERGE statement I posted at the beginning of this ticket.
What is even more interesting: if you write queries against the JOBS table to calculate the cost of queries run on BigQuery, the bytes_billed field is populated both for the "parent" record for the script as a whole and for each of the individual sub-steps. The parent record's bytes_billed appears to be the sum of the bytes_billed of the individual sub-steps, which means that if you blindly SUM over all the records in the JOBS table you'll double-count bytes_billed for queries that were run as part of a script. This is probably why all of the example queries in the BigQuery documentation that do this kind of calculation specifically exclude records with a statement_type of SCRIPT, although the documentation never explicitly explains (that I can see) why they do that.
For example, here is a query that estimates slot usage and cost and excludes scripts:
SELECT
project_id,
job_id,
reservation_id,
EXTRACT(DATE FROM creation_time) AS creation_date,
TIMESTAMP_DIFF(end_time, start_time, SECOND) AS job_duration_seconds,
job_type,
user_email,
total_bytes_billed,
-- Average slot utilization per job is calculated by dividing total_slot_ms by the millisecond duration of the job
SAFE_DIVIDE(job.total_slot_ms,(TIMESTAMP_DIFF(job.end_time, job.start_time, MILLISECOND))) AS job_avg_slots,
query,
-- Determine the max number of slots used at ANY stage in the query.
-- The average slots might be 55. But a single stage might spike to 2000 slots.
-- This is important to know when estimating number of slots to purchase.
MAX(SAFE_DIVIDE(unnest_job_stages.slot_ms,unnest_job_stages.end_ms - unnest_job_stages.start_ms)) AS jobstage_max_slots,
-- Check if there's a job that requests more units of works (slots). If so you need more slots.
-- estimated_runnable_units = Units of work that can be scheduled immediately.
-- Providing additional slots for these units of work accelerates the query,
-- if no other query in the reservation needs additional slots.
MAX(unnest_timeline.estimated_runnable_units) AS estimated_runnable_units
FROM `region-us`.INFORMATION_SCHEMA.JOBS AS job
CROSS JOIN UNNEST(job_stages) as unnest_job_stages
CROSS JOIN UNNEST(timeline) AS unnest_timeline
WHERE project_id = 'my_project'
AND statement_type != 'SCRIPT'
AND DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE()
GROUP BY 1,2,3,4,5,6,7,8,9,10
ORDER BY job_id;
from here: https://cloud.google.com/bigquery/docs/information-schema-jobs#estimate_slot_usage_and_cost_for_queries
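As a side note for anyone else digging into this, here's a rough sketch of my own (not from the docs) that makes the parent/child structure visible by grouping each script's child statements under the parent job via parent_job_id:
-- Sketch: show SCRIPT parent jobs and their per-statement child jobs side by side,
-- so the duplicated bytes_billed between parent and children is easy to see.
select
  coalesce(job.parent_job_id, job.job_id) as script_job_id,
  job.job_id,
  job.statement_type,
  job.total_bytes_billed,
  job.query
from `region-us`.INFORMATION_SCHEMA.JOBS as job
where date(job.creation_time) = current_date
  and (job.statement_type = 'SCRIPT' or job.parent_job_id is not null)
order by script_job_id, job.job_id;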
It turns out that other useful fields are populated on the records for the individual script steps but not on the script-level record. A good example is destination_table.table_id, which is useful when you're trying to find out which target tables/objects are associated with more expensive queries. This is a reason why we need the details on the individual script steps together with the JSON query comments, and can't just use the script-level records. Also, given that we need to exclude script-level records from the queries we use to calculate BQ job costs anyway, having the JSON query comments only on those records isn't as helpful.
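To make that concrete, here's a quick sketch (again my own, not from the docs) of the per-table cost rollup we're after; it has to run off the per-statement child records because that's where destination_table is populated, and SCRIPT parent rows are excluded to avoid the double-counting described above:
-- Sketch: attribute billed bytes to destination tables using per-statement records only.
select
  job.destination_table.project_id as dest_project,
  job.destination_table.dataset_id as dest_dataset,
  job.destination_table.table_id as dest_table,
  sum(job.total_bytes_billed) as total_bytes_billed
from `region-us`.INFORMATION_SCHEMA.JOBS as job
where date(job.creation_time) >= date_sub(current_date, interval 7 day)
  and job.statement_type != 'SCRIPT'
  and job.destination_table.table_id is not null
group by 1, 2, 3
order by total_bytes_billed desc;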
I don't have a great solution for this at the moment, other than seeing whether there's a way to manually force the query comments to be included in the SQL query text of the individual script steps, which I could poke at in the next week or so. Definitely open to your suggestions on this. I realize I'm not providing a lot of solutions here, but I wanted to at least document the root cause of the issue I'm reporting so that you can hopefully reproduce it.
The other thing I might try for our internal BQ cost analytics is some sort of window function over the "parent" script record in JOBS to pull the JSON query comments from there, but that's pretty hacky and fragile if BigQuery changes how the JOBS table works in the future.
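For what it's worth, here's a rough sketch of that idea using a self-join on parent_job_id rather than a window function; it pulls the leading JSON comment out of the parent SCRIPT record's query text and attaches it to each child statement. The regex is illustrative and would need hardening:
-- Sketch: copy the dbt query comment from the parent SCRIPT job onto each child statement.
select
  child.job_id,
  child.statement_type,
  child.total_bytes_billed,
  regexp_extract(parent.query, r'^\s*/\*\s*(\{.*?\})\s*\*/') as dbt_query_comment_json
from `region-us`.INFORMATION_SCHEMA.JOBS as child
join `region-us`.INFORMATION_SCHEMA.JOBS as parent
  on child.parent_job_id = parent.job_id
where date(child.creation_time) >= date_sub(current_date, interval 7 day)
  and parent.statement_type = 'SCRIPT';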
@codigo-ergo-sum Thank you for the super-thorough digging on your side!
I now better understand the complexity here: each script is one job with multiple job_stages, and BigQuery's docs seem to encourage calculating usage/cost with the "job stage" as the atomic unit. I also think we're fully in the realm of BigQuery-specific behavior, so I'm going to transfer this issue to the dbt-bigquery repo.
I see three options ahead of us, and one of them is much more appealing than the others:
1. Within dbt-bigquery's module for connecting & submitting queries, rather than simply prepending query comments to the start of queries: identify scripts, split them on semicolons, prepend query comments before each individual statement, recombine into a single "script" query, and submit to BigQuery. This sounds very liable to error; we only do this kind of semicolon-splitting when we absolutely have to.
2. Expose the query comment to the Jinja context, e.g. as {{ query_comment }}, and do our very best to remember to stick it in multiple times whenever we are templating out a multi-statement SQL query. Today, we only really use BQ scripts for the insert-overwrite incremental strategy, but it feels liable to (human) error in the future.
3. Encourage dbt + BigQuery users to enable "query comments as job labels" (docs). This has the effect of applying dbt's query comment as a structured array of labels on the job, and applying it to every stage of the job.
I like option (3) best! Give it a try, and let me know what you think:
# dbt_project.yml
query-comment:
  job-label: true
select
labels.value as dbt_invocation_id,
job.query,
count(*) as count_job_stages,
sum(total_bytes_billed) as total_bytes_billed
from `region-us`.INFORMATION_SCHEMA.JOBS AS job
cross join unnest(job_stages) as unnest_job_stages
cross join unnest(labels) as labels
where project_id = 'dbt-test-env' # replace with your project name
and user_email = '[email protected]' # replace with your email
and statement_type != 'SCRIPT'
and date(creation_time) = current_date
and labels.key = 'dbt_invocation_id'
group by 1,2
Actually, dbt applies the dbt_invocation_id label regardless, even if you don't enable query-comments-as-job-labels. But turning on that config will give you access to the same full set of structured info that's available in the query comment.
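If it helps, here's a rough sketch of pivoting a few label keys into columns once that config is on. The exact label keys depend on your query-comment config; node_id and dbt_version below are assumptions, so check the labels array on a recent job to confirm what your setup actually emits:
-- Sketch: one row per job, with selected dbt labels pivoted into columns.
select
  job.job_id,
  job.statement_type,
  any_value(job.total_bytes_billed) as total_bytes_billed,
  max(if(lbl.key = 'dbt_invocation_id', lbl.value, null)) as dbt_invocation_id,
  max(if(lbl.key = 'node_id', lbl.value, null)) as node_id,         -- assumed label key
  max(if(lbl.key = 'dbt_version', lbl.value, null)) as dbt_version  -- assumed label key
from `region-us`.INFORMATION_SCHEMA.JOBS as job
cross join unnest(job.labels) as lbl
where date(job.creation_time) = current_date
  and job.statement_type != 'SCRIPT'
group by 1, 2;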
Without the opt-in config:
With the config:
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.