dbt-core icon indicating copy to clipboard operation
dbt-core copied to clipboard

[Feature] Make `sql_header` configuration available on tests

Open epapineau opened this issue 1 year ago • 11 comments

Is this your first time submitting a feature request?

  • [X] I have read the expectations for open source contributors
  • [X] I have searched the existing issues, and I could not find an existing issue for this feature
  • [X] I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

The sql_header model configuration is a convenient way to define a temporary BigQuery UDF, as described in the dbt docs. It would be helpful to have this configuration available on tests as well for the same purposes.

Describe alternatives you've considered

No response

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

epapineau avatar Mar 19 '24 01:03 epapineau

@epapineau ! So good to see you 😄

Are you thinking primarily data tests or unit tests or both? If data tests is included, singular or generic or both?

Could you share a simplified code example in which you'd want to define a temporary BigQuery UDF and a test that would use it?

dbeatty10 avatar Mar 19 '24 15:03 dbeatty10

Are you thinking primarily data tests or unit tests or both? If data tests is included, singular or generic or both?

Data tests! And probably both singular and generic - can't imagine having the ability in one and not the other

epapineau avatar Mar 19 '24 18:03 epapineau

Could you share a simplified code example in which you'd want to define a temporary BigQuery UDF and a test that would use it?

I've been working with the temporary UDF recommended here this week. I'm creating a temporary UDF using sql_headers in my model, but I'd like to be able to test the key values in a singular data test to confirm they can successfully cast to integers. However, I'm not able to create the temporary UDF in the test like I am in the model. LMK if you want something more explicit here 👍

epapineau avatar Mar 19 '24 18:03 epapineau

LMK if you want something more explicit here 👍

Something more explicit would be great! It would serve as a two-fer-one:

  1. It will help us more tangibly consider this proposal 🧠
  2. If we end up adopting this proposal, then we can drop a variation of the explicit example into pytest to make sure it works as expected 💪

dbeatty10 avatar Mar 19 '24 19:03 dbeatty10

Okay, here's an example using the UDF from the article linked above to get keys and values from JSON. In this example, I want the test to fail where test is null. I currently can't define a temp UDF in a test w/o sql_header config on a test node.

{%- call set_sql_header(config) -%}
    CREATE TEMP FUNCTION EXTRACT_KV_PAIRS(json_str STRING)
    RETURNS ARRAY<STRUCT<key STRING, value STRING>>
    LANGUAGE js AS """
    try{ 
        const json_dict = JSON.parse(json_str); 
        const all_kv = Object.entries(json_dict).map(
            (r)=>Object.fromEntries([["key", r[0]],["value",  
                                    JSON.stringify(r[1])]]));
        return all_kv;
    } catch(e) { return [{"key": "error","value": e}];}
    """;
{%- endcall -%}

-- sample JSON requiring a UDF to extract keys
WITH
sample_json AS(
  SELECT '{"1":{"status":"success"},"2":{"status":"success"},"NaN":{"status":"unsuccess"}}' AS payload
),
-- use UDF
extracted_pairs as (
  SELECT EXTRACT_KV_PAIRS(payload) as kv_list FROM sample_json
)
SELECT
    items.key as id,
    SAFE_CAST(items.key AS INTEGER) as test  -- Test fails for rows where test is null
FROM extracted_pairs
CROSS JOIN UNNEST(kv_list) as items

epapineau avatar Mar 20 '24 01:03 epapineau

Thanks for that example @epapineau !

I tried out your example, and surprising it works if (and only if) store_failures_as="table" (or store_failures=true).

Regardless, it makes sense for us to do the following:

  • Enable sql_header to apply to both singular and generic data tests regardless if they are storing failures or not.

I did a proof-of-concept in these draft PRs:

  • https://github.com/dbt-labs/dbt-adapters/pull/145
  • https://github.com/dbt-labs/dbt-core/pull/9854

Labeling this as help_wanted for someone in the community to bring across the finish line (adding test cases, etc.).

dbeatty10 avatar Apr 04 '24 21:04 dbeatty10

@dbeatty10 hi! Could i take your draft PRs? I've worked on good_first_issue issue in dbt-core, but I didn't contribute to dbt-adapter.

  • I'm wondering if I could do it,
  • and if so, is it okay to do on ref branch of draft-PR (e.g. dbeatty/9775-sql-header)?

jx2lee avatar May 23 '24 14:05 jx2lee

@dbeatty10 sql_header is not working on unit tests too, maybe we could take care of that one as well? Would https://github.com/dbt-labs/dbt-adapters/pull/145 changes also affect them? I would need to rework my local setup to test if you don't know 👀

github-christophe-oudar avatar May 27 '24 09:05 github-christophe-oudar

There is a patch for this at https://takemikami.com/2024/0921-dbt-bigquery-sql-header.html. It worked well when I tried it. I wonder what it would take to get it into the release.

szhou21 avatar Oct 28 '24 21:10 szhou21

I have opened 2 PRs to fix the issue based on @takemikami blogpost (I just tweaked it to make it work with up to date version):

  • https://github.com/dbt-labs/dbt-core/pull/11386
  • https://github.com/dbt-labs/dbt-adapters/pull/914

I used it locally and it works as intended.

@dbeatty10, how do you feel about that approach? I think it might not be the best approach nor the more generic one (it's just fixing it for BigQuery). The first PR has been closed and the second is still in draft. Was the solution working as intended? The approach from @takemikami would fix the unit tests but probably not for the data tests. So I wonder if we should try do it that way? Or maybe a mix of this approach and yours? I could add some tests if we agree on the way to do it. Thanks!

Kayrnt avatar Mar 28 '25 21:03 Kayrnt

I just wanted to +1 this issue, seeing that the PRs appear to have just stalled. My own use case are UDFs for generating UUIDs. Where this feature is needed to actually test generation in models is correct as we have our own namespaces in actual code.

{% macro uuid_functions() %}

    {% call set_sql_header(config) %}
-- helper to format a 32-hex string as a standard UUID
CREATE TEMP FUNCTION 
FORMAT_UUID(hex32 STRING)
RETURNS STRING AS (
  CONCAT(
    SUBSTR(hex32,  1,  8), '-',
    SUBSTR(hex32,  9,  4), '-',
    SUBSTR(hex32, 13,  4), '-',
    SUBSTR(hex32, 17,  4), '-',
    SUBSTR(hex32, 21, 12)
  )
);

-- Generate a UUID v5 from input string and a namespace BYTES
CREATE TEMP FUNCTION GENERATE_UUID_V5_BYTES(namespace_bytes BYTES, name STRING)
RETURNS STRING AS (
  FORMAT_UUID(
    -- take the first 16 bytes of SHA1(namespace || name)
    TO_HEX(
      -- apply RFC-4122 masks: clear version+variant bits, then set version=5 & variant=10xx
      (
        CAST(
          TO_HEX(LEFT(SHA1(CONCAT(namespace_bytes,
                              CAST(name AS BYTES FORMAT 'UTF8'))), 16))
        AS BYTES FORMAT 'HEX'
        )
        & CAST('ffffffffffff0fff3fffffffffffffff' AS BYTES FORMAT 'HEX')
      )
      | CAST('00000000000050008000000000000000' AS BYTES FORMAT 'HEX')
    )
  )
);

CREATE TEMP FUNCTION GENERATE_UUID_V5(namespace_alias STRING, name STRING)
RETURNS STRING AS (
  generate_uuid_v5_bytes(
    CASE LOWER(namespace_alias)
      WHEN 'dns'    THEN FROM_HEX(REPLACE('6ba7b810-9dad-11d1-80b4-00c04fd430c8', '-', ''))
      WHEN 'url'      THEN FROM_HEX(REPLACE('6ba7b811-9dad-11d1-80b4-00c04fd430c8', '-', ''))
      WHEN 'oid'     THEN FROM_HEX(REPLACE('6ba7b812-9dad-11d1-80b4-00c04fd430c8', '-', ''))
      WHEN 'x500'  THEN FROM_HEX(REPLACE('6ba7b814-9dad-11d1-80b4-00c04fd430c8', '-', ''))
      ELSE NULL
    END,
    name
  )
);
    {% endcall %}

{% endmacro %}

{% macro generate_uuid5(namespace, name) %}
    generate_uuid_v5('{{ namespace }}', {{ name }})
{% endmacro %}

MattOates avatar Jun 02 '25 15:06 MattOates