
[CT-2998] [SPIKE] Handle unit testing of JSON and ARRAY data types


Description

We should enable adding unit test mock inputs and outputs that contain fields of type JSON or type ARRAY.

Example using in-line yaml dict format:

unit-tests:
  - name: test_json_unpack # this is the unique name of the test
    model: my_model # name of the model I'm unit testing
    given: # optional: list of inputs to provide as fixtures
      - input: ref('employees')
        rows:
         - {id: 1, empl_info: {"employee":{"name":"John", "age":30, "city":"New York"}}}
         - {id: 2, empl_info: {"employee":{"name":"Doug", "age":97, "city":"Flagstaff"}}}
      - input: ref('office_locations')
        rows:
         - {id: 1, offices: [1,3]}
         - {id: 2, offices: [2,4]}
    expect: # required: the expected output given the inputs above
      rows:
      - {employee_id: 1, name: John, age: 30, city: New York, offices: [1,3]}
      - {employee_id: 2, name: Doug, age: 97, city: Flagstaff, offices: [2,4]}
    ...

Example using csv format:

unit-tests:
  - name: test_json_unpack # this is the unique name of the test
    model: my_model # name of the model I'm unit testing
    given: # optional: list of inputs to provide as fixtures
      - input: ref('employees')
        format: csv
        rows: |
          id,empl_info
          1,"{""employee"":{""name"":""John"", ""age"":30, ""city"":""New York""}}"
          2,"{""employee"":{""name"":""Doug"", ""age"":97, ""city"":""Flagstaff""}}"
      - input: ref('office_locations')
        format: csv
        rows: |
          id,offices
          1,"[1,3]"
          2,"[2,4]"
    expect: # required: the expected output given the inputs above
      format: csv
      rows: |
        employee_id,name,age,city,offices
        1,John,30,New York,"[1,3]"
        2,Doug,97,Flagstaff,"[2,4]"
    ...

Note: we can assume we already know the data types and columns of the inputs.
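For context, a minimal sketch of what one mocked employees row could compile to, assuming an adapter where the cross-database safe_cast macro falls back to a plain cast (the 'json' type name is a stand-in for whatever type the adapter reports for empl_info):

-- hypothetical compiled SQL for a single mocked row
select
  cast(1 as integer) as id,
  cast('{"employee":{"name":"John", "age":30, "city":"New York"}}' as json) as empl_info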

Acceptance criteria

FOR SPIKE:

  • Understand the technical complexities, edge cases, and how we would go about testing this
  • Draft implementation

FOR IMPLEMENTATION:

  • You can specify inputs or expected outputs for unit tests with columns of JSON data type (in at least one of the available formats, ideally all)
  • You can specify inputs or expected outputs for unit tests with columns of ARRAY data type (in at least one of the available formats, ideally all)
  • The error message for when a unit test fails should still be readable when using JSON data types.
  • The error message for when a unit test fails should still be readable when using ARRAY data types.
  • Bonus if it works for STRUCT too (lol, if not it's chill, we'll open another issue)

Impact to other teams

Impacts adapter teams

Will backports be required?

No

Context

gshank opened this issue Aug 16 '23

From refinement:

  • we don't allow complex data types for seeds
  • we should enumerate what all the "complex" types are and what all the tests are that need to be added in acceptance criteria
  • design could be similar to contracts: accept any value the user provides, so long as it can be cast to the user-specified data type
  • we already have an array_construct macro (sketched below)
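For illustration, a minimal sketch of how that macro could render a mocked array value. The call uses the existing cross-database dbt.array_construct macro; the compiled output shown is one possible rendering, not a guaranteed one:

-- Jinja source for a hypothetical fixture row
select 1 as id, {{ dbt.array_construct([1, 3], 'integer') }} as offices

-- possible compiled SQL on a Postgres-style adapter
select 1 as id, array[1, 3] as offices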

Example for testing constraints - https://github.com/dbt-labs/dbt-core/blob/main/tests/adapter/dbt/tests/adapter/constraints/test_constraints.py#L70-L73

graciegoheen commented Sep 18 '23

Relevant to https://github.com/dbt-labs/dbt-core/issues/8499

graciegoheen commented Oct 02 '23

Example for why someone might want to unit test a model that has inputs with JSON data type:

  • JSON data type in Snowflake, for testing JSON-unpacking logic

graciegoheen commented Oct 06 '23

@aranke to document edge cases (will update this ticket)

aranke commented Oct 31 '23

Reason this is higher priority:

  • you can't currently define a unit test on a model that depends on an input with one or more complex data types (even if that column isn't relevant to the unit test and isn't defined in the mock data)

This is in addition to not being able to use input data with complex data types (the initial reason we opened this issue).

graciegoheen commented Jan 22 '24

Spike Report / Update

Current State

dbt-snowflake, dbt-bigquery, and dbt-spark support mocking inputs with complex types in unit testing, including JSON and ARRAY types. Tests here:

  • dbt-snowflake: https://github.com/dbt-labs/dbt-snowflake/blob/main/tests/functional/adapter/unit_testing/test_unit_testing.py
  • dbt-bigquery: https://github.com/dbt-labs/dbt-bigquery/blob/main/tests/functional/adapter/unit_testing/test_unit_testing.py
  • dbt-spark: https://github.com/dbt-labs/dbt-spark/blob/main/tests/functional/adapter/unit_testing/test_unit_testing.py

dbt-postgres and dbt-redshift support JSON, but not ARRAY.

These implementations largely required minor, targeted changes to safe_cast so that it handles more input value types, conforming them as appropriate before casting to the desired type.
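As a purely illustrative sketch (not the shipped adapter code): the general shape of such an override, using a hypothetical my_adapter__ prefix and assuming the target warehouse has a parse_json function:

{# illustrative only: conform string literals to a complex type before casting #}
{% macro my_adapter__safe_cast(field, type) %}
  {%- if type is string and type.lower() in ['json', 'variant', 'object', 'array'] -%}
    cast(parse_json({{ field }}) as {{ type }})
  {%- else -%}
    cast({{ field }} as {{ type }})
  {%- endif -%}
{% endmacro %}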

However, dbt-postgres and dbt-redshift will require a different approach to support complex types, because the strategy for obtaining column schemas in unit testing is adapter.get_columns_in_relation. That works well for the three adapters above, but is lossy for dbt-postgres and dbt-redshift: array types come back simply as 'ARRAY', without the element type that safe casting needs.
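To see the lossiness concretely (dbt-postgres reads column metadata from information_schema, which erases the element type):

-- Postgres reports array columns as just 'ARRAY'
create table office_locations (id integer, offices integer[]);

select column_name, data_type
from information_schema.columns
where table_name = 'office_locations';
--  column_name | data_type
--  id          | integer
--  offices     | ARRAY      <- the element type (integer) is gone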

An alternative strategy for these adapters is adapter.get_column_schema_from_query, which was implemented for model contracts and spiked for unit testing here. However, that strategy would make complex-type support on the other adapters unreasonably complex (e.g. BigQuery's struct type would need to be reconstructed into its 'sql' type from the obtained column schema).
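For reference, a hedged sketch of how that strategy could be invoked from Jinja; the empty-query shape shown is an assumption, and get_column_schema_from_query takes a SQL string and returns column objects:

{# hypothetical: derive the input's column schema from an empty query
   rather than from relation metadata #}
{% set schema_sql = "select * from " ~ ref('employees') ~ " where false limit 0" %}
{% set columns = adapter.get_column_schema_from_query(schema_sql) %}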

So we'll need to support both mechanisms for retrieving the column schema, selected per adapter, since each strategy works for some adapters and not others.

Proposal

  1. At the dbt-adapters level, in the unit materialization: support both mechanisms for obtaining the column schema, selected via an adapter-dispatched macro unit_testing_column_schema_strategy that defaults to relation (see the sketch after this list).
  2. Implement unit_testing_column_schema_strategy to return empty_query in dbt-postgres and dbt-redshift.
     • the new strategy uses get_column_schema_from_query
     • additional safe_cast or type-mapping modifications may be necessary in either adapter to get this working end-to-end with the new column schema strategy
  3. Extend functional tests to cover complex types in unit testing in dbt-postgres and dbt-redshift. Existing tests are in place for primitive types:
     • dbt-postgres: https://github.com/dbt-labs/dbt-postgres/blob/main/tests/functional/shared_tests/test_unit_testing.py
     • dbt-redshift: https://github.com/dbt-labs/dbt-redshift/blob/main/tests/functional/adapter/unit_testing/test_unit_testing.py
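A minimal sketch of the proposed dispatch macro. The macro name comes from this proposal, while the string return-value convention ('relation' / 'empty_query') and the override shape are assumptions:

{% macro unit_testing_column_schema_strategy() %}
  {{ return(adapter.dispatch('unit_testing_column_schema_strategy', 'dbt')()) }}
{% endmacro %}

{% macro default__unit_testing_column_schema_strategy() %}
  {# default: read column types via adapter.get_columns_in_relation #}
  {{ return('relation') }}
{% endmacro %}

{# dbt-postgres and dbt-redshift would override the default: #}
{% macro postgres__unit_testing_column_schema_strategy() %}
  {# use adapter.get_column_schema_from_query instead #}
  {{ return('empty_query') }}
{% endmacro %}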

New Issues

  • dbt-adapters: https://github.com/dbt-labs/dbt-adapters/issues/113
  • dbt-postgres: https://github.com/dbt-labs/dbt-postgres/issues/29
  • dbt-redshift: https://github.com/dbt-labs/dbt-redshift/issues/726

MichelleArk commented Feb 28 '24