[CT-2998] [SPIKE] Handle unit testing of JSON and ARRAY data types
Description
We should enable adding unit test mock inputs and outputs that contain fields of type JSON or type ARRAY.
Example using the inline YAML dict format:
unit-tests:
  - name: test_json_unpack # this is the unique name of the test
    model: my_model # name of the model I'm unit testing
    given: # optional: list of inputs to provide as fixtures
      - input: ref('employees')
        rows:
          - {id: 1, empl_info: {"employee": {"name": "John", "age": 30, "city": "New York"}}}
          - {id: 2, empl_info: {"employee": {"name": "Doug", "age": 97, "city": "Flagstaff"}}}
      - input: ref('office_locations')
        rows:
          - {id: 1, offices: [1, 3]}
          - {id: 2, offices: [2, 4]}
    expect: # required: the expected output given the inputs above
      rows:
        - {employee_id: 1, name: John, age: 30, city: New York, offices: [1, 3]}
        - {employee_id: 2, name: Doug, age: 97, city: Flagstaff, offices: [2, 4]}
...
Example using csv format:
unit-tests:
  - name: test_json_unpack # this is the unique name of the test
    model: my_model # name of the model I'm unit testing
    given: # optional: list of inputs to provide as fixtures
      - input: ref('employees')
        format: csv
        rows: |
          id,empl_info
          1,{"employee":{"name":"John", "age":30, "city":"New York"}}
          2,{"employee":{"name":"Doug", "age":97, "city":"Flagstaff"}}
      - input: ref('office_locations')
        format: csv
        rows: |
          id,offices
          1,[1,3]
          2,[2,4]
    expect: # required: the expected output given the inputs above
      format: csv
      rows: |
        employee_id,name,age,city,offices
        1,John,30,New York,[1,3]
        2,Doug,97,Flagstaff,[2,4]
...
Note: we can assume we already know the data types and columns of the inputs.
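Since we already know the column data types, the conforming step for mock values could look something like this minimal Python sketch (conform_fixture_value is a hypothetical helper for illustration, not dbt's actual code): dict and list fixture values are serialized to JSON text so they can later be cast to the column's declared JSON/ARRAY type, while scalars pass through unchanged.

```python
import json

def conform_fixture_value(value):
    """Illustrative only: serialize dict/list fixture values to JSON text
    so they can be cast to the column's declared complex type downstream
    (e.g. via parse_json on Snowflake); scalars pass through unchanged."""
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value

# one row from the dict-format example above
row = {"id": 1, "empl_info": {"employee": {"name": "John", "age": 30, "city": "New York"}}}
conformed = {col: conform_fixture_value(val) for col, val in row.items()}
```

After this step, `conformed["empl_info"]` is a plain JSON string, which is the shape a safe_cast-style macro can feed into the warehouse's JSON-parsing function.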
Acceptance criteria
FOR SPIKE:
- Understand the technical complexities, edge cases, and how we would go about testing this
- Draft implementation
FOR IMPLEMENTATION:
- You can specify inputs or expected outputs for unit tests with columns of JSON data type (in at least one of the available formats, ideally all)
- You can specify inputs or expected outputs for unit tests with columns of ARRAY data type (in at least one of the available formats, ideally all)
- The error message for a failing unit test should still be readable when using JSON data types.
- The error message for a failing unit test should still be readable when using ARRAY data types.
- Bonus if it works for STRUCT too (lol, if not it's chill, we'll open another issue)
Impact to other teams
Impacts adapter teams
Will backports be required?
No
Context
From refinement:
- we don't allow complex data types for seeds
- we should enumerate all the "complex" types and all the tests that need to be added to the acceptance criteria
- the design could be similar to contracts: accept any value that can be cast to the user-provided data type
- we already have an array_construct macro
Example for testing constraints - https://github.com/dbt-labs/dbt-core/blob/main/tests/adapter/dbt/tests/adapter/constraints/test_constraints.py#L70-L73
Relevant to https://github.com/dbt-labs/dbt-core/issues/8499
Example for why someone might want to unit test a model that has inputs with JSON data type:
- JSON data type in snowflake for testing JSON unpacking logic
@aranke to document edge cases (will update this ticket)
Reason this is higher priority:
- you currently can't define a unit test on a model that depends on an input with one or more complex data types (even if that column isn't relevant to the unit test / isn't defined in the mock data)
This is in addition to not being able to use input data with complex data types (the initial reason we opened the issue).
Spike Report / Update
Current State
dbt-snowflake, dbt-bigquery, and dbt-spark support mocking inputs with complex types in unit testing, including JSON and ARRAY types. Tests here:
- dbt-snowflake: https://github.com/dbt-labs/dbt-snowflake/blob/main/tests/functional/adapter/unit_testing/test_unit_testing.py
- dbt-bigquery: https://github.com/dbt-labs/dbt-bigquery/blob/main/tests/functional/adapter/unit_testing/test_unit_testing.py
- dbt-spark: https://github.com/dbt-labs/dbt-spark/blob/main/tests/functional/adapter/unit_testing/test_unit_testing.py
dbt-postgres and dbt-redshift support JSON, but not ARRAY.
These implementations required only minor, targeted changes to safe_cast so that it handles more input value types and conforms them as appropriate before attempting to cast to the desired type.
However, dbt-postgres and dbt-redshift will require a different approach to support complex types. This is because the strategy for obtaining the column schemas in unit testing is adapter.get_columns_in_relation, which works flawlessly for the 3 adapters above but is lossy for dbt-postgres and dbt-redshift. For example, array types are reported simply as 'ARRAY' and don't include the element type, which is necessary for safe casting to an appropriate type.
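To make the lossiness concrete: in Postgres, information_schema.columns reports only 'ARRAY' in data_type for array columns, while the element type survives in udt_name (which Postgres prefixes with an underscore for array types). A minimal Python sketch of recovering a castable type from that metadata (sql_type_from_metadata is a hypothetical helper, not code from any adapter):

```python
def sql_type_from_metadata(data_type, udt_name):
    """Illustrative only: data_type alone says just 'ARRAY' for array
    columns; the element type must be recovered from udt_name, which
    Postgres prefixes with '_' for array types (e.g. '_int4')."""
    if data_type == "ARRAY" and udt_name.startswith("_"):
        return f"{udt_name[1:]}[]"  # '_int4' -> 'int4[]'
    return data_type
```

A schema strategy that only sees data_type (as get_columns_in_relation effectively does here) has nothing to recover the element type from, which is why these two adapters need a different mechanism.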
An alternative strategy we can use for these adapters is adapter.get_column_schema_from_query, which was implemented for model contracts and spiked for unit testing here. However, this strategy makes support for complex types for the other adapters unreasonably complex (e.g. bigquery struct type would need to be reconstructed into its 'sql' type from the obtained column schema).
So: we'll need to support both mechanisms for retrieving the column schema, selected per adapter implementation, because each strategy works for many adapters but not others.
Proposal
- At the dbt-adapters level, in the unit materialization: support both mechanisms for obtaining the column schema based on an adapter-dispatch macro unit_testing_column_schema_strategy, which defaults to relation.
- Implement unit_testing_column_schema_strategy to return empty_query in dbt-postgres and dbt-redshift
- new strategy uses: get_column_schema_from_query
- Additional safe_cast or type mapping modifications may be necessary in either adapter to get this entirely working based on the new column schema strategy
- Extend functional tests to cover complex types in unit testing in dbt-postgres and dbt-redshift. Existing tests are in place for primitive types:
- dbt-postgres: https://github.com/dbt-labs/dbt-postgres/blob/main/tests/functional/shared_tests/test_unit_testing.py
- dbt-redshift: https://github.com/dbt-labs/dbt-redshift/blob/main/tests/functional/adapter/unit_testing/test_unit_testing.py
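The dispatch behavior in the proposal above can be sketched as follows (a Python stand-in for the proposed adapter-dispatch macro; the strategy names come from the proposal, but the code itself is illustrative, not an implementation):

```python
# Per-adapter overrides, mirroring the proposed unit_testing_column_schema_strategy
# macro; adapters not listed fall back to the default 'relation' strategy
# (adapter.get_columns_in_relation).
COLUMN_SCHEMA_STRATEGY_OVERRIDES = {
    "postgres": "empty_query",  # use adapter.get_column_schema_from_query
    "redshift": "empty_query",
}

def column_schema_strategy(adapter_name):
    """Return which column-schema mechanism the unit materialization
    should use for the given adapter."""
    return COLUMN_SCHEMA_STRATEGY_OVERRIDES.get(adapter_name, "relation")
```

This keeps the relation-based strategy as the default (so dbt-snowflake, dbt-bigquery, and dbt-spark are untouched) while letting dbt-postgres and dbt-redshift opt into the query-based strategy.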
New Issues
- dbt-adapters: https://github.com/dbt-labs/dbt-adapters/issues/113
- dbt-postgres: https://github.com/dbt-labs/dbt-postgres/issues/29
- dbt-redshift: https://github.com/dbt-labs/dbt-redshift/issues/726