dbt-core icon indicating copy to clipboard operation
dbt-core copied to clipboard

[CT-3033] [Spike] Explore support multiple unit test materialization strategies: CTE vs 'seed'-based

Open MichelleArk opened this issue 2 years ago • 1 comments

From the discussion thread: https://github.com/dbt-labs/dbt-core/discussions/8275#discussioncomment-6826176

There are two main high-level implementation approaches for unit testing in dbt:

  1. inlining fixtures/expected ouputs as CTE in a single unit test query (to produce the 'actual' result) or,
  2. persisting all input fixtures + expected given outputs, and querying the result of the model SQL run against the persisted input fixtures. Once the unit test finishes, cleanup any persisted fixtures from the warehouse.

I think both are technically feasible and would actually have pretty similar implementations under the hood: either using a materialization that leverages existing ephemeral logic for the 'CTE trickery' route, or actually materializing inputs and the 'actual' test model in the warehouse using the existing seed materialization.

Tradeoffs: Actually materializing the input/actual datasets is a more accurate representation of how the models are run in production in comparison to the CTE-based approach, and would support a larger set of SQL/dbt functionality than CTEs. For example, syntax that is used sql_headers that may not be valid in a standalone query, or certain types in that can be inserted but not actually declared in a standalone query (https://github.com/dbt-labs/dbt-project-evaluator/issues/290). Do any other limitations come to mind? The tradeoff being performance: actually materializing fixtures/expected/actual in the warehouse, querying them to obtain a diff, and deleting them reliably at the end of the test run all add up to additional latency.

Next steps So far we've started with the CTE approach, mostly for sake of simplicity, but I do believe it'd be very worthwhile to spike the seed-based approach and quantify more precisely how much slower/complex that approach would be. @gshank also suggested exploring implementing both strategies and either selecting the strategy based on user configuration or the presence of certain conditions (e.g. a sql_header, or particular type on the model being tested). I think a non-CTE stategy would also be necessary to test complex or custom materializations end-to-end (https://github.com/dbt-labs/dbt-core/discussions/8275#discussioncomment-6826145) .

Let's implement the seed-based strategy in a spike to understand:

  1. its technical complexity
  2. whether/how it deviates from the CTE-based approach
  3. performance implications in relation to the CTE-based approach (with an remote adapter rather than dbt-postgres)

Ultimately let's use those learnings to recommend whether we should implement unit tests with:

  1. just the CTE-based approach (as it is currently)
  2. just the seed-based approach
  3. a combination of both - with a top-level strategy pattern that toggles between the two under appropriate conditions or user configuration.

MichelleArk avatar Aug 25 '23 19:08 MichelleArk

this may roll to the next sprint

martynydbt avatar Nov 01 '23 15:11 martynydbt

Another limitation of the CTE approach is mentioned in https://github.com/dbt-labs/dbt-redshift/issues/807, so it does not support the rather common aggregation functions. I would vote for the alternate solution.

Auric-Manteo avatar Jul 08 '24 06:07 Auric-Manteo