sqlmesh icon indicating copy to clipboard operation
sqlmesh copied to clipboard

SCDType 2 tests for value_from and value_to

Open siddharth-sc1903 opened this issue 2 months ago • 3 comments

Title: Testing SCD Type 2 model fails when valid_from and valid_to are included in expected outputs

Bug Description

When writing a SQLMesh test for a model with kind: SCD_TYPE_2_BY_TIME, the test fails if the valid_from and valid_to columns are included in the outputs section. The test passes as expected if these two columns are removed from the assertion.

This makes it difficult to write tests that explicitly validate the SCD logic, as we cannot assert the expected values for the columns that the SCD model itself is supposed to generate. The issue seems to occur even when the expected values in the test should be correct for a simple new record insertion (where valid_from should equal the updated_at column and valid_to should be NULL).

Steps to Reproduce

  1. Create a simple SCD Type 2 model:

    -- file: models/product_scd.sql
    MODEL (
      name my_models.product_scd,
      kind SCD_TYPE_2_BY_TIME (
        unique_key product_id,
        updated_at_name effective_ts
      ),
      start '2024-01-01',
      cron '@daily',
      columns (
        product_id STRING,
        name STRING,
        price DECIMAL(18, 2),
        effective_ts TIMESTAMP,
        valid_from TIMESTAMP,
        valid_to TIMESTAMP
      )
    );
    
    SELECT
      product_id,
      name,
      price,
      effective_ts
    FROM
      @this_model_sources
    
  2. Create a test file for the model:

    # file: tests/test_product_scd.yaml
    test_new_product_insertion:
      model: my_models.product_scd
      inputs:
        my_models.product_scd__source:
          - product_id: "p1"
            name: "Laptop"
            price: 1200.00
            effective_ts: "2024-05-01 10:00:00"
    
      # THIS TEST FAILS
      outputs:
        query: |
          SELECT
            'p1' AS product_id,
            'Laptop' AS name,
            CAST(1200.00 AS DECIMAL(18, 2)) AS price,
            CAST('2024-05-01 10:00:00' AS TIMESTAMP) AS effective_ts,
            -- The test fails when the following two lines are included
            CAST('2024-05-01 10:00:00' AS TIMESTAMP) AS valid_from,
            CAST(NULL AS TIMESTAMP) AS valid_to
    
    # If you remove valid_from and valid_to from the outputs query,
    # the test passes without issue. Like this:
    #
    # outputs:
    #   query: |
    #     SELECT
    #       'p1' AS product_id,
    #       'Laptop' AS name,
    #       CAST(1200.00 AS DECIMAL(18, 2)) AS price,
    #       CAST('2024-05-01 10:00:00' AS TIMESTAMP) AS effective_ts
    
  3. Run the test:

    sqlmesh test tests/test_product_scd.yaml
    

Expected Behavior

The test test_new_product_insertion should pass. For a new record being inserted into an SCD model, the valid_from timestamp should be populated from the effective_ts column, and the valid_to timestamp should be NULL. The test's outputs section correctly reflects this expected state.

Actual Behavior

The test fails with a diff, indicating that the actual values for valid_from and/or valid_to generated by the model run do not match the expected values defined in the test case.

Environment

  • SQLMesh version: 0.224.0
  • Dialect: Databricks

siddharth-sc1903 avatar Oct 20 '25 19:10 siddharth-sc1903

I suspect that this happens because of the column name overlap with the default valid_from and valid_to column names configured as part of the SCD2 model kind. These columns are managed by SQLMesh and are not considered to be a part of the model.

Have you tried using different names or setting the valid_from_name and valid_to_name to different values inside the SCD_TYPE_2_BY_TIME kind?

izeigerman avatar Oct 23 '25 16:10 izeigerman

@izeigerman yep i did try setting a difffernt valid_from . it didnt seem to like it and was complining of missing/unexpected rows

siddharth-sc1903 avatar Oct 23 '25 16:10 siddharth-sc1903

-- A simple SCD Type 2 model for testing purposes.
MODEL (
  name dev.scd_example,
  kind SCD_TYPE_2_BY_TIME (
    unique_key item_id,
    batch_size 1,
    valid_from_name my_valid_from,
    valid_to_name my_valid_to
  ),
  start '2025-01-01',
  cron '@daily'
);

-- The source is configured to be the test's input directly.
-- Filter by execution interval so each batch only sees changes for that day
SELECT
  item_id,
  status,
  updated_at
FROM dev.scd_example_source
WHERE updated_at >= @start_dt 
  AND updated_at < @end_dt;

test_scd_example:
  model: '"db"."dev"."scd_example"'
  vars:
    start: 2025-01-01
    end: 2025-01-10
  inputs:
    '"db"."dev"."scd_example_source"':
        format: csv
        path: seeds/scd_source_data.csv
  outputs:
    query:
      query: |
        -- Row 0: Item 1, active (initial state, closed)
        SELECT '1' AS item_id, 'active' AS status, CAST('2025-01-01 00:00:00' AS TIMESTAMP) AS updated_at, CAST('1970-01-01 00:00:00' AS TIMESTAMP) AS my_valid_from, CAST('2025-01-07 00:00:00' AS TIMESTAMP) AS my_valid_to
        UNION ALL
        -- Row 1: Item 1, active (current state, open)
        SELECT '1' AS item_id, 'active' AS status, CAST('2025-01-08 00:00:00' AS TIMESTAMP) AS updated_at, CAST('2025-01-08 00:00:00' AS TIMESTAMP) AS my_valid_from, NULL AS my_valid_to
        UNION ALL
        -- Row 2: Item 1, inactive (changed state, closed)
        SELECT '1' AS item_id, 'inactive' AS status, CAST('2025-01-07 00:00:00' AS TIMESTAMP) AS updated_at, CAST('2025-01-07 00:00:00' AS TIMESTAMP) AS my_valid_from, CAST('2025-01-08 00:00:00' AS TIMESTAMP) AS my_valid_to
        UNION ALL
        -- Row 3: Item 2, active (initial state, closed)
        SELECT '2' AS item_id, 'active' AS status, CAST('2025-01-02 00:00:00' AS TIMESTAMP) AS updated_at, CAST('1970-01-01 00:00:00' AS TIMESTAMP) AS my_valid_from, CAST('2025-01-05 00:00:00' AS TIMESTAMP) AS my_valid_to
        UNION ALL
        -- Row 4: Item 2, inactive (current state, open)
        SELECT '2' AS item_id, 'inactive' AS status, CAST('2025-01-05 00:00:00' AS TIMESTAMP) AS updated_at, CAST('2025-01-05 00:00:00' AS TIMESTAMP) AS my_valid_from, NULL AS my_valid_to
        UNION ALL
        -- Row 5: Item 3, active (initial state, closed)
        SELECT '3' AS item_id, 'active' AS status, CAST('2025-01-03 00:00:00' AS TIMESTAMP) AS updated_at, CAST('1970-01-01 00:00:00' AS TIMESTAMP) AS my_valid_from, CAST('2025-01-06 00:00:00' AS TIMESTAMP) AS my_valid_to
        UNION ALL
        -- Row 6: Item 3, inactive (current state, open)
        SELECT '3' AS item_id, 'inactive' AS status, CAST('2025-01-06 00:00:00' AS TIMESTAMP) AS updated_at, CAST('2025-01-06 00:00:00' AS TIMESTAMP) AS my_valid_from, NULL AS my_valid_to
Image

siddharth-sc1903 avatar Oct 23 '25 16:10 siddharth-sc1903