
Unexpected behaviour of INCREMENTAL_BY_UNIQUE_KEY downstream of a FULL model

Open · florianm opened this issue 4 months ago · 1 comment

Scenario

I want to import data, run some QA on it, and repeat the import until I accept the QA results. My data refresh frequency is quarterly, so I have plenty of time to review and QA. My plan is to run sqlmesh plan dev to preview the fresh data across all my model layers, then run sqlmesh plan to accept the entire batch of fresh data. I want to retain every accepted version of my data so I can offer downstream analyses time travel to data far in the past.

I would prefer to propagate only accepted versions of the data, as storing the bulk of rejected versions is costly.

Config:

  • DuckDB/DuckLake, with DuckLake metadata and SQLMesh state in a Postgres DB.

Model 1:

  • A FULL model reading a CSV.
  • The CSV has a compound PK across two columns (['dataset_name', 'enc_root']).
  • The signal returns TRUE if the CSV was modified after MAX(imported_on_ts) in the model data (when any data is present); otherwise, if it was modified after the last possible cron run.
MODEL (
  name peoplewa.bronze.lnk_linkage_map,
  kind FULL,
  grains ['dataset_name', 'enc_root'],
  cron '*/5 * * * *', -- will be daily in PROD
  signals [
    EXT_FILE_UPDATED(
        file_path := 'plans/bronze/seeds/lnk/linkage_map.csv',
        cron_str := '*/5 * * * *',
        model_name := peoplewa.bronze.lnk_linkage_map
    )
  ]
);

SELECT
  *,
  @execution_ts::TIMESTAMPTZ AS imported_on_ts
FROM READ_CSV('plans/bronze/seeds/lnk/linkage_map.csv', delim = '\t', header = TRUE)
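
For context, my understanding of FULL semantics here, sketched with hypothetical timestamps: each run replaces the entire bronze table, so all rows carry the latest run's @execution_ts.

-- Sketch of Model 1's state across two runs (hypothetical timestamps):
--   after run 1: all CSV rows have imported_on_ts = '2025-08-19 04:00:00+00'
--   after run 2: table fully replaced; all rows now have
--                imported_on_ts = '2025-08-19 04:05:00+00'
SELECT DISTINCT imported_on_ts
FROM peoplewa.bronze.lnk_linkage_map;  -- always exactly one value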

Model 2:

  • A second model that appends records from Model 1.
  • Since the unique key includes imported_on_ts, each run's key combinations always differ from those of existing records.
/* A SQL model to promote data from bronze to silver. */
MODEL (
  name peoplewa.silver.lnk_linkage_map,
  kind INCREMENTAL_BY_UNIQUE_KEY (
    unique_key ('imported_on_ts', 'dataset_name', 'enc_root')
  ),
  cron '*/5 * * * *',
  grains ['imported_on_ts', 'dataset_name', 'enc_root']
);

SELECT
  *
FROM peoplewa.bronze.lnk_linkage_map
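
My understanding of what INCREMENTAL_BY_UNIQUE_KEY should do with this query, as a conceptual sketch (not the exact SQL SQLMesh generates; the update and insert lists are abbreviated to the known columns):

-- Conceptual upsert for Model 2 (illustrative only):
MERGE INTO peoplewa.silver.lnk_linkage_map AS tgt
USING (SELECT * FROM peoplewa.bronze.lnk_linkage_map) AS src
  ON  tgt.imported_on_ts = src.imported_on_ts
  AND tgt.dataset_name   = src.dataset_name
  AND tgt.enc_root       = src.enc_root
WHEN MATCHED THEN                         -- keys match: refresh the row in place
  UPDATE SET tgt.enc_root = src.enc_root  -- placeholder for the non-key columns
WHEN NOT MATCHED THEN                     -- new key: append the row
  INSERT (imported_on_ts, dataset_name, enc_root)
  VALUES (src.imported_on_ts, src.dataset_name, src.enc_root);
-- There is no WHEN NOT MATCHED BY SOURCE ... DELETE clause, so target rows
-- absent from bronze should remain untouched.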

Expected behaviour

I want Model 2 to append new records from Model 1. I want Model 2 to never delete records, even if they are not contained in Model 1.

Observed behaviour

Model 2 drops any rows not present in Model 1. As a result, Model 1's rows virtually replace rows with past imported_on_ts timestamps.
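
A quick way to see this (a diagnostic sketch against my model names):

-- Count surviving rows per import batch in silver.
-- Expected: one group per accepted run, accumulating over time.
-- Observed: only the most recent batch remains.
SELECT imported_on_ts, COUNT(*) AS n_rows
FROM peoplewa.silver.lnk_linkage_map
GROUP BY imported_on_ts
ORDER BY imported_on_ts;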

Is this expected behaviour? If so, a mention in the docs that rows removed upstream are deleted would be very handy.

Could this be a bug? If so, I'll reproduce it in an example repo.

Possible problems

I observe correct behaviour from INCREMENTAL_BY_UNIQUE_KEY models when they are downstream of models with a standard SQL query. My unexpected behaviour might therefore be particular to any of:

  • The FULL model reads from a CSV.
  • The macro @execution_ts is part of the unique key.
  • In my example, I read the same CSV again and again, and only @execution_ts changes between runs.

This feels like an execution-order conflict, where SQLMesh serialisation may not detect that a changed macro variable means "new data, please append as new records" rather than "existing data, please update existing records".
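
To illustrate with hypothetical values: the same CSV row read in two runs yields two distinct keys, so I would expect an append rather than a replacement.

-- Two runs five minutes apart, same CSV row (hypothetical values):
SELECT * FROM (VALUES
  (TIMESTAMPTZ '2025-08-19 04:00:00+00', 'health', 'abc123'),  -- run 1 key
  (TIMESTAMPTZ '2025-08-19 04:05:00+00', 'health', 'abc123')   -- run 2 key
) AS t(imported_on_ts, dataset_name, enc_root);
-- The rows share (dataset_name, enc_root) but differ on imported_on_ts,
-- so both should coexist in silver after the second run.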

florianm · Aug 19 '25

I have not (attempted to) reproduce this issue, but from my understanding of sqlmesh, this is not expected behavior. The docs for INCREMENTAL_BY_UNIQUE_KEY say:

If a key is present in the model table but not present in the newly loaded data, its row is not modified and remains in the model table

michiel-de-muynck · Aug 27 '25