
Skip Materialization When Parent Data Versions Have Not Changed

Open clintonmonk opened this issue 1 year ago • 3 comments

What's the use case?

Our data pipeline ingests data from multiple data sources. Each data source has its own DAG of assets. The first assets ("raw") represent the data from the data source. They are materialized on a schedule chosen according to how often the data source updates. Downstream assets in the DAG ("cleaned", etc.) are auto-materialized when their parent assets are materialized. We define data versions and code versions for each of the assets.

We have found that, in these DAGs, we only want to auto-materialize when there is a data change (the data version of the parents has changed), not when there is a code change (the code version of the parents or current asset has changed). This is because we share code libraries in the downstream assets (e.g. a shared cleaning library). We manually materialize assets when a code change requires re-materialization. Otherwise, we wait for the next scheduled run.

We currently support this behavior inside the op (specifically, inside the @multi_asset) by using the OpExecutionContext to check whether the parents' data versions have changed since the last time the assets were materialized. If they have not changed, the op skips the materialization. The downside to this approach is that the op still runs, which affects the materialization history of the assets. It would be nice if we could configure the auto-materialization policy so that the op would not run at all.
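To illustrate the check described above, here is a minimal sketch of the comparison only. It deliberately does not call any Dagster API: `parents_changed`, the asset names, and the version strings are all hypothetical, and how the current and previously seen data versions are obtained from the OpExecutionContext is left out.

```python
from typing import Mapping, Optional


def parents_changed(
    current: Mapping[str, str],
    last_seen: Optional[Mapping[str, str]],
) -> bool:
    """Return True if any parent's data version differs from what was
    recorded when the asset was last materialized.

    Both mappings go from parent asset name to data version string.
    `last_seen` is None when the asset has never been materialized.
    """
    if last_seen is None:
        # No previous materialization: there is nothing to compare against,
        # so treat the parents as changed.
        return True
    # Dict inequality also catches parents that were added or removed.
    return dict(current) != dict(last_seen)


# Hypothetical versions recorded at the last materialization.
last_seen = {"raw_orders": "v1", "raw_customers": "v7"}

# Same versions as before: the op would skip.
skip = not parents_changed({"raw_orders": "v1", "raw_customers": "v7"}, last_seen)

# One parent has a new data version: the op would materialize.
run = parents_changed({"raw_orders": "v2", "raw_customers": "v7"}, last_seen)
```

Even with this check in place, the op has to start before it can decide to skip, which is the limitation this issue is about.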

Ideas of implementation

This could be supported with a new or updated AutoMaterializeRule.

Ideas:

  1. A new AutoMaterializeRule.skip_on_no_parent_data_update (name TBD). This rule would skip if none of the parents have new data versions since the last time the asset was materialized. Users could add this to their eager() policy.
  2. A new AutoMaterializeRule.materialize_on_parent_data_updated (name TBD). This rule would be nearly identical to the existing AutoMaterializeRule.materialize_on_parent_updated except that it would only materialize if the parents' data versions were updated. Users would replace materialize_on_parent_updated with this rule in their eager() policy.
  3. Update existing AutoMaterializeRule.materialize_on_parent_updated to accept an optional parameter to work with this use case. For example, AutoMaterializeRule.materialize_on_parent_updated(ignore_code_version_updates=True) would ignore code version updates so that materialization would only happen for data version updates. Users would replace materialize_on_parent_updated with this rule in their eager() policy.
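As a rough model of the semantics proposed in ideas 2 and 3, the sketch below shows the decision in plain Python. It is not Dagster code: `ParentUpdate` and `should_materialize` are hypothetical names, and the real rule evaluation in Dagster takes different inputs. Only the `ignore_code_version_updates` parameter name comes from idea 3 above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ParentUpdate:
    """Hypothetical summary of what changed on one parent since the
    downstream asset was last materialized."""

    asset_key: str
    data_version_changed: bool
    code_version_changed: bool


def should_materialize(
    updates: Iterable[ParentUpdate],
    ignore_code_version_updates: bool = False,
) -> bool:
    """Decide whether the downstream asset should be materialized.

    With ignore_code_version_updates=False, either kind of change triggers
    a materialization. With True, only data version changes do, which is
    the behavior requested in this issue.
    """
    for update in updates:
        if update.data_version_changed:
            return True
        if update.code_version_changed and not ignore_code_version_updates:
            return True
    return False


# A parent whose shared cleaning library changed but whose data did not.
code_only = [ParentUpdate("raw_orders", data_version_changed=False, code_version_changed=True)]

default_decision = should_materialize(code_only)
data_only_decision = should_materialize(code_only, ignore_code_version_updates=True)
```

In this model, idea 2 corresponds to always passing `ignore_code_version_updates=True`, and idea 1 is the negation of the same condition expressed as a skip rule.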

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

clintonmonk, Sep 28 '23 13:09