dagster
dagster copied to clipboard
AutoMaterializePolicy Skip If Already Running
What's the use case?
We have non-partitioned assets that currently utilize an incremental practice - this means that sometimes the job runs for quite a while, (especially if historical data is being loaded in).
This means that the auto materialization policy might start multiple jobs (we are currently backfilling huge quantities so it's running for quite a while), and suddenly our 10 job slots are filled by the same slot.
Similarly, a non-partitioned job doesn't make sense to run when it's already running. However, it doesn't seem to be possible to configure the MaterializationPolicy for this specific thing.
Alternatively, I also wanted to use max_concurrent_runs to simply specify that this asset cannot run more than one at a time, however, it's currently not possible to set concurrency limits on an asset specifically (only on jobs or similar.. sadly).
I would argue, however, that it'd probably almost always be the case that one doesn't want to have the asset auto-materialize the same partition (or asset, if not partitioned) if an auto-materialization is already running on that specific asset.
Ideas of implementation
I assume this is non-trivial, as one has to look into the state of runs; nevertheless, there is the ability to skip runs if a backfill is running; I assume there might be some synergy between the two.. maybe?
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
Would love for this to be prioritized, or a temporary fix proposed. It sorta results in bugs/annoyance with duplicate runs and wasted resources.
@C0DK looks like the skip_on_run_in_progress rule is available in the latest release
I want to reopne this, as all our assets in question are partitioned.
if context.partitions_def is not None:
raise DagsterInvariantViolationError(
"SkipOnRunInProgressRule is currently only support for non-partitioned assets."
)
however the feature is nifty never the less
Just to add; our eager auto materialization started the same partition a lot of times, as we have a partitioned asset that relies on a non-partitioned asset, that updates rather quickly. We can filter that away as 'dependencies' or something, but we want them to update; just not instantly.
Just a clear example. The last N materializations are on the same partition, because our queue is full, and therefore the partition isn't updated before the next tick, where the materialization is still missing, however multiple runs have been requested.
I was thinking whether one could simply remove the raise in the code? wouldn't that basically just mean that one cannot auto materialize ANY partition while ANY partition is in progress. i know it's not ideal, but it's 'better', and will just mean a delay of any subsequent partition updates.
To make it very explicit one could require the usage of an all_partitions=True argument. I've created a (potential) solution here: https://github.com/dagster-io/dagster/pull/21415 - would love feedback.