Intermittent schedule duplication causing double flow runs
Bug summary
Description
We are experiencing intermittent schedule duplication issues where deployments randomly create duplicate scheduled runs with identical expected start times.
Observed Behavior
- Scheduled runs are duplicated with the same expected_start_time
- Issue occurs sporadically across different deployments
- Multiple flow runs are triggered for the same schedule time
- Problem appears to resolve temporarily when schedules are toggled (deactivated/reactivated)
Version info
Prefect Server: v3.4.15
Prefect Workers: v3.4.15
Kubernetes deployment using Helm charts
Additional context
Suspected Triggers
- System actions or maintenance operations on Kubernetes
- Flow redeployment processes
- Potential race conditions during schedule updates
Current Workaround
We've implemented a cleanup flow that detects duplicate scheduled runs and fixes them by toggling the affected deployment's schedules off and on:
# Imports and the flow wrapper are added here for completeness (the name is
# illustrative); get_run_logger() needs a flow run context.
from prefect import flow, get_client, get_run_logger
from prefect.client.schemas.filters import DeploymentFilter, DeploymentFilterTags


@flow
async def fix_duplicate_schedules():
    logger = get_run_logger()
    async with get_client() as client:
        deployments = await client.read_deployments(
            deployment_filter=DeploymentFilter(tags=DeploymentFilterTags(all_=["prd"]))
        )
        for d in deployments:
            logger.info(f"Checking deployment: {d.name} deployment id: {d.id} flow id: {d.flow_id}")
            scheduled_runs = await client.get_scheduled_flow_runs_for_deployments([d.id])
            scheduled_run_times = []
            has_double_schedule = False
            for run in scheduled_runs:
                if run.expected_start_time in scheduled_run_times:
                    has_double_schedule = True
                    logger.info(f"Double schedule detected for {d.name} at {run.expected_start_time}")
                    break
                scheduled_run_times.append(run.expected_start_time)
            if has_double_schedule:
                # Toggle each of the deployment's schedules off and back on
                schedules = await client.read_deployment_schedules(d.id)
                for schedule in schedules:
                    logger.info(f"Restarting schedule: {schedule.id}")
                    await client.update_deployment_schedule(d.id, schedule.id, active=False)
                    await client.update_deployment_schedule(d.id, schedule.id, active=True)
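The cleanup can be run ad hoc (or put on its own schedule); a minimal entry point, assuming the illustrative flow name from the sketch above:

import asyncio

if __name__ == "__main__":
    asyncio.run(fix_duplicate_schedules())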
Impact
- Unnecessary resource consumption from duplicate flow runs
- Potential data processing inconsistencies
- Manual intervention required to clean up duplicates
- Operational overhead from monitoring and cleanup processes
Questions
- Root Cause Analysis: What could be causing schedule duplication in Prefect 3.4.15?
- Flow Redeployment: Are there known issues with schedule duplication during deployment updates?
- Database Consistency: Could this be related to PostgreSQL transaction handling or race conditions?
- Prevention: Are there configuration settings or best practices to prevent schedule duplication?
- Detection: Is there a built-in mechanism to detect and prevent duplicate schedules?
Just to note that I've been experiencing a similar issue (sporadic duplication of schedules, with no consistent way of reproducing it) since at least Prefect v3.4.11.
We reported the issue via [email protected] but the team has struggled to reproduce the error (as we have too!).
We've found the same solution of toggling the schedule on and off as a workaround.
Unlike the OP, we are using Prefect Cloud, and deploying onto long-lived docker containers. Despite this difference, the problem we're experiencing appears the same. We have also been suspecting that a race condition on the schedule creation could be the root-cause.
hi @Spichon and @jwalton3141 - thanks for the reports!
@jwalton3141 do you mind explaining this?
deploying onto long-lived docker containers
do you mean you're running prefect worker start on a long-lived container? if so, what type of worker? not sure that it'd matter as far as the issue here, just curious as we work on a reliable reproduction
this is a bug that has appeared before; I'll try to track it down for reference
we'll look into this!
EDIT: possibly related
- https://github.com/PrefectHQ/prefect/issues/17281
- https://github.com/PrefectHQ/prefect/issues/17538#issuecomment-3141012479
- https://github.com/PrefectHQ/prefect/issues/18523
Thanks @zzstoatzz!
@jwalton3141 do you mind explaining this?
We're doing something similar to what is outlined in the docs here.
Unlike the example in the docs, we're using the prefect.serve() function, not the .deploy() or .serve() methods, to create our deployments.
We're constructing our schedules using prefect.schedules.Cron(), as we want to set the timezone on the schedules.
Here's an outline of what we run in our containers:
flow = prefect.flow.from_source(...)
flow_deployment = flow.to_deployment(
    name="my-deployment",
    schedules=[prefect.schedules.Cron("0 5 * * Mon-Fri", timezone="Europe/London")],
    concurrency_limit=1,
    ...
)
prefect.serve(flow_deployment)
In reality, we build a list of deployments and serve them with a single prefect.serve() call.
Additionally, we actually run two identical long-lived Docker containers in a single Azure container group, each making the same prefect.serve() calls. We were able to use two containers in a single group without issue in v1; however, I did wonder whether this could be causing the scheduled-run duplication we now sporadically experience in v3.
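For reference, the "list of deployments, single serve() call" pattern looks roughly like the sketch below; the names (flow_specs, source, entrypoint, name) are illustrative placeholders rather than our actual code:

import prefect
import prefect.schedules

# flow_specs is a hypothetical list of (name, source, entrypoint) tuples
deployments = [
    prefect.flow.from_source(source=source, entrypoint=entrypoint).to_deployment(
        name=name,
        schedules=[prefect.schedules.Cron("0 5 * * Mon-Fri", timezone="Europe/London")],
        concurrency_limit=1,
    )
    for name, source, entrypoint in flow_specs
]

# one long-lived process serves every deployment
prefect.serve(*deployments)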
Edit: just to include a screenshot I took previously when we experienced this issue (this was back on v3.4.11, but we have experienced it a couple of times since, on newer Prefect versions).
great! thanks for the detail @jwalton3141
i still think this will have more to do with the server side scheduling logic but that is still very helpful context! 🙌
This has happened to us two times already. Can confirm that this wasn't an issue in Prefect 3.0.3 and is a regression that we encountered only after upgrading to 3.4.17.
We are having the same issue, running prefect version 3.4.19 on Kubernetes.
For example, see these two flow runs.
Duplicate flow run 1:
Duplicate flow run 2:
Note the following:
- The idempotency keys of the two runs are not the same:
  - the deployment id part of the key is the same;
  - the deployment schedule id part of the key is different.
- The second run was created 6 seconds after the first, which is close to the 5-second scheduling loop interval.
- At around the same time, we updated the deployment using the deploy function.
We believe that this is what happens:
1. The Scheduler reads the deployment info (before the deployment is updated, so with the old schedule).
2. The deployment is updated and its existing scheduled runs are deleted.
3. The Scheduler creates new scheduled flow runs for the deployment, based on the stale information from step 1.
4. During the next run of the Scheduler (5 seconds later), new flow runs are created based on the updated deployment. The updated deployment has a new schedule id, so the idempotency key is different and the flow runs are not deduplicated.
Hopefully this information helps to solve this bug.
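To illustrate the hypothesis: if the scheduler's idempotency key incorporates the deployment schedule id, two scheduler passes that straddle a redeploy produce different keys for the same expected start time, so deduplication cannot catch the second run. A minimal sketch (the key format below is illustrative only, not the actual server implementation):

from datetime import datetime, timezone
from uuid import uuid4


def idempotency_key(deployment_id, schedule_id, expected_start_time):
    # Illustrative key format; the real construction lives in the server's
    # scheduling code and may differ in detail.
    return f"scheduled {deployment_id} {schedule_id} {expected_start_time.isoformat()}"


deployment_id = uuid4()
old_schedule_id = uuid4()  # schedule attached before the redeploy
new_schedule_id = uuid4()  # the redeploy replaces it with a new schedule id
start = datetime(2025, 1, 6, 5, 0, tzinfo=timezone.utc)

# Pass 1: the scheduler read the deployment just before the update (stale schedule id)
key_1 = idempotency_key(deployment_id, old_schedule_id, start)

# Pass 2, one loop later: the scheduler sees the updated deployment
key_2 = idempotency_key(deployment_id, new_schedule_id, start)

# Different keys, so both runs are inserted for the same expected start time
assert key_1 != key_2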
We are experiencing this issue on v2.20.19
This issue still exists in version 3.4.25.
Any update? This bug makes it so that deployments require manual cleanup.
This also happens to us quite often (v3.4.14, Postgres DB), and the impact is significant: it is not easy to guarantee that multiple pipeline runs starting at exactly the same time (when we expect only one) won't cause unexpected side effects. Deployments may be managed by many different teams and developers who can't reasonably be expected to guard against this when writing their pipelines, and arguably the most basic scheduling requirement, one run per scheduled time, should simply be respected.
@zzstoatzz considering this happens to us every day, I'm happy to try and investigate - can you point me to where we should start looking to debug this and drive a potential fix? I started from the Scheduler service and ended up here - is that the likely culprit function? https://github.com/PrefectHQ/prefect/blob/main/src/prefect/server/models/deployments.py#L681