
Intermittent schedule duplication causing double flow runs

Open Spichon opened this issue 3 months ago • 10 comments

Bug summary

Description

We are experiencing intermittent schedule duplication issues where deployments randomly create duplicate scheduled runs with identical expected start times.

Observed Behavior

  • Scheduled runs are duplicated with the same expected_start_time
  • Issue occurs sporadically across different deployments
  • Multiple flow runs are triggered for the same schedule time
  • Problem appears to resolve temporarily when schedules are toggled (deactivated/reactivated)

Version info

Prefect Server: v3.4.15
Prefect Workers: v3.4.15
Kubernetes deployment using Helm charts

Additional context

Suspected Triggers

  • System actions or maintenance operations on Kubernetes
  • Flow redeployment processes
  • Potential race conditions during schedule updates

Current Workaround

We've implemented a cleanup script that detects duplicate scheduled runs and fixes them by toggling the schedule state:

import asyncio

from prefect import flow, get_run_logger
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import DeploymentFilter, DeploymentFilterTags


@flow
async def fix_duplicate_schedules():
    logger = get_run_logger()
    async with get_client() as client:
        deployments = await client.read_deployments(
            deployment_filter=DeploymentFilter(tags=DeploymentFilterTags(all_=["prd"]))
        )

        for d in deployments:
            logger.info(f"Checking deployment: {d.name} deployment id: {d.id} flow id: {d.flow_id}")
            scheduled_runs = await client.get_scheduled_flow_runs_for_deployments([d.id])

            # Collect expected start times; a repeat means duplicated scheduled runs
            scheduled_run_times = []
            has_double_schedule = False

            for run in scheduled_runs:
                if run.expected_start_time in scheduled_run_times:
                    has_double_schedule = True
                    logger.info(f"Double schedule detected for {d.name} at {run.expected_start_time}")
                    break
                scheduled_run_times.append(run.expected_start_time)

            if has_double_schedule:
                # Toggling each schedule off and on clears the duplicated scheduled runs
                schedules = await client.read_deployment_schedules(d.id)
                for schedule in schedules:
                    logger.info(f"Restarting schedule: {schedule.id}")
                    await client.update_deployment_schedule(d.id, schedule.id, active=False)
                    await client.update_deployment_schedule(d.id, schedule.id, active=True)


if __name__ == "__main__":
    asyncio.run(fix_duplicate_schedules())

Impact

  • Unnecessary resource consumption from duplicate flow runs
  • Potential data processing inconsistencies
  • Manual intervention required to clean up duplicates
  • Operational overhead from monitoring and cleanup processes

Questions

  • Root Cause Analysis: What could be causing schedule duplication in Prefect 3.4.15?
  • Flow Redeployment: Are there known issues with schedule duplication during deployment updates?
  • Database Consistency: Could this be related to PostgreSQL transaction handling or race conditions?
  • Prevention: Are there configuration settings or best practices to prevent schedule duplication?
  • Detection: Is there a built-in mechanism to detect and prevent duplicate schedules?

Spichon avatar Sep 09 '25 06:09 Spichon

Just to note that I've been experiencing a similar issue: sporadic duplication of schedules, with no consistent way of reproducing the issue, since at least Prefect v3.4.11.

We reported the issue via [email protected], but the team has struggled to reproduce the error (as have we!).

We've found the same solution of toggling the schedule on and off as a workaround.

Unlike the OP, we are using Prefect Cloud and deploying onto long-lived Docker containers. Despite this difference, the problem we're experiencing appears to be the same. We also suspect that a race condition during schedule creation could be the root cause.

jwalton3141 avatar Sep 09 '25 09:09 jwalton3141

hi @Spichon and @jwalton3141 - thanks for the reports!

@jwalton3141 do you mind explaining this?

deploying onto long-lived docker containers

do you mean you're running prefect worker start on a long-lived container? if so, what type of worker? not sure that it'd matter as far as the issue here goes, just curious as we work on a reliable reproduction

this is a bug that has appeared before, I'll try to track that down for reference

we'll look into this!

EDIT: possibly related

  • https://github.com/PrefectHQ/prefect/issues/17281
  • https://github.com/PrefectHQ/prefect/issues/17538#issuecomment-3141012479
  • https://github.com/PrefectHQ/prefect/issues/18523

zzstoatzz avatar Sep 09 '25 16:09 zzstoatzz

Thanks @zzstoatzz!

@jwalton3141 do you mind explaining this?

We're doing something similar to what is outlined in the docs here.

Unlike the example in the docs, we're using the prefect.serve() function, not the .deploy() or .serve() methods, to create our deployments.

We're constructing our schedules using prefect.schedules.Cron(), as we want to set the timezone on the schedules.

Here's an outline of what we run in our containers:

import prefect
import prefect.schedules

flow = prefect.flow.from_source(...)

flow_deployment = flow.to_deployment(
    name="my-deployment",
    schedules=[prefect.schedules.Cron("0 5 * * Mon-Fri", timezone="Europe/London")],
    concurrency_limit=1,
    ...
)

prefect.serve(flow_deployment)

In reality, we build a list of deployments and serve them with a single prefect.serve() call.
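
For anyone trying to reproduce this, a minimal self-contained sketch of that pattern is below (the flow names and cron strings are made up for illustration, and the flows are defined inline rather than loaded via from_source):

import prefect
import prefect.schedules


@prefect.flow
def etl_a():
    print("running etl_a")


@prefect.flow
def etl_b():
    print("running etl_b")


deployments = [
    etl_a.to_deployment(
        name="etl-a",
        schedules=[prefect.schedules.Cron("0 5 * * Mon-Fri", timezone="Europe/London")],
        concurrency_limit=1,
    ),
    etl_b.to_deployment(
        name="etl-b",
        schedules=[prefect.schedules.Cron("30 6 * * Mon-Fri", timezone="Europe/London")],
        concurrency_limit=1,
    ),
]

# serve() takes deployments as positional arguments, so unpack the list
prefect.serve(*deployments)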

Additionally, we actually run two identical long-lived Docker containers in a single Azure container group, both making the same prefect.serve() calls. We were able to run two containers in a single group without issue in v1; however, I did wonder whether this could be causing the scheduled-run duplication we now sporadically experience in v3.


Edit: just to include a screenshot I took previously when we experienced this issue (this was back on v3.4.11, but we have experienced it a couple of times since, on newer Prefect versions).

Image

jwalton3141 avatar Sep 10 '25 07:09 jwalton3141

great! thanks for the detail @jwalton3141

i still think this will have more to do with the server side scheduling logic but that is still very helpful context! 🙌

zzstoatzz avatar Sep 10 '25 11:09 zzstoatzz

This has happened to us two times already. Can confirm that this wasn't an issue in Prefect 3.0.3 and is a regression that we encountered only after upgrading to 3.4.17.

ubiquitousbyte avatar Oct 15 '25 08:10 ubiquitousbyte

We are having the same issue, running prefect version 3.4.19 on Kubernetes.

For example, see these two flow runs.

Duplicate flow run 1: Image

Duplicate flow run 2: Image

Note the following:

  • The idempotency keys of the two runs are not the same (see the sketch after this list):
    • The deployment id part of the key is the same;
    • The deployment schedule id part of the key is different.
  • The second run was created 6 seconds after the first run, which is close to the 5-second scheduling loop interval.
  • At around the same time, we updated the deployment using the deploy function.
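
A hedged sketch of how one could surface the colliding scheduled runs for a deployment and compare their creation times and idempotency keys (assuming the client filter classes below; adjust to your version):

from collections import defaultdict

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterDeploymentId,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType


async def show_colliding_runs(deployment_id):
    # Group SCHEDULED runs of one deployment by expected start time and print collisions
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                deployment_id=FlowRunFilterDeploymentId(any_=[deployment_id]),
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.SCHEDULED])
                ),
            )
        )
        by_start = defaultdict(list)
        for run in runs:
            by_start[run.expected_start_time].append(run)
        for start, group in by_start.items():
            if len(group) > 1:
                for run in group:
                    # Creation times and idempotency keys of the colliding runs
                    print(start, run.id, run.created, run.idempotency_key)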

We believe that this is what happens:

  1. The Scheduler reads the deployment info (before the deployment is updated, so with the old schedule).
  2. The deployment is updated and existing scheduled runs are deleted.
  3. The Scheduler creates new scheduled flow runs for the deployment, based on old information from step 1.
  4. During the next run of the Scheduler (5 seconds later), new flow runs are created for the deployment, based on the updated deployment. The updated deployment has a new schedule id, therefore the idempotency key is different and the flow runs are not deduplicated.
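
To make step 4 concrete, a toy illustration of why deduplication fails once the schedule id changes (the key shape below is an assumption for illustration, not the exact server format):

from uuid import uuid4

deployment_id = uuid4()
old_schedule_id = uuid4()  # schedule the Scheduler saw before the redeploy
new_schedule_id = uuid4()  # schedule created by the redeploy, with a fresh id
slot = "2025-10-20T08:00:00+00:00"

# Illustrative key shape only: a key derived from deployment id, schedule id and slot
old_key = f"scheduled {deployment_id} {old_schedule_id} {slot}"
new_key = f"scheduled {deployment_id} {new_schedule_id} {slot}"

# Deduplication compares whole keys, so two runs for the same slot both survive
assert old_key != new_key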

Hopefully this information helps to solve this bug.

roy-wolters-jumbo avatar Oct 20 '25 08:10 roy-wolters-jumbo

We are experiencing this issue on v2.20.19

jeffrose-relay avatar Oct 28 '25 17:10 jeffrose-relay

This issue still exists in version 3.4.25.

shdxiang avatar Oct 31 '25 05:10 shdxiang

Any update? This bug makes it so that deployments require manual cleanup.

khu-taproot avatar Nov 25 '25 14:11 khu-taproot

This also happens to us quite often (v3.4.14, Postgres DB), and it is very impactful: it is not easy to ensure that multiple data pipelines/processes running at exactly the same time (when we expect only one run) won't cause unexpected side effects. Deployments may be managed by many different teams/devs who can't necessarily spend time accounting for this when writing their pipelines, and arguably one would expect the most basic scheduling requirement to be respected.

@zzstoatzz considering it happens to us every day, I'm happy to try and investigate - can you point me to where we should start looking to debug and drive a potential fix? I started from the Scheduler service and ended up here - is this the likely culprit function? https://github.com/PrefectHQ/prefect/blob/main/src/prefect/server/models/deployments.py#L681

TheoBabilon avatar Dec 04 '25 08:12 TheoBabilon