
Dangling flow runs from deleted schedules cause duplicates

Open Ultramann opened this issue 8 months ago • 6 comments

Bug summary

We've been experiencing incorrectly duplicated flow runs executing at the same time on our self-hosted Prefect server. This is an issue for us because some of our flows require resource locks, so duplicated runs create a race condition.

Digging into the Prefect database, I found that the duplicates have different idempotency keys, and the difference is the schedule id: the duplicate flow runs refer to schedules that no longer exist.

with cte as (
  select deployment_id, expected_start_time
  from flow_run
  group by deployment_id, expected_start_time
  having count(*) > 1
)
select cte.deployment_id, fr.created, fr.state_type, fr.idempotency_key, fr.created_by, ds.id is not null as schedule_exists
from flow_run fr
join cte
on fr.deployment_id = cte.deployment_id
and fr.expected_start_time = cte.expected_start_time
left join deployment_schedule ds
on created_by->>'id' = ds.id::text
order by cte.deployment_id, cte.expected_start_time

Below you can see some flow run records from this query. The first two rows show duplicate runs where one failed and the other completed. This isn't remarkable, as we expect deployment schedules to be deleted over time. The next two rows, however, show duplicate scheduled flow runs where one has an existing schedule and the other refers to a schedule that no longer exists. Because they refer to different schedules, the idempotency keys are different and therefore the duplicates aren't detected.

[Image: screenshot of the duplicate flow run records returned by the query above]
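For illustration, the idempotency keys in our database appear to have the shape "scheduled <deployment_id> <schedule_id> <expected_start_time>" (you can see this shape in the query output later in this thread), so two runs for the same slot only dedupe if they cite the same schedule. A minimal sketch, using ids from our data:

# Illustration only: the key shape is inferred from our database contents,
# not from Prefect's source. Same deployment and slot, different schedule ids.
deployment_id = "16cba0e8-5672-419a-b5e2-6442d63f364e"
old_schedule_id = "a185da86-b19a-4d5f-b6cc-02201c3d3e6b"  # since deleted
new_schedule_id = "94bbb4f1-ad91-4783-b653-b1b4934391b7"
slot = "2025-04-03 14:46:00+00:00"

old_key = f"scheduled {deployment_id} {old_schedule_id} {slot}"
new_key = f"scheduled {deployment_id} {new_schedule_id} {slot}"
assert old_key != new_key  # different keys, so the duplicate isn't caught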

If I'm reading the source code correctly, running prefect deploy for a deployment that already exists should result in the old scheduled flow runs getting deleted (see the code path below), so I'm not sure why this isn't happening consistently for my scheduled runs.

Code path:

  1. prefect deploy
  2. _run_single_deploy
  3. RunnerDeployment.apply
  4. RunnerDeployment._update
  5. DeploymentAsyncClient.update_deployment
  6. update_deployment
  7. models.deployments.update_deployment
  8. _delete_scheduled_runs

I might be missing something in that order, but I'm pretty sure it's close, because I see logs for PATCH requests to /api/deployments (number 6 above) on the server during deployment. Note: we don't have slugs for our schedules.
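To pin down what I expect step 8 to do, here's an in-memory restatement of my reading of _delete_scheduled_runs (a sketch under my assumptions -- the real version is a SQLAlchemy delete, and the auto_scheduled/SCHEDULED filter is my inference, not confirmed from the source):

# Toy model of the deletion I expect during a redeploy.
def delete_scheduled_runs(flow_runs, deployment_id):
    """Drop not-yet-started, auto-scheduled runs for a deployment."""
    return [
        r for r in flow_runs
        if not (
            r["deployment_id"] == deployment_id
            and r["state_type"] == "SCHEDULED"
            and r["auto_scheduled"]
        )
    ]

runs = [
    {"deployment_id": "dep-1", "state_type": "SCHEDULED", "auto_scheduled": True},
    {"deployment_id": "dep-1", "state_type": "PENDING", "auto_scheduled": True},
]
# A PENDING run would survive this filter.
assert delete_scheduled_runs(runs, "dep-1") == [runs[1]]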

Reproduction

I have tried and failed to create a minimal repro. I can't tell if this is because the issue is related to the way we deploy, which is difficult to reproduce minimally; or related to something else in our setup that doesn't exist in a local repro; or if it just occurs randomly with low probability and I haven't gotten "lucky" enough to reproduce it. I can say that this happens consistently during our deployments.

Version info

Version:             3.2.11
API version:         0.8.4
Python version:      3.11.11
Git commit:          9481694f
Built:               Wed, Mar 5, 2025 10:00 PM
OS/Arch:             linux/x86_64
Profile:             ephemeral
Server type:         ephemeral
Pydantic version:    2.11.1
Server:
  Database:          postgresql
Integrations:
  prefect-gcp:       0.6.4

Ultramann avatar Apr 02 '25 21:04 Ultramann

Thank you for the detailed issue @Ultramann!! I've seen someone report this before and I can't figure it out -- if _delete_scheduled_runs fails it should cause the PATCH update request to fail, so I don't understand how some of these runs stick around.

One possibility is that the runs are actually in a PENDING state at the time of the update -- when you see this happen, are the duplicates always "earlier" in the list of scheduled runs?

cicdw avatar Apr 02 '25 22:04 cicdw

@cicdw, thanks for the quick response. It seems the duplicates have no correlation with their order in the list; that is, I've found duplicated flow runs where the one with a deleted schedule is first in the list, and other pairs where it's second in the list.

Since the PENDING state seemed a reasonable guess, I tried to reproduce the issue locally by forcing an artificially long PENDING period via adding a time.sleep here. That gave me a large window to recreate a deployment, on a '* * * * *' cron, while one of its scheduled runs was pending. Unfortunately, no repro.
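For concreteness, the local setup was just a minutely deployment along these lines (a generic sketch; the flow and deployment names are made up, and the exact deploy mechanism shouldn't matter for the repro attempt):

from prefect import flow

@flow
def noop():
    pass

if __name__ == "__main__":
    # Serve on a minutely cron; redeploying while a run sits in PENDING is
    # the window in which the duplicate would have to appear.
    noop.serve(name="pending-repro", cron="* * * * *")

I ran this query before and after doing the redeploy: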

select fr.expected_start_time, fr.state_type, fr.idempotency_key, ds.id is not null as schedule_exists
from flow_run fr
left join deployment_schedule ds
on fr.created_by->>'id' = ds.id
where fr.state_type != 'COMPLETED'
order by fr.expected_start_time
limit 5;

Here's the result before

expected_start_time         state_type  idempotency_key                                                                                                schedule_exists
--------------------------  ----------  -------------------------------------------------------------------------------------------------------------  ---------------
2025-04-03 14:45:00.000000  PENDING     scheduled 16cba0e8-5672-419a-b5e2-6442d63f364e a185da86-b19a-4d5f-b6cc-02201c3d3e6b 2025-04-03 14:45:00+00:00  1
2025-04-03 14:46:00.000000  SCHEDULED   scheduled 16cba0e8-5672-419a-b5e2-6442d63f364e a185da86-b19a-4d5f-b6cc-02201c3d3e6b 2025-04-03 14:46:00+00:00  1
2025-04-03 14:47:00.000000  SCHEDULED   scheduled 16cba0e8-5672-419a-b5e2-6442d63f364e a185da86-b19a-4d5f-b6cc-02201c3d3e6b 2025-04-03 14:47:00+00:00  1
2025-04-03 14:48:00.000000  SCHEDULED   scheduled 16cba0e8-5672-419a-b5e2-6442d63f364e a185da86-b19a-4d5f-b6cc-02201c3d3e6b 2025-04-03 14:48:00+00:00  1
2025-04-03 14:49:00.000000  SCHEDULED   scheduled 16cba0e8-5672-419a-b5e2-6442d63f364e a185da86-b19a-4d5f-b6cc-02201c3d3e6b 2025-04-03 14:49:00+00:00  1

and after

expected_start_time         state_type  idempotency_key                                                                                                schedule_exists
--------------------------  ----------  -------------------------------------------------------------------------------------------------------------  ---------------
2025-04-03 14:45:00.000000  PENDING     scheduled 16cba0e8-5672-419a-b5e2-6442d63f364e a185da86-b19a-4d5f-b6cc-02201c3d3e6b 2025-04-03 14:45:00+00:00  0
2025-04-03 14:46:00.000000  SCHEDULED   scheduled 16cba0e8-5672-419a-b5e2-6442d63f364e 94bbb4f1-ad91-4783-b653-b1b4934391b7 2025-04-03 14:46:00+00:00  1
2025-04-03 14:47:00.000000  SCHEDULED   scheduled 16cba0e8-5672-419a-b5e2-6442d63f364e 94bbb4f1-ad91-4783-b653-b1b4934391b7 2025-04-03 14:47:00+00:00  1
2025-04-03 14:48:00.000000  SCHEDULED   scheduled 16cba0e8-5672-419a-b5e2-6442d63f364e 94bbb4f1-ad91-4783-b653-b1b4934391b7 2025-04-03 14:48:00+00:00  1
2025-04-03 14:49:00.000000  SCHEDULED   scheduled 16cba0e8-5672-419a-b5e2-6442d63f364e 94bbb4f1-ad91-4783-b653-b1b4934391b7 2025-04-03 14:49:00+00:00  1

Some random thoughts I had:

  • The PENDING state cause seemed plausible to me because it would result in somewhat random occurrences.

  • Other things that could have more stochastic behavior would be concurrent/parallel processes

    • Are there any during prefect deploy? I don't see any, but thought I'd check.
    • Maybe they could be processes related to server start up? We recreate updated deployments right after the server starts. Could that be causing background processes to validate/update/recreate the existing scheduled runs?
    • If you look at the original screenshot I took of the db state, you'll see that the created times for the duplicates are very close, but I don't know what's creating the run for the old schedule.
  • This sequence of code seemed odd to me. In update_deployment (https://github.com/PrefectHQ/prefect/blob/3f0c87e32e54e1f184df275c4263705b9ad90b6e/src/prefect/server/api/deployments.py#L303), the call into models.deployments.update_deployment creates new deployment schedules at the end of that function (https://github.com/PrefectHQ/prefect/blob/3f0c87e32e54e1f184df275c4263705b9ad90b6e/src/prefect/server/models/deployments.py#L275-L288), but back in the API's update_deployment, after models.deployments.update_deployment returns, we create schedules again (https://github.com/PrefectHQ/prefect/blob/3f0c87e32e54e1f184df275c4263705b9ad90b6e/src/prefect/server/api/deployments.py#L315-L327).

    There's likely a good reason for this duplication, and maybe SQLAlchemy handles it gracefully, but the duplicated code caught my eye; a toy restatement follows this list.
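To restate the shape that caught my eye in plain Python (not Prefect's actual code -- whether the second creation actually fires is exactly the question):

# Toy model: the model layer creates schedules, then the API layer appears to
# create them again after the model call returns.
created = []

def create_schedules(deployment_id, schedules):
    created.extend((deployment_id, s) for s in schedules)

def models_update_deployment(deployment_id, schedules):
    # ... update deployment fields ...
    create_schedules(deployment_id, schedules)  # creation #1, model layer

def api_update_deployment(deployment_id, schedules):
    models_update_deployment(deployment_id, schedules)
    create_schedules(deployment_id, schedules)  # creation #2, API layer

api_update_deployment("dep-1", ["* * * * *"])
# Naively the schedule is created twice; presumably something guards
# against this in practice.
assert len(created) == 2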

Ultramann avatar Apr 03 '25 15:04 Ultramann

@cicdw @zzstoatzz any other ideas on this? I'm happy to test out hypotheses. Many thanks.

Ultramann avatar Apr 16 '25 14:04 Ultramann

Thanks for the nudge @Ultramann - I've been traveling so haven't had time to go much deeper yet. Your theory is a great observation, but unfortunately I don't think it explains the behavior: if schedules_to_create is non-null, then deployment.schedules has been set to None here.

However, I did look a little deeper while considering your theory and realized that the following could occur during the update: the scheduler service may read the old schedules and compute new runs for them concurrently with the update, then insert those runs after _delete_scheduled_runs has already cleared the existing ones, leaving behind scheduled runs keyed to a schedule that no longer exists.

This might even be more likely to occur on server startup because the scheduler will be running more aggressively.
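To make the suspected interleaving concrete, here's a Prefect-free toy of it (the idempotency-key shape mirrors your query output; everything else is illustrative):

runs = {}  # idempotency_key -> (schedule_id, expected_start_time)

def insert_run(deployment_id, schedule_id, start):
    key = f"scheduled {deployment_id} {schedule_id} {start}"
    runs.setdefault(key, (schedule_id, start))

deployment, old_schedule, new_schedule = "dep-1", "sched-old", "sched-new"

# 1. The scheduler reads schedules and computes upcoming runs for the OLD one.
pending_inserts = [(deployment, old_schedule, "14:46")]

# 2. The deployment update runs: _delete_scheduled_runs clears the SCHEDULED
#    runs, the old schedule is replaced, and runs are created for the NEW one.
runs.clear()
insert_run(deployment, new_schedule, "14:46")

# 3. The scheduler now writes the runs it computed in step 1. Their keys embed
#    the old schedule id, so the idempotency check doesn't catch them.
for args in pending_inserts:
    insert_run(*args)

assert len(runs) == 2  # two runs for the same expected_start_time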

To test this theory, you could either compare timestamps of scheduler execution with deployment updates, or disable the scheduler temporarily during updates and see if that makes the issue go away.

I'm curious to hear if that hypothesis helps you recreate or learn more!

cicdw avatar Apr 25 '25 20:04 cicdw

Hi @cicdw, thanks for the thoughts! I was at an offsite last week, so I haven't dug into this much. Instead, I did the laziest thing I could think of to test whether there was a conflict between the updating and scheduling operations: I waited a long time (5 minutes) after starting the prefect server before performing updates. I'm still seeing the issue, so either there's not much of a relationship between server startup and my issue, or I didn't wait long enough, or, I'm sure, other things.

To test this theory, you could either compare time stamps of scheduler execution with deployment updates, or disable the scheduler temporarily during updates

I'd like to test this idea; where do I find those timestamps, and where is the scheduler triggered, so I can disable it?

Thanks again!

Ultramann avatar May 08 '25 15:05 Ultramann

I feel like we're getting close! Here are the two pieces of info you'll need:

  • you can disable the scheduler by setting the environment variable / Prefect setting PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=False
  • to see scheduler execution timestamps, set PREFECT_LOGGING_LEVEL='DEBUG' and you should see logs of the form "About to run Scheduler..." with an associated timestamp, and then a "Scheduled N runs." INFO-level log when the scheduler loop is complete
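For example (a sketch -- set these however your server process actually gets its environment):

import os

# Disable the scheduler service and turn on debug logging in the environment
# that launches the server (e.g., before invoking `prefect server start`).
os.environ["PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED"] = "False"
os.environ["PREFECT_LOGGING_LEVEL"] = "DEBUG"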

cicdw avatar May 12 '25 21:05 cicdw

@cicdw, sorry for the recent radio silence. When you provided info on the PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED setting, it occurred to me that we occasionally run a prefect server outside of prod that connects to our prod prefect database for testing things. I disabled scheduling in those situations, and we haven't seen any instances of this issue for a month. So it seems the root cause had to do with our usage of prefect, not an issue with prefect itself. Thanks so much for the help getting to the bottom of this!

Ultramann avatar Jun 18 '25 15:06 Ultramann