Infrastructure stays running after Flow is marked Crashed by failed Lease renewal
Bug summary
I'm experiencing zombie pods on Kubernetes that keep running after the flow has been marked Crashed due to a failed lease renewal. I believe the flow handler is not exiting properly in this case and is stuck. Unfortunately, I don't have a reliable reproduction script yet. I guess this case could also be handled by the kopf observer in the worker.
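As a rough illustration of that idea (this is not Prefect's actual observer code; the pod label, polling interval, and state check are assumptions), a worker-side kopf handler could reap pods whose flow run is already in a terminal state:

```python
import uuid

import kopf
from kubernetes import client as k8s, config as k8s_config
from prefect import get_client

# Label assumed to be set on flow-run pods; verify against your worker's job template.
FLOW_RUN_LABEL = "prefect.io/flow-run-id"


@kopf.on.startup()
def load_kube_config(**_):
    # Use in-cluster credentials inside Kubernetes, fall back to a local kubeconfig otherwise.
    try:
        k8s_config.load_incluster_config()
    except k8s_config.ConfigException:
        k8s_config.load_kube_config()


@kopf.timer("", "v1", "pods", interval=60, labels={FLOW_RUN_LABEL: kopf.PRESENT})
async def reap_orphaned_flow_pods(name, namespace, labels, logger, **_):
    """Delete pods whose flow run has already reached a terminal state (e.g. CRASHED)."""
    flow_run_id = uuid.UUID(labels[FLOW_RUN_LABEL])
    async with get_client() as prefect_client:
        flow_run = await prefect_client.read_flow_run(flow_run_id)
    if flow_run.state and flow_run.state.is_final():
        logger.warning(
            "Flow run %s is %s but pod %s is still running; deleting it",
            flow_run_id, flow_run.state.type, name,
        )
        k8s.CoreV1Api().delete_namespaced_pod(name=name, namespace=namespace)
```

(Something like this would be run with `kopf run observer.py`; Prefect's built-in observer may work quite differently.)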
Version info
Version: 3.4.15
API version: 0.8.4
Python version: 3.13.7
Git commit: 95f11540
Built: Fri, Aug 29, 2025 05:13 PM
OS/Arch: linux/x86_64 (Kubernetes)
Profile: ephemeral
Server type: server
Pydantic version: 2.11.7
Additional context
No response
Thanks for the issue @marcm-ml! What does your Prefect server setup look like? If you're running multiple servers or background services separately, you'll want to ensure that you're using Redis for your lease storage. You can also trigger a lease renewal failure by deleting the lease record in Redis. Either way, I should be able to reproduce and submit a fix in the next week or so.
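For anyone who wants to trigger the failure the same way, here is a rough sketch of that manual lease deletion with redis-py. The connection URL and especially the key pattern are assumptions; inspect your own Redis database to see how your Prefect version actually names its lease records:

```python
import redis

# Connect to the Redis instance backing the Prefect server's lease storage.
# Adjust the URL for your deployment.
r = redis.Redis.from_url("redis://localhost:6379/0")

# List candidate lease keys first so you can see what you are about to delete.
# The match pattern is a guess; broaden or narrow it based on what SCAN shows you.
lease_keys = [key.decode() for key in r.scan_iter(match="*lease*")]
print(lease_keys)

# Deleting a lease record mid-run should make the next renewal attempt fail,
# which is the condition that drives the flow run into CRASHED in this issue.
for key in lease_keys:
    r.delete(key)
```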
Hey, this issue occurred on both a multi-server setup with Redis and a single-server setup without Redis.
I noticed that this happens either when a pod takes longer than expected to start (around 5 minutes) or when a very long-running Prefect job (>30 min) coincides with a server restart. Hope this helps.
Any update @desertaxle? It's becoming very hazardous to schedule our flows with Prefect...
Currently, we have rolled back to client version 3.4.8 to avoid using the new lease renewal API.
I haven't been able to reproduce this issue in my Kubernetes setup. I can cause a lease renewal failure by deleting the concurrency limit lease either before or after the flow run starts. In both cases, I see the flow run enter a CRASHED state and the pod exits, as expected.
I'll need more info to find the root cause of this issue. If you see this issue, please provide the following:
- The `prefect` version you're using to run your flow
- The Python version you're using to run your flow
- The Docker image you're using to run your flow
- A list of processes running in the zombie pod
To get the processes running in a pod container, you can run:

```shell
kubectl debug -it pod/<pod-name> --image=busybox --target=prefect-job -- sh -lc 'ps -ef'
```

and replace `<pod-name>` with the name of the zombie pod.
Hello, the Prefect version is 3.4.15, Python is 3.12.3, and the Docker image is prefecthq/prefect:3.4.15-python3.12-kubernetes. I don't have any zombie pods right now, as we downgraded to 3.4.8...
@desertaxle How do you trigger the renewal failure? For me, this happens relatively reliably if I kill the prefect-server instance while a flow is running. In this case, the setup is a single prefect server without redis. The pod where the flow is running becomes a zombie.
This has happened to me with at least the last three Prefect patch versions. Python is 3.12. I am running a custom distroless image, but it contains exactly the same contents as the official Prefect images, and I am able to start the server, worker, and flows with the same image.
Maybe this helps: a timeline showing when prefect-server is available and when lease renewal errors occur in the flow logs.
@marcm-ml I replicate it by running a setup with Redis and deleting the lease from the Redis DB. Without Redis, the leases are stored in memory, so it makes sense that you get lease renewal failures on server restart, because the leases are lost.
The part that I can't replicate is the zombie pod part. When I trigger a lease renewal failure, I get a crash and the pod stops running. Do you create any threads or subprocesses in the flows where you're seeing zombie pods? It's possible that if those aren't cleaned up correctly, that would lead to a zombie pod.
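To make the cleanup point concrete, here is a minimal, Prefect-free illustration of how a single stray non-daemon thread keeps a process (and therefore its pod) alive after the main work has finished; an unreaped subprocess behaves the same way:

```python
import threading
import time


def background_work():
    # Simulate a helper thread with no shutdown signal (e.g. a poller).
    # Because it is a non-daemon thread, the interpreter will not exit
    # until it returns, even after the "flow" code below is done.
    while True:
        time.sleep(60)


threading.Thread(target=background_work, daemon=False).start()
print("main work finished")  # the process still stays alive -> zombie pod
```

Marking such helpers as daemon threads, or signalling them to stop and joining them during shutdown, avoids this.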
It happens with a variety of flows, but none of them create subprocesses. They also use the default task runner (ThreadPoolTaskRunner). The images are started as follows: "dumb-init -g -- python -m prefect.engine". After switching to a single-server setup, it happens less often, but I can't say whether it never happens, since we have a zombie killer that removes those pods after a few minutes. Each pod also has activeDeadlineSeconds set.
@marcm-ml I think there might be an unwanted interaction between your init script and the entrypoint baked into the prefecthq/prefect Docker images. Can you share what your init script looks like?
There is no init script; dumb-init is just like tini in the Prefect image. I think there could be a difference between the two, but I doubt it, since these binaries simply handle signals. How do Prefect flows exit in the case of a lease failure?
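I can't speak for Prefect's actual implementation, but the general shape of a lease-renewal watcher that forces an exit tends to look like the sketch below (all names here are illustrative, not Prefect APIs). The relevant detail for this issue is the last step: if other non-daemon threads or child processes ignore the shutdown signal, the pod keeps running even though the run was marked Crashed:

```python
import os
import signal
import threading
import time


def renew_lease() -> bool:
    """Stand-in for the real renewal call against the API; returns False when renewal fails."""
    return True  # pretend renewal keeps succeeding so this sketch is harmless to run


def lease_watcher(interval: float = 30.0) -> None:
    # Keep renewing in the background; on failure, ask the whole process to shut down.
    # Whether shutdown actually completes depends on every other thread and
    # subprocess honouring SIGTERM and exiting.
    while True:
        time.sleep(interval)
        if not renew_lease():
            os.kill(os.getpid(), signal.SIGTERM)
            return


threading.Thread(target=lease_watcher, daemon=True, name="lease-watcher").start()
```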
We are experiencing a similar issue on Prefect Server 3.4.13 and ECS, but the problem is not a zombie flow; rather, we are unable to start a flow at all. It always crashes with "Concurrency lease renewal failed - slots are no longer reserved."
In the logs I see:

```
2025-09-22 09:28:13 - prefect.server.services.repossessor - INFO - Revoking 1 expired leases
2025-09-22 09:28:13 - prefect.server.services.repossessor - INFO - Revoking lease 122f0222-d2fd-44d5-909a-bcb12050497b for 1 concurrency limits with 1 slots
```
It happens 1 minute after the flow is initiated. It takes about 2-3 minutes for the ECS task with the flow to start, so maybe this is the problem.
After following @mz-jy's comment and reverting to Server 3.4.8, the issue disappeared.
see #17415
We have the same issue as @himos above, but running on Azure Container Instances (which also sometimes take a while to start up). The flow crashes on start-up. Is there a way to extend the timeout to the first heartbeat?
I can replicate locally by adding a delay to the Dockerfile:

```dockerfile
# Add a startup delay script to force concurrency lease timeout
RUN echo '#!/bin/bash\necho "Waiting 60 seconds to force concurrency lease timeout..."\nsleep 60\necho "Delay complete, starting flow execution"\nexec "$@"' > /entrypoint-delay.sh && \
    chmod +x /entrypoint-delay.sh

# Use the delay script as entrypoint
ENTRYPOINT ["/entrypoint-delay.sh"]
```
Sorry for contributing to the tangent (flows crashing due to initial timeout vs infra stays running), but searching for my error led me here too so maybe it helps others.
I've opened https://github.com/PrefectHQ/prefect/pull/19058, which enables setting the initial lease timeout duration and fixes the issue where flows that are slow to start crash because the lease times out.
Facing the same problem here. The server and workers run in Docker containers, and it concerns flows that have a concurrency limit set either in the to_deployment method or via concurrency_limit in the prefect.yaml file (see the sketch at the end of this comment).
It seems to be related to another issue we've had since upgrading: many deployments started to double-trigger (even with concurrency_limit set to 1). The error happens at the end of the flow run, when the concurrency slot should be released.
We're also seeing Docker containers still running for those crashed flow runs.
Rolled back to 3.4.8 for the moment.
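For context, a minimal version of how the limit is set via to_deployment might look like the sketch below; the flow and deployment names are illustrative, and our real deployments go through a worker and prefect.yaml, but the concurrency_limit parameter is the same:

```python
from prefect import flow, serve


@flow
def etl():
    ...


if __name__ == "__main__":
    # The deployment-level concurrency limit is what the lease in this issue backs.
    serve(etl.to_deployment(name="etl-nightly", concurrency_limit=1))
```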