
Tasks stuck in Running state while flow is in Crashed state while using Kubernetes

Open · simone201 opened this issue 4 weeks ago · 1 comment

Bug summary

Hi Prefect Team,

I'm using a self-hosted Prefect Server setup in Kubernetes, deployed using the official Helm Chart (version 2025.11.24182903). The infrastructure setup is the following:

  • AWS EKS (Auto Mode) cluster
  • AWS RDS (PostgreSQL) instance
  • AWS ElastiCache (Redis) instance

The running components are:

  • Prefect Server 3.6
  • Background services split out into their own deployments
  • A single Prefect worker (deployed via the same Helm chart version)

All the workloads are running in the same namespace (prefect) and have the network policies needed to allow full ingress and egress access across the whole namespace. I can provide the Helm chart values if needed.

I'm using the following prefect.yaml file to deploy my flow (for reference):

# Generic metadata about this project
name: api-extractor
prefect-version: 3.6.4

# build section allows you to manage and build docker images
build:
- prefect_docker.deployments.steps.build_docker_image:
    id: build_image
    requires: prefect-docker>=0.3.1
    image_name: 0000.dkr.ecr.us-xxxx-1.amazonaws.com/extractor/api-extractor
    tag: 0.0.1
    dockerfile: Dockerfile
    platform: linux/amd64

# push section allows you to manage if and how this project is uploaded to remote locations
push:
- prefect_docker.deployments.steps.push_docker_image:
    requires: prefect-docker>=0.3.1
    image_name: '{{ build_image.image_name }}'
    tag: '{{ build_image.tag }}'

# pull section allows you to provide instructions
pull:
- prefect.deployments.steps.set_working_directory:
    directory: /app

# the deployments section allows you to provide configuration for deploying flows
deployments:
- name: extractor-deployment
  version: 1.0.0
  description: This deployment orchestrates extractor
  schedule: {cron: "00 18 * * *", slug: "utc-schedule", timezone: "UTC", active: true}
  flow_name: extractor
  entrypoint: flows/extractor.py:extract_from_api
  parameters:
    sources: 
      - users
      - transactions
    targets:
      sessions: "dev-users"
      events: "dev-transactions"
    target_type: "s3"
    output_format: "json"
    start_time: yesterday
    end_time: yesterday
  work_pool:
    name: extractor-work-pool
    work_queue_name: null
    job_variables:
      image: '{{ build_image.image }}'
      finished_job_ttl: 100
      memory: 16Gi
      image_pull_policy: Always
      service_account_name: extractor-account
      node_selector:
        karpenter.sh/capacity-type: on-demand
      env:
        PREFECT_RUNNER_HEARTBEAT_FREQUENCY: "30"
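
For reference, this is a minimal, purely hypothetical sketch of what the entrypoint flows/extractor.py:extract_from_api might look like; the signature mirrors the parameters section above, and the task bodies are placeholders rather than my real extraction logic:

from prefect import flow, task

@task
def extract_source(source: str, start_time: str, end_time: str) -> list[dict]:
    # Placeholder: call the external API for one source and return raw records
    ...

@task
def write_records(records: list[dict], target: str, target_type: str, output_format: str) -> None:
    # Placeholder: write the records to the target (e.g. S3) in the requested format
    ...

@flow(name="extractor")
def extract_from_api(
    sources: list[str],
    targets: dict[str, str],
    target_type: str = "s3",
    output_format: str = "json",
    start_time: str = "yesterday",
    end_time: str = "yesterday",
):
    for source in sources:
        records = extract_source(source, start_time, end_time)
        write_records(records, targets.get(source, source), target_type, output_format)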

The issue appears to affect only flow runs that crash for external reasons (e.g. OOM). Here's how to reproduce it:

  1. Configure a memory-intensive flow with a low memory limit (a minimal sketch is included after this list)
  2. Run it in a Kubernetes Work Pool
  3. Wait for the K8s Job to spawn, and the Pod to run
  4. After some time, the Pod is killed because it ran out of memory (OOMKilled state in Kubernetes)
  5. The flow state in Prefect is correctly updated to Crashed
  6. The flow's task run states are stuck in Running (even though nothing is executing anymore)
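
For illustration, here is a minimal sketch of a flow that reliably reproduces the OOM kill when the job's memory limit is set low; the flow and parameter names are hypothetical and not part of my actual project:

from prefect import flow, task

@task
def allocate(num_mib: int) -> int:
    # Hold roughly num_mib MiB in memory at once to exceed the Pod's limit
    blocks = [bytearray(1024 * 1024) for _ in range(num_mib)]
    return len(blocks)

@flow
def oom_repro(num_mib: int = 8192):
    # With a Pod memory limit well below num_mib, the kubelet OOM-kills the Pod:
    # the flow run is then reported as Crashed, but the task run stays Running.
    allocate(num_mib)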

The only workaround I've found is to manually delete the task runs from the crashed flow run. I would expect the task runs to follow the flow run's state: they execute in the same Pod, so if the flow run crashes, its task runs should be marked as Crashed as well.
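
As an alternative to deleting them (just a sketch, assuming the standard Prefect client API and a placeholder flow run ID), the stuck task runs can be force-moved to Crashed like this:

import asyncio
from uuid import UUID

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterId,
    TaskRunFilter,
    TaskRunFilterState,
    TaskRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Crashed

async def crash_stuck_task_runs(flow_run_id: UUID) -> None:
    async with get_client() as client:
        # Find task runs of the crashed flow run that are still reported as Running
        task_runs = await client.read_task_runs(
            flow_run_filter=FlowRunFilter(id=FlowRunFilterId(any_=[flow_run_id])),
            task_run_filter=TaskRunFilter(
                state=TaskRunFilterState(
                    type=TaskRunFilterStateType(any_=[StateType.RUNNING])
                )
            ),
        )
        for task_run in task_runs:
            # Force the transition so orchestration rules don't reject it
            await client.set_task_run_state(
                task_run_id=task_run.id,
                state=Crashed(message="Parent flow run crashed (Pod OOMKilled)"),
                force=True,
            )

# Placeholder flow run ID of the crashed run
asyncio.run(crash_stuck_task_runs(UUID("00000000-0000-0000-0000-000000000000")))

Still, having to script this by hand is only a workaround; ideally the task runs would be crashed automatically along with the flow run.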

I'll be more than happy to help troubleshoot this issue even further.

Thanks in advance!

Version info

Version:              3.6.4
API version:          0.8.4
Python version:       3.12.10
Git commit:           d3c3ed50
Built:                Fri, Nov 21, 2025 06:04 PM
OS/Arch:              darwin/arm64
Profile:              dev
Server type:          server
Pydantic version:     2.12.2
Server:
  Database:           sqlite
  SQLite version:     3.51.0
Integrations:
  prefect-docker:     0.6.6
  prefect-kubernetes: 0.6.5

Additional context

No response

simone201 avatar Dec 02 '25 11:12 simone201

The same is true when a run completes successfully: task runs are often still left in the "RUNNING" state.

vkrot-innio avatar Dec 03 '25 09:12 vkrot-innio

We’ve seen similar state inconsistencies cause a lot of downstream confusion.

Once execution state drifts from reality, retries and recovery tend to amplify the problem rather than resolve it.

In our case, being stricter about state transitions (and failing fast when invariants break) made the system much easier to reason about.

Zi-Ling avatar Dec 17 '25 12:12 Zi-Ling