Kubernetes Executor leaving pods in a "Running" state after a task is completed
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.9.2 and 2.8.3
What happened?
Using the Kubernetes Executor, a completed task will sometimes not shut down its pod, leaving it in a "Running" state. It happens randomly. The task log says "Task exited with return code 0", but the pod stays up. I tried upgrading to 2.9.2, but it still happens on that Airflow version as well.
What you think should happen instead?
Once a task is completed, the pod should be marked as completed and spun down.
How to reproduce
Really hard to reproduce, as it happens randomly and for no reason that I can tell, which is why I am posting this to get debugging help.
Operating System
Debian GNU/Linux 12 (bookworm)
Versions of Apache Airflow Providers
apache-airflow-providers-apache-kafka==1.5.0
apache-airflow-providers-cncf-kubernetes==8.3.3
apache-airflow-providers-common-io==1.3.2
apache-airflow-providers-common-sql==1.14.0
apache-airflow-providers-fab==1.1.1
apache-airflow-providers-ftp==3.9.1
apache-airflow-providers-http==4.11.1
apache-airflow-providers-imap==3.6.1
apache-airflow-providers-microsoft-azure==10.2.0
apache-airflow-providers-postgres==5.11.2
apache-airflow-providers-smtp==1.7.1
apache-airflow-providers-sqlite==3.8.1
Deployment
Official Apache Airflow Helm Chart
Deployment details
Using the official Airflow Helm chart to deploy Airflow to a Kubernetes namespace running Kubernetes 1.29. This first started happening on Airflow 2.8.3 with Helm chart 1.12.0. We then upgraded to 2.9.2 and Helm chart 1.14.0, and also upgraded Python from 3.8 to 3.12, and it is still happening.
Anything else?
Really appreciate any help or advice on how I can best debug this to see what is happening and how to fix it. I have tried a lot of things and have come up stumped.
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Can you check which containers are running? From what I have seen so far, most of the time when users report this issue they have an additional sidecar container that is the root cause of the pod not terminating.
I have 2 init containers that run when my job pods spin up, and they both show as completed, so I do not think it's that. I looked in my database (PostgreSQL) and there are no blocked queries showing up... I'm really not sure what is happening and don't know which logs to look at.
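For anyone who wants to make the same check, the per-container state of a stuck pod can be inspected roughly like this (a sketch using the official Kubernetes Python client; the pod name and namespace below are placeholders):

```python
from kubernetes import client, config

# Assumes a kubeconfig (or in-cluster config) with read access to the namespace.
config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholders: a task pod stuck in "Running" and its namespace.
pod = v1.read_namespaced_pod(name="DAGNAME-d859kywo", namespace="my-airflow-namespace")

print("Pod phase reported by the API server:", pod.status.phase)

# Init containers should all show a terminated state with exit code 0.
for cs in pod.status.init_container_statuses or []:
    print("init container", cs.name, "terminated:", cs.state.terminated)

# If any regular container (a sidecar, or the "base" task container) still shows
# a running state here, that is what keeps the pod phase at "Running".
for cs in pod.status.container_statuses or []:
    print("container", cs.name, "running:", cs.state.running, "terminated:", cs.state.terminated)
```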
To update: I went down to just one scheduler and it still didn't fix the issue. Does anybody have any debugging tips on which logs I should look at to see why the database sometimes isn't updated? This is causing my namespace to fill up and our DAGs to back up, so I would really appreciate some debugging help.
Another update for anybody who sees this: I looked at the pod and scheduler logs together in our Splunk instance, and this is what I found:
For a task that completes and closes normally, kubernetes_executor_utils logs that the pod has succeeded, like so:
2024-08-01T08:21:01.528695415-05:00 stdout F [2024-08-01T13:21:01.528+0000] {taskinstance.py:1206} INFO - Marking task as SUCCESS. dag_id=DAG_ID, task_id=delete_data, run_id=scheduled__2024-08-01T13:10:00+00:00, execution_date=20240801T131000, start_date=20240801T132059, end_date=20240801T132101
2024-08-01T08:21:01.587171193-05:00 stdout F [2024-08-01T13:21:01.586+0000] {local_task_job_runner.py:240} INFO - Task exited with return code 0
2024-08-01T08:21:01.611141571-05:00 stdout F [2024-08-01T13:21:01.610+0000] {taskinstance.py:3503} INFO - 1 downstream tasks scheduled from follow-on schedule check
2024-08-01T08:21:01.612627302-05:00 stdout F [2024-08-01T13:21:01.612+0000] {local_task_job_runner.py:222} INFO - ::endgroup::
2024-08-01T08:21:01.703212706-05:00 stdout F [2024-08-01T13:21:01.701+0000] {base.py:84} INFO - Using connection ID 'CONN' for task execution.
2024-08-01T08:21:01.924805315-05:00 stdout F [2024-08-01T13:21:01.923+0000] {get_token_mixin.py:104} INFO - ClientSecretCredential.get_token succeeded
2024-08-01T08:21:02.968331016-05:00 stdout F [2024-08-01T13:21:02.967+0000] {kubernetes_executor_utils.py:299} INFO - Event: POD_NAME is Running, annotations: <omitted>
2024-08-01T08:21:04.525216578-05:00 stdout F [2024-08-01T13:21:04.524+0000] {kubernetes_executor_utils.py:284} INFO - Event: POD_NAME Succeeded, annotations: <omitted>
For a task that stays in a "Running" state, kubernetes_executor_utils never reports that the pod succeeded and keeps reporting it as "Running", even though the task is marked as SUCCESS and exits with return code 0:
2024-07-31T23:50:25.538069838-05:00 stdout F [2024-08-01T04:50:25.537+0000] {base.py:294} INFO - Success criteria met. Exiting.
2024-07-31T23:50:25.553145453-05:00 stdout F [2024-08-01T04:50:25.552+0000] {taskinstance.py:1206} INFO - Marking task as SUCCESS. dag_id=DAG_ID, task_id=http_sensor_check, run_id=scheduled__2024-08-01T04:40:00+00:00, execution_date=20240801T044000, start_date=20240801T045025, end_date=20240801T045025
2024-07-31T23:50:25.578659388-05:00 stdout F [2024-08-01T04:50:25.577+0000] {local_task_job_runner.py:240} INFO - Task exited with return code 0
2024-07-31T23:50:25.592799182-05:00 stdout F [2024-08-01T04:50:25.592+0000] {local_task_job_runner.py:222} INFO - ::endgroup::
2024-07-31T23:50:25.683516859-05:00 stdout F [2024-08-01T04:50:25.681+0000] {base.py:84} INFO - Using connection ID 'CONN' for task execution.
2024-07-31T23:50:25.985346073-05:00 stdout F [2024-08-01T04:50:25.984+0000] {get_token_mixin.py:104} INFO - ClientSecretCredential.get_token succeeded
2024-07-31T23:52:30.827037969-05:00 stdout F [2024-08-01T04:52:30.826+0000] {kubernetes_executor_utils.py:299} INFO - Event: POD_NAME is Running, annotations: <omitted>
(This last message repeats over and over.)
So my question is: why does kubernetes_executor_utils randomly fail to mark the task as succeeded? @eladkal Any tips on further debugging steps I can take? I really don't understand why this happens randomly; with only one scheduler running, I wouldn't expect it to be a database problem.
Back to comment on my current situation. After looking at tons of logs and adjusting many settings (turning off remote logging, going to one scheduler, changing the pgbouncer metadataPoolSize and resultBackendPoolSize, tweaking all the scheduler tunables, and changing parallelism and parsing processes), my issue still persists.
I have turned on DEBUG logging and looked at the logs that go into our Splunk instance, and this is what I can find:
2024-08-05T15:02:29.338627294-05:00 stdout F [2024-08-05T20:02:29.338+0000] {taskinstance.py:1206} INFO - Marking task as SUCCESS. dag_id=DAGID, task_id=send_email, run_id=scheduled__2024-08-05T19:00:00+00:00, execution_date=20240805T190000, start_date=20240805T200229, end_date=20240805T200229
2024-08-05T15:02:29.338882785-05:00 stdout F [2024-08-05T20:02:29.338+0000] {taskinstance.py:562} DEBUG - Task Duration set to 0.308998
2024-08-05T15:02:29.349812594-05:00 stdout F [2024-08-05T20:02:29.349+0000] {cli_action_loggers.py:88} DEBUG - Calling callbacks: []
2024-08-05T15:02:29.400439168-05:00 stdout F [2024-08-05T20:02:29.399+0000] {local_task_job_runner.py:240} INFO - Task exited with return code 0
2024-08-05T15:02:29.421373484-05:00 stdout F [2024-08-05T20:02:29.420+0000] {dagrun.py:931} DEBUG - number of tis tasks for <DagRunDAGID @ 2024-08-05 19:00:00+00:00: scheduled__2024-08-05T19:00:00+00:00, state:running, queued_at: 2024-08-05 20:00:00.329940+00:00. externally triggered: False>: 0 task(s)
2024-08-05T15:02:29.421554705-05:00 stdout F [2024-08-05T20:02:29.421+0000] {taskinstance.py:3503} INFO - 0 downstream tasks scheduled from follow-on schedule check
2024-08-05T15:02:29.423594544-05:00 stdout F [2024-08-05T20:02:29.423+0000] {local_task_job_runner.py:222} INFO - ::endgroup::
As you can see from the logs, the task is marked as a success and exits with return code 0. It then uploads its logs to remote storage and disposes of the DB connection pool:
2024-08-05T15:02:29.964711666-05:00 stdout F [2024-08-05T20:02:29.964+0000] {connectionpool.py:546} DEBUG - BLOBURL "PUT /airflow-logs/wasb-airflow-logs/dag_id%3DDAGID/run_id%3Dscheduled__2024-08-05T19%3A00%3A00%2B00%3A00/task_id%3Dsend_email/attempt%3D1.log HTTP/11" 201 0
2024-08-05T15:02:29.965739791-05:00 stdout F [2024-08-05T20:02:29.965+0000] {cli_action_loggers.py:88} DEBUG - Calling callbacks: []
2024-08-05T15:02:29.969201556-05:00 stdout F [2024-08-05T20:02:29.968+0000] {settings.py:363} DEBUG - Disposing DB connection pool (PID 7)
Once this has happened the pod should complete and be cleaned up, but it doesn't. Looking at the scheduler logs right after, this is what happens:
2024-08-05T15:03:03.773641728-05:00 stdout F [2024-08-05T20:03:03.773+0000] {kubernetes_executor_utils.py:163} DEBUG - Event: DAGNAME-d859kywo had an event of type ADDED
2024-08-05T15:03:03.773756004-05:00 stdout F [2024-08-05T20:03:03.773+0000] {kubernetes_executor_utils.py:299} INFO - Event: DAGNAME-d859kywo is Running, annotations: <omitted>
2024-08-05T15:03:35.038948792-05:00 stdout F [2024-08-05T20:03:35.038+0000] {kubernetes_executor_utils.py:163} DEBUG - Event: DAGNAME-d859kywo had an event of type ADDED
2024-08-05T15:03:35.039130315-05:00 stdout F [2024-08-05T20:03:35.038+0000] {kubernetes_executor_utils.py:299} INFO - Event: DAGNAME-d859kywo is Running, annotations: <omitted>
This repeats over and over and the pod is never closed. For a pod that successfully closes, the scheduler logs look like this instead:
2024-08-05T07:54:45.288518222-05:00 stdout F [2024-08-05T12:54:45.287+0000] {settings.py:363} DEBUG - Disposing DB connection pool (PID 7)
2024-08-05T07:54:47.023200045-05:00 stdout F [2024-08-05T12:54:47.022+0000] {kubernetes_executor_utils.py:163} DEBUG - Event: DAGNAME-0bsziyyj had an event of type MODIFIED
2024-08-05T07:54:47.023349257-05:00 stdout F [2024-08-05T12:54:47.023+0000] {kubernetes_executor_utils.py:299} INFO - Event: DAGNAME-0bsziyyj is Running, annotations: <omitted>
2024-08-05T07:54:48.321534896-05:00 stdout F [2024-08-05T12:54:48.321+0000] {kubernetes_executor_utils.py:163} DEBUG - Event: DAGNAME-0bsziyyj had an event of type MODIFIED
2024-08-05T07:54:48.32176453-05:00 stdout F [2024-08-05T12:54:48.321+0000] {kubernetes_executor_utils.py:284} INFO - Event: DAGNAME-0bsziyyj Succeeded, annotations: <omitted>
2024-08-05T07:54:48.435163144-05:00 stdout F [2024-08-05T12:54:48.434+0000] {kubernetes_executor_utils.py:478} DEBUG - Processing task ('DAGNAME-0bsziyyj', 'eis-dw-dev-k8s', None, {'dag_id': 'DAGNAME', 'task_id': 'send_email', 'execution_date': None, 'run_id': 'scheduled__2024-08-05T12:40:00+00:00', 'try_number': '1'}, '930001253')
2024-08-05T07:54:48.435228658-05:00 stdout F [2024-08-05T12:54:48.435+0000] {kubernetes_executor_utils.py:486} DEBUG - Attempting to finish pod; pod_name: DAGNAME-0bsziyyj; state: None; annotations: <omitted>
2024-08-05T07:54:48.435486516-05:00 stdout F [2024-08-05T12:54:48.435+0000] {kubernetes_executor_utils.py:494} DEBUG - finishing job TaskInstanceKey(dag_id='DAGNAME', task_id='send_email', run_id='scheduled__2024-08-05T12:40:00+00:00', try_number=1, map_index=-1) - None (DAGNAME-0bsziyyj)
2024-08-05T07:54:48.436378934-05:00 stdout F [2024-08-05T12:54:48.436+0000] {kubernetes_executor.py:370} INFO - Changing state of (TaskInstanceKey(dag_id='DAGNAME', task_id='send_email', run_id='scheduled__2024-08-05T12:40:00+00:00', try_number=1, map_index=-1), None, 'DAGNAME-0bsziyyj', 'NAMESPACE', '930001253') to None
2024-08-05T07:54:48.436517827-05:00 stdout F [2024-08-05T12:54:48.436+0000] {kubernetes_executor_utils.py:442} DEBUG - Deleting pod DAGNAME-0bsziyyj in namespace NAMESPACE
So it seems like kubernetes_executor_utils is watching the completed pod but, for some reason, never sees a "Succeeded" state, even though the task logs that it is being marked as SUCCESS and it also shows up as succeeded in the Airflow UI.
I will try to keep digging but am really unsure where to look at this point.
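One way to check whether the API server itself ever reports the pod as Succeeded (i.e., whether the problem sits below Airflow) is to watch the pod directly, roughly the way the executor's KubeJobWatcher does. A rough sketch with the Kubernetes Python client, using placeholder pod and namespace names:

```python
from kubernetes import client, config, watch

# Assumes kubeconfig/in-cluster access to the Airflow namespace.
config.load_kube_config()
v1 = client.CoreV1Api()

pod_name = "DAGNAME-d859kywo"       # placeholder: a pod stuck in "Running"
namespace = "my-airflow-namespace"  # placeholder

# Stream pod events much like the executor's watcher does and print the phase
# the API server reports. If "Succeeded" never shows up here, the kubelet never
# marked the container as finished and the executor is just reflecting that.
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod,
                      namespace=namespace,
                      field_selector=f"metadata.name={pod_name}",
                      timeout_seconds=120):
    pod = event["object"]
    print(event["type"], pod.metadata.name, pod.status.phase)
```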
@ahipp13 Can you run the following command once the task has succeeded and share the result? `kubectl describe pod <pod_name>`
In my case, I am using the CeleryKubernetesExecutor, and tasks running on the KubernetesExecutor are hitting the same issue.
The airflow-scheduler log in my case:
[2024-08-26T09:35:30.915+0000] {retries.py:91} DEBUG - Running SchedulerJobRunner._create_dagruns_for_dags with retries. Try 1 of 3
[2024-08-26T09:35:30.925+0000] {retries.py:91} DEBUG - Running SchedulerJobRunner._get_next_dagruns_to_examine with retries. Try 1 of 3
[2024-08-26T09:35:30.931+0000] {retries.py:91} DEBUG - Running SchedulerJobRunner._get_next_dagruns_to_examine with retries. Try 1 of 3
[2024-08-26T09:35:30.936+0000] {retries.py:91} DEBUG - Running SchedulerJobRunner._schedule_all_dag_runs with retries. Try 1 of 3
[2024-08-26T09:35:30.948+0000] {scheduler_job_runner.py:412} DEBUG - No tasks to consider for execution.
[2024-08-26T09:35:30.950+0000] {base_executor.py:222} DEBUG - 0 running task instances
[2024-08-26T09:35:30.950+0000] {base_executor.py:223} DEBUG - 0 in queue
[2024-08-26T09:35:30.951+0000] {base_executor.py:224} DEBUG - 8 open slots
[2024-08-26T09:35:30.952+0000] {base_executor.py:243} DEBUG - Calling the <class 'airflow.providers.celery.executors.celery_executor.CeleryExecutor'> sync method
[2024-08-26T09:35:30.952+0000] {celery_executor.py:325} DEBUG - No task to query celery, skipping sync
[2024-08-26T09:35:30.952+0000] {base_executor.py:222} DEBUG - 1 running task instances
[2024-08-26T09:35:30.952+0000] {base_executor.py:223} DEBUG - 0 in queue
[2024-08-26T09:35:30.952+0000] {base_executor.py:224} DEBUG - 7 open slots
[2024-08-26T09:35:30.953+0000] {base_executor.py:243} DEBUG - Calling the <class 'airflow.providers.cncf.kubernetes.executors.kubernetes_executor.KubernetesExecutor'> sync method
[2024-08-26T09:35:30.953+0000] {kubernetes_executor.py:367} DEBUG - self.running: {TaskInstanceKey(dag_id='my_dag', task_id='my-operator', run_id='manual__2024-08-26T16:20:48+07:00', try_number=1, map_index=-1)}
[2024-08-26T09:35:30.954+0000] {kubernetes_executor_utils.py:475} DEBUG - Syncing KubernetesExecutor
[2024-08-26T09:35:30.954+0000] {kubernetes_executor_utils.py:383} DEBUG - KubeJobWatcher for namespace nprod-green-lab-airflow alive, continuing
[2024-08-26T09:35:30.955+0000] {kubernetes_executor.py:454} DEBUG - Next timed event is in 4.329895
[2024-08-26T09:35:30.956+0000] {retries.py:91} DEBUG - Running Job._fetch_from_db with retries. Try 1 of 3
[2024-08-26T09:35:30.959+0000] {retries.py:91} DEBUG - Running Job._update_heartbeat with retries. Try 1 of 3
[2024-08-26T09:35:30.962+0000] {job.py:214} DEBUG - [heartbeat]
[2024-08-26T09:35:30.963+0000] {event_scheduler.py:39} DEBUG - Calling <bound method SchedulerJobRunner._emit_pool_metrics of <airflow.jobs.scheduler_job_runner.SchedulerJobRunner object at 0x7f4dd128a400>>
[2024-08-26T09:35:30.969+0000] {scheduler_job_runner.py:990} DEBUG - Next timed event is in 4.482330
[2024-08-26T09:35:30.969+0000] {scheduler_job_runner.py:992} DEBUG - Ran scheduling loop in 0.05 seconds
[2024-08-26T09:35:31.971+0000] {retries.py:91} DEBUG - Running SchedulerJobRunner._create_dagruns_for_dags with retries. Try 1 of 3
[2024-08-26T09:35:31.983+0000] {retries.py:91} DEBUG - Running SchedulerJobRunner._get_next_dagruns_to_examine with retries. Try 1 of 3
[2024-08-26T09:35:31.994+0000] {retries.py:91} DEBUG - Running SchedulerJobRunner._get_next_dagruns_to_examine with retries. Try 1 of 3
[2024-08-26T09:35:31.999+0000] {retries.py:91} DEBUG - Running SchedulerJobRunner._schedule_all_dag_runs with retries. Try 1 of 3
@dirrao When I ran the describe command it looked normal except for the events, which showed this:
Normal Created 31m kubelet Created container base
Normal Started 31m kubelet Started container base
Warning Unhealthy 30m kubelet Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "0a65c5a2dfde46a83b5b1130ba705de7d780e4fcc5d90596cf06db91e1d3a449": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning Unhealthy 30m kubelet Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4939253e0a46ef39954ab8fb8ff4b7d35578767315a0fb40c0e053eb00425671": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning Unhealthy 30m kubelet Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "a75eb22e87e74157394f80e745ade7a992050d4f6035077a1558b582b3385f00": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning Unhealthy 30m kubelet Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "9465cef7fcaeb162a70125714d1e109582282c01246f8f8ca3c9ebd4a4770b0e": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning Unhealthy 30m kubelet Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "af405bb2320967255325cf2ea82cf34e07506f4b9b0b1f35e4223682ce679613": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning Unhealthy 30m kubelet Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "777cf5acd0622ab30260db58730cffd89f0ab93be8cba8767659e5f0de03027e": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning Unhealthy 30m kubelet Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "a79560a2e579ab41c38a065fcb170f416c11ebc4e5adb9bfd273b0277d05d8da": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning Unhealthy 30m kubelet Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "eb614b6e172250c3da62d401901227660798bc4b53124f0a0dbd9b1f73c93f65": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning Unhealthy 30m kubelet Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4c4bbee388a9d95721eb1ee4cd1f2dae2d000305238cd7fdb5ce8077679f6ed7": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
Warning Unhealthy 77s (x704 over 30m) kubelet (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "48ef51c25c4b02b9e2c93fd40c738d7cb36904d85bf4fa152d46eba47f879322": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
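To pull only these events programmatically instead of scanning the kubectl describe output, something like the following should work (a sketch with the official Kubernetes Python client; the names are placeholders):

```python
from kubernetes import client, config

# Assumes kubeconfig/in-cluster access; pod name and namespace are placeholders.
config.load_kube_config()
v1 = client.CoreV1Api()

pod_name = "DAGNAME-d859kywo"
namespace = "my-airflow-namespace"

# List only the warning events for this pod, e.g. the repeated
# "Liveness/Readiness probe errored ... cannot exec in a stopped container" ones.
events = v1.list_namespaced_event(
    namespace,
    field_selector=f"involvedObject.name={pod_name},type=Warning",
)
for e in events.items:
    print(e.last_timestamp, e.reason, e.count, e.message)
```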
@databius and all those looking at this in the future: I talked with the team that runs the Kubernetes cluster I am on, and they said there is a bug with exec probe failures:
" So we think you may be running into this bug: https://github.com/kubernetes/kubernetes/issues/122591
For troubleshooting the exec probe failure, here's a link from Google: https://cloud.google.com/kubernetes-engine/docs/troubleshooting/container-runtime#exec-probe-timeout
Here's the important piece of that "On containerd images, probe results returned after the declared timeoutSeconds threshold are ignored."
So what we think is happening, is that the container is completing because it executes the command + args supplied to it, but this bug with liveness exec command thinks that the container is still running and ready, hence the pod not being marked as complete. "
I am required to have liveness and readiness probes on all pods, so I have probes on my DAG containers. What I did was increase their timeouts and intervals like this:
```yaml
livenessProbe:
  exec:
    command:
      - cat
      - /etc/os-release
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 600
readinessProbe:
  exec:
    command:
      - cat
      - /etc/os-release
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 600
```
And now we rarely see one of our pods do this. If you don't have liveness or readiness probes on your DAG pods I'm not sure this applies, but this is what fixed it on our end.
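If you would rather set the probes per task than bake them into the image or a shared pod template, the same relaxed settings can be passed through the executor's pod_override. A rough sketch (not our exact configuration; the DAG and task below are placeholders):

```python
from datetime import datetime

from kubernetes.client import models as k8s
from airflow import DAG
from airflow.operators.bash import BashOperator

# Relaxed exec probes for the "base" task container, mirroring the YAML above.
probe = k8s.V1Probe(
    _exec=k8s.V1ExecAction(command=["cat", "/etc/os-release"]),
    initial_delay_seconds=60,
    period_seconds=60,
    timeout_seconds=600,
)

with DAG(dag_id="probe_override_example", start_date=datetime(2024, 8, 1), schedule=None):
    BashOperator(
        task_id="send_email",  # placeholder task
        bash_command="echo done",
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # the container the KubernetesExecutor runs the task in
                            liveness_probe=probe,
                            readiness_probe=probe,
                        )
                    ]
                )
            )
        },
    )
```

The same probe block can also go into the pod_template_file used by the KubernetesExecutor if it should apply to every task pod.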
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
This is very similar to an issue we are facing. We have pods that are left in a "Running" state even though the container has exited. This causes subsequent jobs to not start on time and to back up.
We also have an exec check for liveness.
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received response from the issue author.