Kubernetes is not reporting worker status back to Airflow
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.8.4
What happened?
Airflow schedules tasks to run on a Kubernetes cluster. However, for some reason, when a task completes its pod is not cleared out (as it normally would be). I am not sure whether the bug is on the Airflow side or in the apache-airflow-providers-cncf-kubernetes provider.
core.parallelism is set to 32.
The total number of tasks above (running and completed) is 32, which matches core.parallelism and our single default pool. Next, those 2 running jobs complete and no more tasks are running.
This breaks the system: Airflow still thinks the tasks are running even though they have completed. It looks like Kubernetes is not reporting the state back to Airflow, and the executor then runs out of open slots.
Tasks also queued up in the scheduled state and could not be promoted to the queued state.
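The slot exhaustion described above can be illustrated with a minimal sketch (illustrative only; the names and structure here are hypothetical, not the KubernetesExecutor's actual code). The assumption is that the executor frees a slot only when it processes a pod completion event, so every missed event leaks a slot:

```python
# Illustrative slot accounting, NOT the KubernetesExecutor's actual code.
# Assumption: a slot is freed only when a completion event is processed,
# so a missed event leaks the slot permanently.

PARALLELISM = 32  # core.parallelism from our config


def open_slots(parallelism: int, tracked_running: set) -> int:
    """Slots the scheduler believes are free."""
    return parallelism - len(tracked_running)


# 32 tasks were launched, filling the pool.
tracked = {f"task_{i}" for i in range(PARALLELISM)}

# In the cluster, 30 of those pods have actually completed, but their
# completion events were never reported back, so the executor still
# tracks all 32 as running:
print(open_slots(PARALLELISM, tracked))  # 0 -> nothing new can be queued
```

With zero open slots, newly scheduled tasks can never be promoted to queued, matching the behavior above.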
At 10:24, the airflow-scheduler was restarted.
The marked execution shows that the airflow-scheduler catches up on those tasks after the restart. The 5th column displays the start date-time and the 6th the end date-time. From the graph one can see that the job usually takes up to 2 minutes, not an hour.
We enabled debug logs on the scheduler, so when it happens next time we will hopefully know more.
What you think should happen instead?
Airflow tasks should continue running.
How to reproduce
I could not reproduce the issue, but in the last 3 weeks it has happened 5 times on our production system. We suspect it started breaking when we upgraded apache-airflow-providers-cncf-kubernetes from 7.13.0 to 8.0.1, and it keeps breaking on 8.1.1 as well.
Probably related to: https://github.com/apache/airflow/issues/36998 https://github.com/apache/airflow/issues/33402
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon 8.20.0 Amazon integration (including Amazon Web Services (AWS)).
apache-airflow-providers-cncf-kubernetes 8.1.1 Kubernetes
apache-airflow-providers-common-io 1.3.0 Common IO Provider
apache-airflow-providers-common-sql 1.11.1 Common SQL Provider
apache-airflow-providers-databricks 6.2.0 Databricks
apache-airflow-providers-ftp 3.7.0 File Transfer Protocol (FTP)
apache-airflow-providers-github 2.5.1 GitHub
apache-airflow-providers-google 10.17.0 Google services including: - Google Ads - Google Cloud (GCP) - Google Firebase - Google LevelDB - Google Marketing Platform - Google Workspace (formerly Google Suite)
apache-airflow-providers-hashicorp 3.6.4 Hashicorp including Hashicorp Vault
apache-airflow-providers-http 4.10.0 Hypertext Transfer Protocol (HTTP)
apache-airflow-providers-imap 3.5.0 Internet Message Access Protocol (IMAP)
apache-airflow-providers-mysql 5.5.4 MySQL
apache-airflow-providers-postgres 5.10.2 PostgreSQL
apache-airflow-providers-sftp 4.9.1 SSH File Transfer Protocol (SFTP)
apache-airflow-providers-smtp 1.6.1 Simple Mail Transfer Protocol (SMTP)
apache-airflow-providers-snowflake 5.4.0 Snowflake
apache-airflow-providers-sqlite 3.7.1 SQLite
apache-airflow-providers-ssh 3.10.1 Secure Shell (SSH)
Deployment
Official Apache Airflow Helm Chart
Deployment details
We deployed Airflow on a Kubernetes cluster using the KubernetesExecutor setting in the Helm chart.
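For reference, a minimal values fragment for the official Apache Airflow Helm chart matching this deployment (illustrative; the real release contains many more settings):

```yaml
# values.yaml fragment for the official Apache Airflow Helm chart
# (illustrative; other settings omitted).
executor: "KubernetesExecutor"
config:
  core:
    # matches the parallelism mentioned in the report
    parallelism: 32
```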
Anything else?
No response
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I think this is a duplicate of #36998. BTW restarting the scheduler temporarily solves this.
I commented on that issue, but I'm not sure it's related: no one there mentions that the jobs have completed (apart from me). To me this one is more relevant -> https://github.com/apache/airflow/issues/33402
Using Airflow metrics, I also observed that while the issue is happening, two metrics that usually agree report different numbers:
At 12:17 the airflow-scheduler was restarted.
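The metric divergence can be expressed as a simple cross-check (a sketch under the assumption that the executor's open-slots metric and the actual cluster pod count are both available; the helper function is hypothetical):

```python
# Hypothetical helper: cross-check the executor's slot metric against the
# number of pods actually running in the cluster. If the executor had
# processed every completion event, these two views would agree.


def leaked_slots(parallelism: int, open_slots_metric: int, pods_running: int) -> int:
    """How many slots the executor holds for tasks that no longer run."""
    executor_thinks_running = parallelism - open_slots_metric
    return executor_thinks_running - pods_running


# Our observation: open slots stuck at 0 while only 2 pods were running.
print(leaked_slots(32, 0, 2))  # 30 slots held by already-finished tasks
```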
@aru-trackunit are you able to reproduce? Are you sure this isn't #36998? If it is, this patch might resolve it.
This issue is related to the watcher not being able to scale and process events on time, which leads to many completed pods accumulating over time. Related: https://github.com/apache/airflow/issues/22612
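A simplified model of the watcher's job (illustrative, not the provider's actual implementation): consume pod watch events and release tasks whose pods reached a terminal phase. If the watcher falls behind the event stream, completed pods pile up untracked:

```python
# Simplified model of a pod-event watcher, NOT the provider's code.
# Each event is a dict shaped roughly like a Kubernetes watch event.

TERMINAL_PHASES = {"Succeeded", "Failed"}


def process_event(event: dict, running: set) -> set:
    """Return the updated set of task pods still considered running."""
    name = event["object"]["metadata"]["name"]
    phase = event["object"]["status"]["phase"]
    if phase in TERMINAL_PHASES:
        return running - {name}  # free the slot for this task
    return running


running = {"task-a", "task-b"}
event = {"object": {"metadata": {"name": "task-a"},
                    "status": {"phase": "Succeeded"}}}
print(process_event(event, running))  # {'task-b'}
```

If events like the one above are never consumed, `running` never shrinks, which matches the leaked-slot behavior reported here.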
@aru-trackunit, can you try with k8s provider 8.3.0? It contains #39551, which will hopefully solve your issue.
Deploying 8.3.0; let's see if the error comes up again.
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received response from the issue author.