KubernetesPodOperator/KubernetesExecutor: Failed to adopt pod 422
Apache Airflow version
2.3.0
What happened
While a long-running KubernetesPodOperator task was executing, the main Airflow pod was deleted and recreated by Kubernetes. The worker pod kept running to completion and became an orphan: its Kubernetes status eventually showed Completed, but it was never cleaned up. When the DAG was run again from the new main Airflow pod, the scheduler logged "Failed to adopt pod" with an HTTP 422 error, and the orphaned worker pod remained in the namespace.
Full reproduction steps are in the "How to reproduce" section below.
What you think should happen instead
On previous versions of Airflow (e.g. 1.10.x), orphaned worker pods were adopted by the restarted Airflow main app and either used to continue the same DAG run or cleared away once complete.
This no longer happens on the newer Airflow 2.1.4 / 2.3.0 (presumably because the code changed): on restart, the main app appears to try to adopt the worker pod but fails at that point ("Failed to adopt pod" in the logs), and so it cannot clear away orphaned pods.
Given this is only an edge case (we would not normally expect Kubernetes to recycle the main Airflow pod), it does not seem totally urgent. The only reason I am raising it is that in any Kubernetes namespace, particularly in production, the namespace will slowly fill up with orphaned pods over time (e.g. after a month), and somebody has to log in and delete the old pods manually.
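As a stop-gap, the orphaned Completed pods can be cleaned up with something like the sketch below. The namespace and label selector are placeholders and only an assumption about how the worker pods are labelled; verify against your own pods before running it.

```python
# Sketch: delete Completed (Succeeded) worker pods left behind in the namespace.
# NAMESPACE and LABEL_SELECTOR are placeholders -- adjust them to your deployment.
from kubernetes import client, config

NAMESPACE = "my-airflow-namespace"   # assumption: your Airflow namespace
LABEL_SELECTOR = "dag_id"            # assumption: a label present on the worker pods

config.load_incluster_config()       # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    NAMESPACE,
    label_selector=LABEL_SELECTOR,
    field_selector="status.phase=Succeeded",
)
for pod in pods.items:
    print(f"deleting orphaned pod {pod.metadata.name}")
    v1.delete_namespaced_pod(name=pod.metadata.name, namespace=NAMESPACE)
```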
How to reproduce
Steps to reproduce the "Failed to adopt pod" error condition. The DAG step described below should be of type KubernetesPodOperator.
NOTE: under normal operation, where the main Airflow pod is never recycled by Kubernetes, this edge case never appears. It only occurs when a worker pod is still running and the main Airflow pod is suddenly restarted or stopped, leaving the worker pod orphaned.
1. Implement a contrived DAG with a single long-running step (e.g. 6 minutes).
2. Deploy Airflow 2.1.4 / 2.3.0 together with the contrived DAG.
3. Run the contrived DAG.
4. While the single step is running, check via kubectl that the Kubernetes worker pod has been created and is running (an API-based check is sketched after these steps).
5. While the worker pod is still running, run `kubectl delete pod <OF_MAIN_AIRFLOW_POD>`. The worker pod is now an orphan.
6. The worker pod continues to run through to completion, after which its Kubernetes status is Completed; however, the pod is never cleaned up.
7. Start a new <MAIN_AIRFLOW_POD> via kubectl so the web UI is running again.
8. From the web UI of the new main Airflow pod, run the contrived DAG again.
9. While the contrived DAG is starting, the logs print "Failed to adopt pod" with a 422 error code.
The error message from step 9 appears in two places in the Airflow 2.1.4 / 2.3.0 source code. The general logging from the main app at step 7 may also output the "Failed to adopt pod" message.
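For steps 4 and 6, the worker pod's state can also be checked programmatically. A minimal sketch, equivalent to `kubectl get pods` (the namespace is a placeholder):

```python
# Sketch: print the name and phase of every pod in the namespace, to confirm the
# worker pod keeps running (and later shows Succeeded/Completed) after the main
# Airflow pod has been deleted.
from kubernetes import client, config

NAMESPACE = "my-airflow-namespace"   # placeholder

config.load_kube_config()            # run from a machine with kubectl access
v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(NAMESPACE).items:
    print(pod.metadata.name, pod.status.phase)
```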
Operating System
kubernetes
Versions of Apache Airflow Providers
No response
Deployment
Other 3rd-party Helm chart
Deployment details
Nothing special.
The CI/CD pipeline builds the app, using requirements.txt to pull in all required Python dependencies (including a dependency on Airflow 2.1.4 / 2.3.0).
The CI/CD pipeline then packages the app as an ECR image and deploys it directly to the Kubernetes namespace.
Anything else
This is 100% reproducible every time; I have tested it multiple times.
I also tested this on the old Airflow 1.10.x a couple of times to verify that the bug did not exist previously.
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for opening your first issue here! Be sure to follow the issue template!
Sounds like an interesting case to look at @dstandish :)
@BillSullivan2020 @dstandish
We were facing this after (finally) upgrading from 1.10.15 to 2.0.2. We ended up finding out that the root cause was duplicate environment variables in the worker pod definition. We checked the K8s API server logs, and that was one of the messages around the 422 error. After tidying up our Helm values, pods could be adopted properly.
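A quick way to check a worker pod for duplicate environment variable names (just a sketch; the pod name and namespace are placeholders):

```python
# Sketch: flag duplicate env var names in a pod's containers, which can make
# the adoption PATCH fail validation with a 422.
from collections import Counter
from kubernetes import client, config

POD_NAME = "my-worker-pod"           # placeholder
NAMESPACE = "my-airflow-namespace"   # placeholder

config.load_kube_config()
v1 = client.CoreV1Api()
pod = v1.read_namespaced_pod(name=POD_NAME, namespace=NAMESPACE)

for container in pod.spec.containers:
    names = [env.name for env in (container.env or [])]
    duplicates = [name for name, count in Counter(names).items() if count > 1]
    if duplicates:
        print(f"container {container.name}: duplicate env vars {duplicates}")
```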
We do get some transient errors where some pods don't get adopted with this error code:
ERROR - attempting to adopt task <TASK_ID> in dag <DAG_ID> which was not specified by database
Hi Keith, thanks for letting me know about the duplicate pod metadata; if this helps us resolve the issue, that would be great! In the next few days I will re-create the issue and check this point, and I will provide an update soon.
Hello. I put this issue aside and stopped working on it; now I have allocated some time again. I have some feedback and a question. Duplicate metadata was a plausible cause, but I now have the simplest possible DAG that still reproduces the problem. I will show you:
- I upgraded to Airflow 2.3.3.
- I wrote the simplest DAG possible (it only runs a shell command with a delay).
- Same approach as before: while the worker pod is running, restart the main Airflow pod, and the error message appears; the worker pod seems to be orphaned.
Simplified DAG
```python
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

airflow_adopt_pod_test_dag = DAG(
    dag_id='airflow_adopt_pod_test_dag',
    default_args=args,
    catchup=False,
    schedule_interval='5 4 * * *',
)

task = KubernetesPodOperator(
    namespace=OUR_CORRECT_NAME,
    service_account_name=OUR_CORRECT_NAME,
    image=OUR_CORRECT_INTERNAL_ECR_REPO_WITH_ALPINE_IMAGE,
    cmds=["sh", "-c", "mkdir -p /airflow/xcom/; sleep 600; echo '[1,2,3,4]' > /airflow/xcom/return.json"],
    name="write-xcom",
    do_xcom_push=True,
    is_delete_operator_pod=True,
    in_cluster=True,
    task_id="write-xcom",
    get_logs=True,
    dag=airflow_adopt_pod_test_dag,
)
```
The sleep 600 means the pod sticks around for some minutes before it becomes Completed.
```
[2022-08-10 11:25:53,796] {kubernetes_executor.py:729} INFO - attempting to adopt pod airflowadoptpodtestdagwritexco-a27600cf4cf84f5b92a985f7e086057d
[2022-08-10 11:25:53,810] {kubernetes_executor.py:745} INFO - Failed to adopt pod airflowadoptpodtestdagwritexco-a27600cf4cf84f5b92a985f7e086057d. Reason: (422)
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"airflowadoptpodtestdagwritexco-a27600cf4cf84f5b92a985f7e086057d\" is invalid: spec: Forbidden: pod updates may not change fields other than
```
- The Forbidden error message comes from Kubernetes. If you search online you can see that for a given pod, some of its fields are read-only and some are modifiable.
- The Airflow source code uses the kubernetes client library underneath.
- The Airflow source code at the "Failed to adopt pod" part is:
```python
kube_client.patch_namespaced_pod(
    name=pod.metadata.name,
    namespace=pod.metadata.namespace,
    body=PodGenerator.serialize_pod(pod),
)
```
- Now, this function is provided by the kubernetes client library, where we see that:
```python
return self.api_client.call_api(
    '/api/v1/namespaces/{namespace}/pods/{name}', 'PATCH',
    path_params,
    query_params,
    header_params,
    body=body_params,
    post_params=form_params,
    files=local_var_files,
    response_type='V1Pod',  # noqa: E501
    auth_settings=auth_settings,
    async_req=local_var_params.get('async_req'),
    _return_http_data_only=local_var_params.get('_return_http_data_only'),  # noqa: E501
    _preload_content=local_var_params.get('_preload_content', True),
    _request_timeout=local_var_params.get('_request_timeout'),
    collection_formats=collection_formats)
```
See that it does a PATCH, i.e. it asks Kubernetes to modify/merge/update that particular pod. And, as I explained, some of a pod's fields are read-only in Kubernetes; hence the error condition outlined above.
So my question is: do we have any way to solve this? Also, why are we patching a pod in order to adopt it? To me, it seems like a bug in the kubernetes client library. The side effect is that we can end up with a growing number of orphaned pods in the Kubernetes namespace.
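For what it's worth, Kubernetes does accept a PATCH that touches only metadata such as labels on a running pod, while spec changes are rejected with the Forbidden error above. A minimal sketch, assuming adoption only needs to relabel the pod (the `airflow-worker` label and its value here are assumptions, and the pod name and namespace are placeholders; this is an illustration, not necessarily how Airflow will fix it):

```python
# Sketch: a metadata-only (labels) patch is accepted on a running pod, whereas
# patching the full serialized pod can trip spec validation and return a 422.
from kubernetes import client, config

POD_NAME = "my-orphaned-worker-pod"   # placeholder
NAMESPACE = "my-airflow-namespace"    # placeholder

config.load_kube_config()
v1 = client.CoreV1Api()
v1.patch_namespaced_pod(
    name=POD_NAME,
    namespace=NAMESPACE,
    # assumption: adoption only needs to point the label at the new scheduler job
    body={"metadata": {"labels": {"airflow-worker": "new-scheduler-job-id"}}},
)
```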
I'll monitor this (the orphaned pods) over the coming few days.
I totally appreciate that everyone is busy. Please can you provide an update.
Before someone can look at it, can you please upgrade to 2.3.4 and see if it is fixed? Yes, we were all busy.
This does not seem like a widespread issue, so it is likely something environmental on your side; that is probably why you have not heard from anyone (and it might take some time). This is software you get for free, and people here respond in their free time, so if something does not look like a problem affecting everyone, it may get very low priority from the maintainers, in the hope that other users will provide information or that the reporter will find the problem on their own. If you really need a solution because your business depends on it, hiring someone who can help solve the issue might be a more reliable way of getting help (especially if the issue is related to your environment).
(This is just to explain what expectations you might have for answers here.)
BTW, are you sure you upgraded properly and used constraints? https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html - this is the only way Airflow should be installed. You seem to suggest that this is a problem with the kubernetes library, so maybe that is the problem you have?
Just guessing - but did you check that you have the right version of the library and followed the installation process properly?
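For instance, a quick way to see which versions are actually installed, so they can be compared against the constraints file for your Airflow/Python version (just a sketch; verify the package names against your environment):

```python
# Sketch: print installed versions of the relevant packages to compare against
# the Airflow constraints file for your Airflow/Python version.
import importlib.metadata

for package in ("apache-airflow", "apache-airflow-providers-cncf-kubernetes", "kubernetes"):
    try:
        print(package, importlib.metadata.version(package))
    except importlib.metadata.PackageNotFoundError:
        print(package, "not installed")
```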