Enable kubernetes_pod_operator to reattach_on_restart when the worker dies

Open yeachan153 opened this issue 3 years ago • 3 comments

Description

The kubernetes_pod_operator currently has a reattach_on_restart parameter that attempts to reattach to a running pod, instead of creating a new one, if the scheduler dies while the task is running.

We would like this feature to work when the worker dies as well. Currently, a dying worker receives a SIGTERM, which triggers the on_kill method: https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/models/taskinstance.py#L1425

This ends up deleting the pod that was created: https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py#L438

We currently work around this problem by removing the on_kill call upon receiving a SIGTERM and pushing an XCom indicating that the worker was killed. We then enabled retries for the kubernetes_pod_operator and modified the is_eligible_to_retry function to check for the presence of this XCom and retry only if it is found, so that we retry only when the worker was killed.
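The workaround above can be sketched roughly as follows. This is plain Python rather than the actual Airflow patch: FakeTaskInstance, handle_sigterm, and the worker_killed key are hypothetical names used only for illustration, and a dict stands in for XCom storage.

```python
WORKER_KILLED_KEY = "worker_killed"  # hypothetical XCom key, not an Airflow constant


class FakeTaskInstance:
    """Simplified stand-in for a task instance; not the real Airflow class."""

    def __init__(self, max_tries):
        self.max_tries = max_tries
        self.try_number = 1
        self.xcom = {}            # simulates XCom storage
        self.pod_deleted = False  # simulates the state of the launched pod

    def on_kill(self):
        # In the real operator, this is the step that deletes the pod.
        self.pod_deleted = True

    def handle_sigterm(self, signum, frame):
        # Workaround: do NOT call on_kill, so the pod keeps running,
        # and leave a marker saying the worker was killed.
        self.xcom[WORKER_KILLED_KEY] = True

    def is_eligible_to_retry(self):
        # Retry only if tries remain AND the marker is present.
        has_tries_left = self.try_number <= self.max_tries
        return has_tries_left and self.xcom.get(WORKER_KILLED_KEY, False)
```

With this shape, an ordinary pod failure leaves no marker and is not retried, while a SIGTERM leaves the pod alive and makes the task eligible for a retry that can reattach.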

Unfortunately, this is not a perfect solution, because clearing or stopping a task via the UI triggers the same signal handler as when a worker is killed externally. With this workaround, stopping the task via the UI no longer kills the pod, and clearing the task via the UI causes a reattach when we would ideally like a restart.

Use case/motivation

Since the pod itself may fail for a valid reason, we don't just want to add more retries. In that situation, a retry would not reattach anyway; it would start a completely new pod, since the original pod would have been cleaned up.

We specifically want the reattaching to happen when the worker dies for infrastructure-related reasons. This is useful, for instance, during deployment updates in Kubernetes. It is currently quite a disruptive process, because all the running pods are killed first, and if retries are not enabled (for the reasons mentioned above), we have to restart all of them again, potentially losing all the progress on any expensive operations that were running pre-deployment.

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

yeachan153 avatar Mar 01 '22 15:03 yeachan153

Do you have a proposal to change the behaviour? Opening a PR for that would be useful. Airflow has ~2000 contributors, so you can become one of them. How do you think it can be improved?

potiuk avatar Mar 06 '22 22:03 potiuk

@yeachan153 Did you ever solve this problem? We would love to be able to keep pods running during environment restarts, and it looks like your idea might work.

wircho avatar Mar 22 '24 20:03 wircho

@wircho Increasing the termination_grace_period should help mitigate this issue.

paramjeet01 avatar May 20 '24 08:05 paramjeet01

Did you find a solution to this issue using KubernetesPodOperator parameters?

wenceslas-sanchez avatar Jun 22 '24 19:06 wenceslas-sanchez

We tried termination_grace_period. Curious if anyone has any other solutions for workers reconnecting to the running pod?

troyharvey avatar Oct 30 '24 20:10 troyharvey