airflow icon indicating copy to clipboard operation
airflow copied to clipboard

kubernetes_executor does not properly fail task-instances in cleanup_stuck_queued_tasks

Open waldoppper opened this issue 10 months ago • 4 comments

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.3

What happened?

I'm chasing an issue involving a hard-to-reproduce issue with deferrable operators being "stuck" in Queued state.

In debugging, I noticed what seems to be a defficiency in the kubernetes_executor's override of cleanup_stuck_queued_tasks is not actually failing the instance.

What you think should happen instead?

Background: In debugging, I focused on the logs available at default log-level, which included

Marking task instance <TaskInstance: my_dag.my_group.my_task scheduled__2024-03-25T20:00:00+00:00 [queued]> stuck in queued as failed. If the task instance has available retries, it will be retried.

In reviewing the code printing this message, it seems clear that the expectation of a base_executor subclass is to fail the task-instances. For reference, the celery_executor does.

Problem: The kubernetes_executor does not.

Solution: It seems to me like it should.

How to reproduce

The true root cause of my issue is still a mystery to me. This is an attempt at fixing this safety net.

Operating System

debian

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==8.0.0

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

waldoppper avatar Apr 16 '24 22:04 waldoppper

@waldoppper , Did you mitigate this issue ? Found any temporary solution ?

paramjeet01 avatar Apr 24 '24 11:04 paramjeet01

We've tried manually deleting the task instance with no luck. (this makes me think that addressing this particular issue may not help us)

The only workaround I'm aware of is to disable deferral on operators.

waldoppper avatar Apr 26 '24 16:04 waldoppper

@paramjeet01, we found that restarting the scheduler enables the task to start.

waldoppper avatar Apr 30 '24 16:04 waldoppper

@paramjeet01 Its a bug in the Kubernetes executor. I am going to work on the fix for the same.

dirrao avatar May 10 '24 11:05 dirrao

@dirrao are you working on this issue?

eladkal avatar Jun 22 '24 19:06 eladkal

As per the Kubernetes executor, if the tasks that are stuck in a queued state and don't have associated tasks are rescheduled. If the tasks are stuck in a queued state and have associated pods, then those will be marked as failed. I believe this behviour seems to be correct.

@hussein-awala / @jedcunningham WDYT?

dirrao avatar Jun 23 '24 06:06 dirrao