Tasks fail without logs under heavy load

Open team-hawking-stam opened this issue 1 year ago • 1 comments

Apache Airflow version

2.10.4

If "Other Airflow 2 version" selected, which one?

No response

What happened?

I have multiple dag_runs of a DAG running in parallel on a Kubernetes cluster with a single worker pod. I use a parallelism of 16 and a retry_count of 4. The DAG is composed of mapped tasks; the biggest one spawns 36 mapped tasks. Every day 100 dag_runs are spawned together, and the dag_run with the most tasks fails with 3/4 mapped tasks failed. Those tasks fail after 4 retries, but most of the time I see the logs of only 1 or 2 executions. Most of the time the log is:

[2024-12-19T11:06:21.433+0000] {local_task_job_runner.py:123} INFO - ::group::Pre task execution logs
[2024-12-19T11:06:21.792+0000] {taskinstance.py:2603} INFO - Dependencies not met for <TaskInstance: ExportSii.XMLGeneration manual__2024-12-19T11:05:23.599025+00:00 map_index=9 [up_for_retry]>, dependency 'Not In Retry Period' FAILED: Task is not ready for retry yet but will be retried automatically. Current date is 2024-12-19T11:06:21.791874+00:00 and task will be retried at 2024-12-19T11:06:44.472566+00:00.
[2024-12-19T11:06:21.805+0000] {local_task_job_runner.py:166} INFO - Task is not able to be run

This, for example, is attempt=2.log, and I don't have attempts 1, 3, or 4, either in the log files or in the UI.
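The 'Not In Retry Period' line in the log above carries two timestamps: the time the task was picked up and the time it becomes eligible for retry. A minimal sketch, using only the Python standard library, that extracts both and reports how early the attempt was started. The helper name `seconds_until_retry` is hypothetical and not part of Airflow; the regular expression assumes the message wording shown in the log above.

```python
import re
from datetime import datetime
from typing import Optional

# Log line copied from the report above.
LINE = (
    "[2024-12-19T11:06:21.792+0000] {taskinstance.py:2603} INFO - Dependencies not met for "
    "<TaskInstance: ExportSii.XMLGeneration manual__2024-12-19T11:05:23.599025+00:00 map_index=9 "
    "[up_for_retry]>, dependency 'Not In Retry Period' FAILED: Task is not ready for retry yet but "
    "will be retried automatically. Current date is 2024-12-19T11:06:21.791874+00:00 and task will "
    "be retried at 2024-12-19T11:06:44.472566+00:00."
)

# Assumption: the message always has this "Current date is ... retried at ..." shape.
PATTERN = re.compile(
    r"Current date is (?P<now>\S+) and task will be retried at (?P<retry>\S+)"
)


def seconds_until_retry(line: str) -> Optional[float]:
    """Return how many seconds before the retry time the task was picked up,
    or None if the line is not a 'Not In Retry Period' message."""
    match = PATTERN.search(line)
    if match is None:
        return None
    now = datetime.fromisoformat(match.group("now"))
    # The second timestamp is followed by the sentence's final period; strip it.
    retry = datetime.fromisoformat(match.group("retry").rstrip("."))
    return (retry - now).total_seconds()


if __name__ == "__main__":
    print(seconds_until_retry(LINE))
```

For the line above, the task was started roughly 22.7 seconds before its retry window opened, which is why this attempt exited with "Task is not able to be run".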

When I clear the state of the failed tasks, they run correctly without errors.

What you think should happen instead?

I would like to see all the attempts, and a clearer trace of what happened, so I can debug the problem.
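To make the gap concrete, here is a minimal sketch that reports which attempt logs are absent for one mapped task instance. It assumes log files are named `attempt=N.log`, as with the `attempt=2.log` mentioned above, and that 4 attempts are expected; the function name `missing_attempts` and the `expected_tries` parameter are hypothetical and not part of Airflow.

```python
import re
from typing import Iterable, List

# Assumption: each try writes a file whose name ends in "attempt=<N>.log".
ATTEMPT_RE = re.compile(r"attempt=(\d+)\.log$")


def missing_attempts(filenames: Iterable[str], expected_tries: int) -> List[int]:
    """Return the attempt numbers in 1..expected_tries that have no log file."""
    seen = set()
    for name in filenames:
        match = ATTEMPT_RE.search(name)
        if match:
            seen.add(int(match.group(1)))
    return [n for n in range(1, expected_tries + 1) if n not in seen]


if __name__ == "__main__":
    # Only attempt 2 left a log file, as described in the report.
    print(missing_attempts(["attempt=2.log"], expected_tries=4))
```

The file names could come from listing the task instance's log directory, for example with `os.listdir`, or from the remote log bucket if remote logging is enabled.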

How to reproduce

It's mostly dependent on the workload. On another instance with the same code but less load, it doesn't happen.

Operating System

helm-chart on kubernetes

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Kubernetes on GKE

Anything else?

I don't see any errors at the Kubernetes level or in the worker log.

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

team-hawking-stam avatar Dec 19 '24 13:12 team-hawking-stam

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

boring-cyborg[bot] avatar Dec 19 '24 13:12 boring-cyborg[bot]

It is not clear from the description how to reproduce the issue so that others can resolve it. If you could provide a minimal example for reproducing it, that would be helpful.

shahar1 avatar Jan 10 '25 08:01 shahar1

It seems like a duplicate of https://github.com/apache/airflow/issues/42107, I'm closing this one and we could continue the discussion there.

shahar1 avatar Jan 10 '25 09:01 shahar1