Tasks fail without logs under heavy load
Apache Airflow version
2.10.4
If "Other Airflow 2 version" selected, which one?
No response
What happened?
I have multiple dag_runs of one DAG running in parallel on a Kubernetes cluster with a single worker pod. I use a parallelism of 16 and a retry_count of 4. The DAG is composed of mapped tasks; the biggest one spawns 36 mapped task instances. Every day 100 dag_runs are spawned together, and the dag_run with the most tasks fails with 3/4 mapped tasks failed. Those tasks fail after 4 retries, but most of the time I see only 1 or 2 execution logs. Most of the time the log is:
[2024-12-19T11:06:21.433+0000] {local_task_job_runner.py:123} INFO - ::group::Pre task execution logs
[2024-12-19T11:06:21.792+0000] {taskinstance.py:2603} INFO - Dependencies not met for <TaskInstance: ExportSii.XMLGeneration manual__2024-12-19T11:05:23.599025+00:00 map_index=9 [up_for_retry]>, dependency 'Not In Retry Period' FAILED: Task is not ready for retry yet but will be retried automatically. Current date is 2024-12-19T11:06:21.791874+00:00 and task will be retried at 2024-12-19T11:06:44.472566+00:00.
[2024-12-19T11:06:21.805+0000] {local_task_job_runner.py:166} INFO - Task is not able to be run
This, for example, is attempt=2.log, and I don't have attempts 1, 3, or 4, either in the log files or in the UI.
When I then clear the state of the failed tasks, they run correctly without errors.
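The 'Not In Retry Period' line in the log above carries two timestamps: when the attempt was started and when the retry was actually due. A minimal sketch in plain Python (not Airflow code; the `retry_gap` helper name is made up for illustration) that extracts both and shows how early the attempt was picked up:

```python
import re
from datetime import datetime, timedelta

# Hypothetical helper, not part of Airflow: it only parses the message text
# of the "Not In Retry Period" dependency line.
PATTERN = re.compile(
    r"Current date is (?P<now>\S+) and task will be retried at (?P<due>\S+?)\.?$"
)


def retry_gap(log_line: str) -> timedelta:
    """Return how long before the scheduled retry time the attempt was started."""
    match = PATTERN.search(log_line.strip())
    if match is None:
        raise ValueError("line does not contain a 'Not In Retry Period' message")
    now = datetime.fromisoformat(match.group("now"))
    due = datetime.fromisoformat(match.group("due"))
    return due - now


if __name__ == "__main__":
    # Message tail copied from the attempt=2.log excerpt above.
    line = (
        "Task is not ready for retry yet but will be retried automatically. "
        "Current date is 2024-12-19T11:06:21.791874+00:00 and task will be "
        "retried at 2024-12-19T11:06:44.472566+00:00."
    )
    print(retry_gap(line))
```

For the excerpt above the gap is roughly 22.7 seconds, i.e. this attempt was launched while the task was still waiting in its retry period.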
What you think should happen instead?
I would like to see all the attempts, and a clearer trace of what happened, so I can debug the problem.
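To make the gap concrete, here is a minimal sketch that lists which attempt logs are missing for one task instance. It assumes the task logs are readable as files on disk and that each attempt is written as `attempt=N.log` inside the task instance's log directory (as with the `attempt=2.log` mentioned above); with remote logging the files would live elsewhere. The `missing_attempts` name and the example path are hypothetical.

```python
import re
from pathlib import Path

# Matches file names such as "attempt=2.log" and captures the attempt number.
ATTEMPT_FILE = re.compile(r"attempt=(\d+)\.log$")


def missing_attempts(task_log_dir, expected_tries):
    """Return the attempt numbers in 1..expected_tries that have no log file."""
    present = set()
    for path in Path(task_log_dir).glob("attempt=*.log"):
        match = ATTEMPT_FILE.search(path.name)
        if match:
            present.add(int(match.group(1)))
    return [n for n in range(1, expected_tries + 1) if n not in present]


if __name__ == "__main__":
    # Hypothetical path: point this at the log directory of one mapped task instance.
    log_dir = "/path/to/logs/of/one/task/instance"
    print(missing_attempts(log_dir, expected_tries=4))
```

For a directory that contains only `attempt=2.log`, calling it with `expected_tries=4` returns `[1, 3, 4]`, which is the situation described in this report.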
How to reproduce
It mostly depends on the workload. On another instance with the same code but less load, it doesn't happen.
Operating System
helm-chart on kubernetes
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Kubernetes on GKE
Anything else?
I don't see any errors at the Kubernetes level or in the worker log.
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.
It is not clear from the description how to reproduce the issue so that others can resolve it. If you could provide a minimal example that reproduces it, that would be helpful.
This seems to be a duplicate of https://github.com/apache/airflow/issues/42107. I'm closing this one, and we can continue the discussion there.