Fix scope of backfill job raise TaskConcurrencyLimitReached
When a task raise TaskConcurrencyLimitReached its sibling task may be blocked, i think narrowing the scope of TaskConcurrencyLimitReached would solve this problem
related: #26078
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.
I'm wondering if we should mimic the scheduler on these limits?
@ephraimbuddy If max_active_tasks_per_dag is triggered (in scheduler_job), the next task will continue to execute instead of raise an exception
https://github.com/apache/airflow/blob/a901ad93faf80f0f4399154ac52920f9da1316e7/airflow/jobs/scheduler_job.py#L451-L463
@ephraimbuddy If max_active_tasks_per_dag is triggered (in scheduler_job), the next task will continue to execute instead of raise an exception
I meant what we have in SchedulerJob._executable_task_instances_to_queued. i.e starved_pool etc
@ephraimbuddy , Is it too expensive if just to fix this problem?
@ephraimbuddy , Is it too expensive if just to fix this problem?
I think what @ephraimbuddy suggest is to fix it properly - i.e. reuse the code that is used in scheduler. We are not only focusing on fixing the problems, but also want to make sure that they can be maintained in the future and avoid a lot of duplication in the logic. Seems that attempting to fix the issue by doing it in the same way as in the scheduler, fits that criteria.
Also speaking about maintainability, this code lacks any tests testing the behaviour (failing before/succeeding after the change) which makes it super-difficult to refactor the code later in case we decide to improve this part.
So I suggest to do both here:
- take a look at how scheaduler deals with it (potentially reusing code from it - maybe even extracting some common code)
- write a test to cover this case.
BTW. We almost never accept PRs that have no tests covering it, there must be a really good justification to not have them.
I think I might need some time to try and see if I can refactor the backfill...
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.