airflow Fix scope of backfill job raise TaskConcurrencyLimitReached

When a task raise TaskConcurrencyLimitReached its sibling task may be blocked, i think narrowing the scope of TaskConcurrencyLimitReached would solve this problem

related: #26078

^ Add meaningful description above

Read the Pull Request Guidelines for more information. In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed. In case of a new dependency, check compliance with the ASF 3rd Party License Policy. In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Sep 05 '22 17:09 chenglongyan

I'm wondering if we should mimic the scheduler on these limits?

Sep 06 '22 09:09 ephraimbuddy

@ephraimbuddy If max_active_tasks_per_dag is triggered (in scheduler_job), the next task will continue to execute instead of raise an exception

https://github.com/apache/airflow/blob/a901ad93faf80f0f4399154ac52920f9da1316e7/airflow/jobs/scheduler_job.py#L451-L463

Sep 07 '22 04:09 chenglongyan

@ephraimbuddy If max_active_tasks_per_dag is triggered (in scheduler_job), the next task will continue to execute instead of raise an exception

I meant what we have in SchedulerJob._executable_task_instances_to_queued. i.e starved_pool etc

Sep 07 '22 11:09 ephraimbuddy

@ephraimbuddy , Is it too expensive if just to fix this problem?

Sep 11 '22 18:09 chenglongyan

@ephraimbuddy , Is it too expensive if just to fix this problem?

I think what @ephraimbuddy suggest is to fix it properly - i.e. reuse the code that is used in scheduler. We are not only focusing on fixing the problems, but also want to make sure that they can be maintained in the future and avoid a lot of duplication in the logic. Seems that attempting to fix the issue by doing it in the same way as in the scheduler, fits that criteria.

Also speaking about maintainability, this code lacks any tests testing the behaviour (failing before/succeeding after the change) which makes it super-difficult to refactor the code later in case we decide to improve this part.

So I suggest to do both here:

take a look at how scheaduler deals with it (potentially reusing code from it - maybe even extracting some common code)
write a test to cover this case.

Sep 18 '22 20:09 potiuk

BTW. We almost never accept PRs that have no tests covering it, there must be a really good justification to not have them.

Sep 18 '22 21:09 potiuk

I think I might need some time to try and see if I can refactor the backfill...

Sep 20 '22 06:09 chenglongyan

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

Nov 05 '22 00:11 github-actions[bot]