Increment try_number for externally failed tasks
While working on PR #32646, I observed that when tasks are marked
failed externally like from the scheduler e.g. in the case when scheduler
marks the task as failed as it timed out being stuck in the queue, we set
the state of the task as failed but do not set the try_number for the
attempt. As a result, subsequent functionalities like downloading the logs,
seeing legacy logs don't work as the corresponding try_number is not
fetched and hence logs are not available for try_number=0 which is being
set for the task instance instead of the corresponding try_number based
on the attempt. The PR fixes this issue in handling external failures for
the task by setting the corresponding try_number.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.
Before
After
Thanks a ton to @utkarsharma2 for pairing and helping debug this here 🙂
@pankajkoti could you check the failed tests?
hi @pankajkoti can you look into the test failures and rebase?
hi @eladkal, yes, it might need more understanding on my end to understand the effects of try_number. This week looks tough, I just returned from my travel today and will need to catch up at work. I will try to find some time next week to work on this
Addresses a critical issue identified during the work on PR #32646. When tasks are externally marked as failed by the scheduler, such as in cases where a task times out while stuck in the queue, the current behavior sets the task state to 'failed' but neglects to assign the corresponding try_number for the attempt.
This omission creates problems for subsequent functionalities like log retrieval and viewing legacy logs, as the necessary try_number is not fetched. Consequently, logs are unavailable for try_number=0, which is incorrectly set for the task instance instead of the appropriate try_number based on the attempt.
This PR resolves the issue by ensuring that the corresponding try_number is correctly set when handling external failures for the task. Your review and feedback on this fix are appreciated. Please adhere to the Pull Request Guidelines. For fundamental code changes, consider initiating an Airflow Improvement Proposal (AIP). Also, check for compliance with the ASF 3rd Party License Policy in case of new dependencies. If there are potential backwards incompatible changes, leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, within the newsfragments directory. Thank you for your attention to these considerations.
tests need fixing
@pankajkoti are you still working on this? Looks like the tests are failing
Thanks for the reminders here, could not get to it earlier due to prirorities at work. Will work on fixing these over this weekend 🙏🏽
This seems to have been handled during the recent try_number work and I'm able to see this(what the PR aims to solve) working fine without the issue on main branch. Hence, closing this PR.