airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Increment try_number for externally failed tasks

Open pankajkoti opened this issue 2 years ago • 9 comments

While working on PR #32646, I observed that when tasks are marked failed externally like from the scheduler e.g. in the case when scheduler marks the task as failed as it timed out being stuck in the queue, we set the state of the task as failed but do not set the try_number for the attempt. As a result, subsequent functionalities like downloading the logs, seeing legacy logs don't work as the corresponding try_number is not fetched and hence logs are not available for try_number=0 which is being set for the task instance instead of the corresponding try_number based on the attempt. The PR fixes this issue in handling external failures for the task by setting the corresponding try_number.


^ Add meaningful description above

Read the Pull Request Guidelines for more information. In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed. In case of a new dependency, check compliance with the ASF 3rd Party License Policy. In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

pankajkoti avatar Jul 25 '23 09:07 pankajkoti

Before Screenshot 2023-07-25 at 2 41 25 PM

After Screenshot 2023-07-25 at 2 39 51 PM

pankajkoti avatar Jul 25 '23 09:07 pankajkoti

Thanks a ton to @utkarsharma2 for pairing and helping debug this here 🙂

pankajkoti avatar Jul 25 '23 09:07 pankajkoti

@pankajkoti could you check the failed tests?

hussein-awala avatar Aug 03 '23 20:08 hussein-awala

hi @pankajkoti can you look into the test failures and rebase?

eladkal avatar Oct 30 '23 06:10 eladkal

hi @eladkal, yes, it might need more understanding on my end to understand the effects of try_number. This week looks tough, I just returned from my travel today and will need to catch up at work. I will try to find some time next week to work on this

pankajkoti avatar Oct 30 '23 07:10 pankajkoti

Addresses a critical issue identified during the work on PR #32646. When tasks are externally marked as failed by the scheduler, such as in cases where a task times out while stuck in the queue, the current behavior sets the task state to 'failed' but neglects to assign the corresponding try_number for the attempt.

This omission creates problems for subsequent functionalities like log retrieval and viewing legacy logs, as the necessary try_number is not fetched. Consequently, logs are unavailable for try_number=0, which is incorrectly set for the task instance instead of the appropriate try_number based on the attempt.

This PR resolves the issue by ensuring that the corresponding try_number is correctly set when handling external failures for the task. Your review and feedback on this fix are appreciated. Please adhere to the Pull Request Guidelines. For fundamental code changes, consider initiating an Airflow Improvement Proposal (AIP). Also, check for compliance with the ASF 3rd Party License Policy in case of new dependencies. If there are potential backwards incompatible changes, leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, within the newsfragments directory. Thank you for your attention to these considerations.

itstalmeez avatar Dec 06 '23 20:12 itstalmeez

tests need fixing

potiuk avatar Dec 10 '23 22:12 potiuk

@pankajkoti are you still working on this? Looks like the tests are failing

amoghrajesh avatar Jan 18 '24 04:01 amoghrajesh

Thanks for the reminders here, could not get to it earlier due to prirorities at work. Will work on fixing these over this weekend 🙏🏽

pankajkoti avatar Jan 18 '24 04:01 pankajkoti

This seems to have been handled during the recent try_number work and I'm able to see this(what the PR aims to solve) working fine without the issue on main branch. Hence, closing this PR.

pankajkoti avatar Jun 03 '24 06:06 pankajkoti