
Update standard_task_runner.py

RNHTTR opened this issue 1 year ago • 7 comments

This can be caused by any number of things. Suggesting OOM can send a user down a troubleshooting path that is not helpful.

RNHTTR avatar May 10 '24 12:05 RNHTTR

> This can be caused by any number of things.

Maybe instead of removing it, describe those kinds of things and how they could be analysed.

Taragolis avatar May 10 '24 12:05 Taragolis

Yep. I think this is an opportunity to send a user down a GOOD path (or at least explain the various reasons and possibly point the user to some troubleshooting documentation). Otherwise this new message gives absolutely no clue whatsoever (which means users will come and open issues because they have no idea what to do). The OOM message at least pointed the user in A direction where they could start looking at possible reasons and maybe find something.

potiuk avatar May 10 '24 13:05 potiuk

> Yep. I think this is an opportunity to send a user down a GOOD path (or at least explain the various reasons and possibly point the user to some troubleshooting documentation). Otherwise this new message gives absolutely no clue whatsoever (which means users will come and open issues because they have no idea what to do). The OOM message at least pointed the user in A direction where they could start looking at possible reasons and maybe find something.

I don't have a good recommendation. A more likely path for OOM (at least in my experience) is the task becoming a zombie rather than Airflow failing the task in a more typical manner. I've seen this error message somewhat frequently, but I don't know that I've ever seen it as a result of OOM.

RNHTTR avatar May 16 '24 17:05 RNHTTR

@amoghrajesh had a good response to a user in Slack:

> -9 is a SIGKILL. You can handle it using the on_kill callback in your Operator.

Perhaps the message could recommend that users implement an on_kill handler?

RNHTTR avatar May 16 '24 20:05 RNHTTR
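
For context on the -9 mentioned above: on POSIX systems, a child killed by a signal is reported to its parent with a negative return code, so death by SIGKILL shows up as -9. A small standalone sketch (plain Python, not Airflow code) illustrating this:

```python
import signal
import subprocess
import sys

# Spawn a child process that sends SIGKILL to itself, then inspect how
# the parent sees the exit. subprocess reports death-by-signal as the
# negated signal number (POSIX only).
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(child.returncode)                      # -9
print(-child.returncode == signal.SIGKILL)   # True
```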

@RNHTTR good recommendation. What if we document that better, or do something even better: in the on_kill callback for BaseOperator, we can add enough information to send users down various debugging paths. TL;DR: add all the possible causes we can think of there, since the on_kill callback will only be called in case of, well, a kill.

    def on_kill(self):
        self.log.info("SIGKILL was called. It could be because of: a)...b)....")

amoghrajesh avatar May 17 '24 04:05 amoghrajesh

> @RNHTTR good recommendation. What if we document that better, or do something even better: in the on_kill callback for BaseOperator, we can add enough information to send users down various debugging paths. TL;DR: add all the possible causes we can think of there, since the on_kill callback will only be called in case of, well, a kill.
>
>     def on_kill(self):
>         self.log.info("SIGKILL was called. It could be because of: a)...b)....")

SIGKILL will never trigger on_kill. The -9 signal cannot really be handled in "on_kill"; this is why we are guessing here about why the process was killed. The "on_kill" method name really has no relation to SIGKILL (-9): it is called when the task is stopped more gracefully, rather than by -9.

I think the right approach is to explain more about what happens; the current description is rather vague. Explain that the task process was killed externally by -9, and list possible reasons why that might happen. OOM is one reason, but there are others: for example, when a machine/pod is evicted, -9 might be sent to all processes that are not responsive to other attempts to kill them. It would be great to describe all of that in a little more detail and give the user some direction to look in. Usually it is a signal sent by the deployment (K8S), but there may be other sources too; I think the Airflow standard task runner heartbeat might also SIGKILL such a process if it becomes unresponsive (and there is likely another log written somewhere in that case), so it would be worth checking. Listing just a few possible reasons here (and making sure the list is open-ended) could be useful. Maybe we should even have an FAQ section, "Why can my task get SIGKILLed?", with a bit more description there.

potiuk avatar May 19 '24 15:05 potiuk
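
The distinction drawn above can be demonstrated with plain Python (a sketch, not Airflow code): a SIGTERM handler, which is roughly the kind of graceful-shutdown hook on_kill relies on, gets a chance to run, while SIGKILL terminates the process with no opportunity to react:

```python
import signal
import subprocess
import sys
import time

# A child whose SIGTERM handler (think: a graceful shutdown hook)
# exits with code 42. SIGKILL offers no such escape hatch.
CHILD = (
    "import signal, sys, time;"
    "signal.signal(signal.SIGTERM, lambda *_: sys.exit(42));"
    "time.sleep(30)"
)

def exit_code_after(sig):
    p = subprocess.Popen([sys.executable, "-c", CHILD])
    time.sleep(1)  # give the child time to install its handler
    p.send_signal(sig)
    return p.wait()

print(exit_code_after(signal.SIGTERM))  # 42: the handler ran
print(exit_code_after(signal.SIGKILL))  # -9: the handler never ran
```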

Thanks for the clarification @potiuk. I was under the impression that -9 is caught in on_kill. I agree with your suggestions; we should mention some of these to provide a debugging direction. It could include things like:

  1. The task might have been killed due to OOM
  2. The task might have been killed due to a bad underlying host
  3. The task could have been killed by your deployment manager due to x, y reasons.

amoghrajesh avatar May 20 '24 04:05 amoghrajesh

@potiuk I thought about this a little more. I think we should keep a note about OOMKill for -9 (and maybe add a blurb about other things that could cause -9, like you suggest), but we should log something different when the return code is None. In that case, we should simply indicate that the task was killed for some unknown reason. I think just assuming -9 is misleading and causes more confusion. What do you think?

RNHTTR avatar Jun 09 '24 17:06 RNHTTR
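
As a rough sketch of the proposal above (the helper name and message wording here are hypothetical, not the actual Airflow change), the two cases could be logged differently:

```python
def describe_task_exit(return_code):
    """Hypothetical helper: map a task runner return code to a log hint."""
    if return_code is None:
        # Don't guess at -9 here: we genuinely don't know why the task stopped.
        return "Task exited for an unknown reason (no return code recorded)."
    if return_code == -9:  # death by SIGKILL
        return (
            "Task received SIGKILL. Possible causes include (but are not "
            "limited to): the OS or container runtime OOM-killing the "
            "process, eviction of the pod/machine, or an unresponsive "
            "process being force-killed. See the troubleshooting docs."
        )
    return f"Task exited with return code {return_code}."

print(describe_task_exit(None))
print(describe_task_exit(-9))
print(describe_task_exit(1))
```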

> @potiuk I thought about this a little more. I think we should keep a note about OOMKill for -9 (and maybe add a blurb about other things that could cause -9, like you suggest), but we should log something different when the return code is None. In that case, we should simply indicate that the task was killed for some unknown reason. I think just assuming -9 is misleading and causes more confusion. What do you think?

I think anything where we have space (in our docs) to direct the user (via a link) to where they can look for the problem is good. Even something incomplete that says "these can be the reasons, but there are more" is far better than a message that gives the user no clue whatsoever. We can add more material there over time, even if the initial assessment is incomplete: every time we discuss with a user and find another reason, we can update that documentation and make it better. If another committer looks at it and has no clue, they can also learn from that information; that is why it should provide context about where the error might be generated from. Just logging WHAT happened, without explaining WHY it happened and HOW to fix the problem, will inevitably lead users to open issues or discussions asking what to do. Our goal should be that:

a) they find a possible cause by following the docs, or
b) when they post an issue or discussion, other users follow the docs and advise them, or
c) a maintainer finds the root cause by investigating, and not only tells the user but also updates the documentation to include that reason.

potiuk avatar Jun 09 '24 20:06 potiuk

Based on the feedback here and the feedback in #40141, I'm going to add a "troubleshooting obscure task failures" docs page and link to that page from this log message, as well as the one in #40141.

See #40334

RNHTTR avatar Jun 19 '24 15:06 RNHTTR