Add SIGHUP signal before SIGTERM for task runner termination
This PR will add the ideal signal sequence for task runner termination mentioned here: https://github.com/apache/airflow/issues/35474#issuecomment-1801988073.
Tests are failing. Can you fix them?
FAILED tests/jobs/test_local_task_job.py::TestLocalTaskJob::test_mark_success_no_kill - AttributeError: 'Process' object has no attribute 'id'
FAILED tests/jobs/test_local_task_job.py::TestLocalTaskJob::test_local_task_return_code_metric - AttributeError: 'Process' object has no attribute 'id'
FAILED tests/jobs/test_local_task_job.py::TestLocalTaskJob::test_localtaskjob_maintain_heart_rate - AttributeError: 'Process' object has no attribute 'id'
FAILED tests/jobs/test_local_task_job.py::TestLocalTaskJob::test_mark_failure_on_failure_callback - AttributeError: 'Process' object has no attribute 'id'
FAILED tests/jobs/test_local_task_job.py::TestLocalTaskJob::test_dagrun_timeout_logged_in_task_logs - AttributeError: 'Process' object has no attribute 'id'
FAILED tests/jobs/test_local_task_job.py::TestLocalTaskJob::test_mark_success_on_success_callback - AttributeError: 'Process' object has no attribute 'id'
FAILED tests/jobs/test_local_task_job.py::TestLocalTaskJob::test_success_slow_listeners_executed_kill - AttributeError: 'Process' object has no attribute 'id'
FAILED tests/jobs/test_local_task_job.py::TestLocalTaskJob::test_success_slow_task_not_killed_by_overtime_but_regular_timeout - AttributeError: 'Process' object has no attribute 'id'
FAILED tests/jobs/test_local_task_job.py::TestLocalTaskJob::test_process_sigsegv_error_message - AttributeError: 'Process' object has no attribute 'id'
FAILED tests/jobs/test_local_task_job.py::TestSigtermOnRunner::test_process_sigterm_works_with_retries[fork-daemon] - airflow.exceptions.AirflowTaskTimeout: Timeout during waiting callback, PID: 83
FAILED tests/jobs/test_local_task_job.py::TestSigtermOnRunner::test_process_sigterm_works_with_retries[fork-non-daemon] - airflow.exceptions.AirflowTaskTimeout: Timeout during waiting callback, PID: 83
FAILED tests/jobs/test_local_task_job.py::TestSigtermOnRunner::test_process_sigterm_works_with_retries[spawn-daemon] - airflow.exceptions.AirflowTaskTimeout: Timeout during waiting callback, PID: 83
FAILED tests/jobs/test_local_task_job.py::TestSigtermOnRunner::test_process_sigterm_works_with_retries[spawn-non-daemon] - airflow.exceptions.AirflowTaskTimeout: Timeout during waiting callback, PID: 83
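For context, a minimal sketch of what this AttributeError likely points at (assuming the failing code reads the process id from a `psutil.Process` or `multiprocessing.Process` handle - both expose it as `.pid`, not `.id`; this is an illustration, not the PR's actual code):

```python
import psutil

proc = psutil.Process()  # handle to the current process, purely for illustration
print(proc.pid)          # correct: the process id is exposed as .pid
# print(proc.id)         # AttributeError: 'Process' object has no attribute 'id'
```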
The PR needs to be rebased and the conflicts resolved - it is currently 149 commits behind main, so rebasing is a good thing anyway.
Hi @eladkal, I fixed the first part of the failing tests, but I did not find a solution for the last 4. If you have any ideas on how I can approach them, please feel free to share them with me.
Hi @potiuk, I rebased and pushed the code again. Thank you for the warning.
Hi @kaxil @potiuk! Can you please take a look at the changes? Thank you!
If I am not mistaken, this one also adds a timeout in the supervising process. Was that intentional? (The commit message only mentions adding SIGHUP/SIGTERM.)
Hi @potiuk, yes, it was on purpose. I was reading the original discussion and there is this comment written by Taragolis: https://github.com/apache/airflow/issues/35474#issuecomment-1801897704 - I was trying to implement the 2nd step as well.
If you think this is too complicated for a single commit, I can split the change into two commits:
- one commit for the changes in signaling (the reap_process_group function)
- one commit for the changes in local_task_job_runner.py
Also, one test case (with 4 different parameter groups) is still failing. I could not fix the problem with that test case - if you can give me some help, it would be very much appreciated!
Hmm. I am not sure, @ashb - how much will this one be affected by AIP-72? It's quite likely (almost certain) that starting a task will be heavily reworked in Airflow 3, and since this change is really a new feature, we should possibly defer it until AIP-72 starts taking shape.
@ashb?
Yeah, the task runner interface is being completely rewritten (the PoC is almost far enough along that I know where it will go and can start PRs).
I'll also need to re-read the motivation for this, as I think right now we do INT -> TERM, so I'll need to see what difference sending HUP instead would make.
Hi @potiuk @ashb! Can you please share some more details about AIP-72? As far as I can see, there is a page dedicated to some proposals, but not to the implementation itself. We've spent quite some time implementing this, so maybe we can discuss how our implementation could be used, based on your proposals? :)
We are waiting for @ashb's POC implementation on that one.
Hi @ashb! Do you have an estimate for when this logic will be implemented?
Hi @potiuk @ashb! Can we please check the changes again? Is there a way for someone else to review them? :)
I think it would really be sensible to wait for the new implementation to be merged and do it then. It's kind of a waste of time (especially since the tests are failing) to get it merged before that.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
removed stale.
@potiuk Why do we want to send a SIGHUP first? That seems like a very odd signal to send. SIGINT followed by SIGTERM is what Docker/podman/every other container runtime(?) does and feels like a much more normal sequence of signals. Especially given that SIGINT sent to a Python process will raise a KeyboardInterrupt, won't it, whereas by default SIGHUP would just kill the process outright (as it doesn't have a default signal handler, so the default behaviour is to just exit!)
@molcay The new implementation has now been merged and lives in https://github.com/apache/airflow/blob/main/task_sdk/src/airflow/sdk/execution_time/supervisor.py (as of today only LocalExecutor has been ported over to use this new code path; other executors will follow soon). However, before spending time updating this PR, I'd like Jarek to answer the above question.
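For illustration of the behaviour described in the question above - a hedged sketch, not code from the PR: CPython installs a default SIGINT handler that raises KeyboardInterrupt, while SIGHUP has no Python-level handler by default, so the OS default action (terminate) applies unless the task installs one:

```python
import signal
import time

def on_hup(signum, frame):
    # Without this handler, SIGHUP would terminate the process outright
    # (the OS default action), since Python installs no handler for it.
    print("got SIGHUP, switching to non-interactive mode instead of dying")

signal.signal(signal.SIGHUP, on_hup)

try:
    time.sleep(3600)  # stand-in for task work
except KeyboardInterrupt:
    # CPython's default SIGINT handler raises KeyboardInterrupt here,
    # giving the task a natural place to clean up gracefully.
    print("got SIGINT, cleaning up")
```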
Yes, we could skip SIGHUP, but whatever we do, we should clearly document our signal escalation policy. I am not very strong on it, because I don't think there is a single, always-followed path of signal escalation. HUP/TERM/INT is what I've been using in the past for all kinds of processes, because this is what most processes that have not been "daemon-designed" will normally experience in their lifecycle.
Generally speaking, Docker sends just the "TERM/KILL" sequence (without INT) via docker stop - but you can replace the TERM with --signal NN, instruct Docker to use another signal by declaring STOPSIGNAL in your Dockerfile, or override the default when you build or run your container: https://docs.docker.com/reference/cli/docker/container/stop/. So they are really not doing signal escalation beyond a very basic one.
More on why I proposed HUP/TERM/INT/KILL over TERM/INT/KILL:
- HUP is a signal that is always sent when your terminal disconnects - so if you have any CLI that you run via Airflow, it should be prepared to handle it as the first thing that happens. Most daemons will reload configuration when they receive HUP; most CLI programs will handle it exactly as if the user disconnected - if they are prepared for being disconnected, they will detach stdin/stdout and stop writing to them (and a few other things - switching to "non-interactive mode"). If not, the programs will stop (usually in some graceful way, if possible). But all apps generally should have a reasonable response to it. Also, HUP is not a deliberate action of the user - the idea behind HUP is that it is really "what happens when your connection from the terminal breaks".
  I think this is why Docker does not really send HUP - because by default there is no terminal connected to the running app (you have to run your container with -t to get a terminal connected).
  I think Airflow is in a bit of a different league though - many of the things our users run as "tasks" are really "interactive" CLIs wrapped in Airflow tasks, rather than long-running Docker processes (which is the default mode of Docker).
  But I am not so strong about it (see INT below).
- INT is the weak interruption that is sent to non-interactive programs (or those that switched from interactive to non-interactive mode). It is the same signal that is sent with Ctrl-C, which means it is a deliberate action of the user. And yes - I think what we have here is a deliberate action from the "user" ("airflow worker" in this case). So maybe we indeed want to skip the HUP one.
- TERM is the regular, stronger kill request. We could also consider QUIT after TERM.
- QUIT is the "application misbehaves and we really want to quit it" signal. The advantage of using it is that by default in C programs (and also Python) it should produce a debuggable core dump, which I think is a good idea if the process did not respond to TERM in a reasonable time. In the case of Docker it makes little sense, because generally such a core dump would be lost when the container exits - maybe that is why they are not using it.
- KILL - of course, the non-ignorable last-resort signal.

So maybe what we really want is INT/TERM/QUIT/KILL?
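A rough sketch of what such an escalation could look like in the supervising process (illustrative only - this is not the PR's actual reap_process_group change; the signal order and the per-step timeout are assumptions taken from the list above):

```python
import signal

import psutil

def escalate_termination(pid: int, step_timeout: float = 30.0) -> None:
    """Send increasingly forceful signals, waiting `step_timeout` seconds after each."""
    proc = psutil.Process(pid)
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGQUIT, signal.SIGKILL):
        proc.send_signal(sig)
        try:
            proc.wait(timeout=step_timeout)  # returns as soon as the process exits
            return
        except psutil.TimeoutExpired:
            continue  # still alive - escalate to the next, stronger signal
```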
My default approach here would be to follow what containers do, as these days it's the expected pattern: TERM -> KILL after a timeout, and no other signals. I don't personally see much value in a second "quit" signal; if the process doesn't finish after the first one (whether that is INT, TERM or QUIT), I don't think there's really much value in sending it a second signal -- either it was well-behaved and is busy (in which case what would the task do? Try really hard to stop :grin:?) or it ignored the signal and will ignore this one too.
So my vote is just TERM -> ...timeout ... -> KILL ala Docker.
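The Docker-style policy voted for here could look roughly like the following (a sketch under the assumption that the supervisor already holds a psutil.Process handle for the task; the 30-second grace period mirrors docker stop's default and is an assumption, not a value from this PR):

```python
import signal

import psutil

def stop_like_docker(proc: psutil.Process, grace_period: float = 30.0) -> None:
    proc.send_signal(signal.SIGTERM)      # ask nicely, exactly once
    try:
        proc.wait(timeout=grace_period)
    except psutil.TimeoutExpired:
        proc.send_signal(signal.SIGKILL)  # no second polite signal - go straight to KILL
        proc.wait()
```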
Good point about INT on docker, I was remembering wrongly.
The simpler, the better, yes. I am ok with it - as long as we also document the reasoning and the fact that we made a deliberate decision on it.
The second part is more important than the choice we make, as it will effectively become part of the public Airflow interface - users have to know what to expect.
Also, we should document (alongside the escalation of signals) whether we send the signal to the process or to the process group, and whether we have a separate process group.
This makes a lot of difference.
For example, we have a whole section on this and handle it differently for Celery workers, because signal propagation for Celery should work differently than for most other cases: https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation (for Celery we cannot send signals in Docker to the whole process group, for example, if we want Celery to handle putting the workers into an offline state, but for all other containers we want to send signals to the whole process group to make sure we kill everything, regardless of whether signals are properly propagated - for example to handle bash, which does not propagate signals to its children).
So we need to make a deliberate decision on that one as well. @ashb - what's your thinking on process groups and signal propagation?
I personally think we should create a new process group when we create the "task" process and send the signal to the whole process group, not just to the process. WDYT?
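A sketch of what "new process group + signal the whole group" could mean in practice (illustrative only; the bash command and the choice of SIGTERM are placeholders, and the actual decision was still open at this point):

```python
import os
import signal
import subprocess

# Start the task in its own session (and therefore its own process group),
# so any children - e.g. a bash script that never forwards signals - share one group id.
task = subprocess.Popen(["bash", "-c", "sleep 600"], start_new_session=True)

# Later, signal the whole group rather than just the direct child, so grandchildren
# are reached even if the intermediate shell does not propagate the signal.
os.killpg(os.getpgid(task.pid), signal.SIGTERM)
```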
@kaxil has made a start on a decent TERM->KILL in https://github.com/apache/airflow/pull/44465, though yes, we will need to document this and handle process groups etc.
I read the discussion - thank you, it was illuminating for me :) As far as I understand, there is no need for this implementation anymore (after #44465), right?
Indeed - we can close it. There is still an open point about sending signals to the process group and creating a new process group, in order to handle things like bash not propagating signals by default: https://github.com/apache/airflow/pull/44465#discussion_r1863114700 (I am not sure if there was a new PR for that - I am catching up on open PRs after some private errands).