Ctrl-C hangs with "Server did not exit, forcefully killing."
Hi,
Thanks for hupper, it's awesome!
I'm hitting a weird behaviour, similar to #47. I'm using hupper in https://github.com/dalibo/temboard, through the hupper API: https://github.com/dalibo/temboard/blob/c35895bf79a65758270de6fa5ea73869f17cd82f/agent/temboardagent/cli/app.py#L196-L210
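For context, the wiring is roughly this (a simplified sketch of the linked code; the dotted entry-point path and the DEBUG check are placeholders, the real logic lives in cli/app.py):

```python
import os
import sys

import hupper


def main(argv=sys.argv[1:]):
    if os.environ.get("DEBUG") == "y" and not hupper.is_active():
        # In the parent this call starts the monitor, which re-executes the
        # entry point in a watched child process.
        hupper.start_reloader("temboardagent.__main__.main")
    # ...normal agent startup (web, scheduler, worker pool) continues here...
```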
In a Debian container, with Python 3.9, I enable hupper like this:
root@d303b92213a0:/var/lib/temboard-agent# sudo -Eu postgres DEBUG=y python3 -m temboardagent
13:29:05 temboardagent[1239] INFO: app: Starting temboard-agent 9.0.dev0.
13:29:05 temboardagent[1239] DEBUG: taskmanager: Register worker discover
...
Starting monitor for PID 1243.
...
13:29:05 temboardagent[1243] INFO: app: Starting temboard-agent 9.0.dev0.
...
Hot reloading works well. Here are the logs when I hit Ctrl-C:
^CReceived SIGINT, waiting for server to exit ...
13:29:08 temboardagent[1243] INFO: services: Interrupted.
13:29:08 temboardagent[1257] INFO: services: Interrupted.
13:29:08 temboardagent[1258] INFO: services: Interrupted.
13:29:08 temboardagent[1258] INFO: taskmanager: Aborting jobs.
13:29:08 temboardagent[1258] DEBUG: services: Done. service=worker pool
13:29:08 temboardagent[1243] DEBUG: services: Terminating background service. service=scheduler pid=1257
13:29:08 temboardagent[1243] DEBUG: services: Terminating background service. service=worker pool pid=1258
13:29:09 temboardagent[1243] DEBUG: services: Waiting background services.
13:29:09 temboardagent[1257] DEBUG: services: Done. service=scheduler
Server did not exit, forcefully killing.
Then hupper hangs forever. I need to type Ctrl-C a second time.
I tried to call graceful_shutdown() from a sigint_handler without success.
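Roughly what I tried (the handler name is mine; graceful_shutdown() is called on the reloader proxy returned by hupper.get_reloader()):

```python
import signal

import hupper


def sigint_handler(signum, frame):
    # Ask the hupper monitor not to restart us, then let the usual
    # KeyboardInterrupt path tear the services down.
    if hupper.is_active():
        hupper.get_reloader().graceful_shutdown()
    raise KeyboardInterrupt()


signal.signal(signal.SIGINT, sigint_handler)
```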
Do you have a clue about what I might be doing wrong?
Regards, Étienne
It appears that your child process is receiving the SIGINT, shutting down, but not actually exiting. Thus hupper is then killing it after shutdown_interval time. Can you tell if the child's PID is gone after the "services: Done. service=scheduler" message, prior to the "Server did not exit, forcefully killing." message?
I'm pretty sure the child's PID is gone by then. Before SIGINT:
$ ps -efH
root 60 0 0 19:55 pts/0 00:00:00 /bin/bash
root 440 60 0 19:57 pts/0 00:00:00 sudo -Eu postgres DEBUG=y python3 -m temboardagent
postgres 441 440 6 19:57 pts/0 00:00:00 python3 -m temboardagent
postgres 445 441 6 19:57 pts/0 00:00:00 temboard-agent: postgres1: web
postgres 459 445 0 19:57 pts/0 00:00:00 temboard-agent: postgres1: scheduler
postgres 460 445 0 19:57 pts/0 00:00:00 temboard-agent: postgres1: worker pool
postgres 489 460 0 19:57 pts/0 00:00:00 temboard-agent: postgres1: task temboardagent.plugins.dashboard.dashboard_collect
temboard-agent: postgres1: web (PID 445) is the main process of the project.
After Ctrl-C in terminal:
$ ps -efH
root 440 60 0 19:57 pts/0 00:00:00 sudo -Eu postgres DEBUG=y python3 -m temboardagent
postgres 441 440 0 19:57 pts/0 00:00:00 python3 -m temboardagent
Regards, Étienne
That's curious, there's no magic to what hupper is doing there.
So to unpack, there are two potential issues:
- The graceful shutdown isn't finishing successfully, so you're ending up with a forceful kill.
- And then the forceful kill is hanging.
For part 1, hupper is basically hanging onto a subprocess.Popen object and invoking .wait() and .poll() to determine whether the child is dead. If it isn't dead after the shutdown_interval, it invokes .kill() on that Popen object.
For part 2, it sounds like you are telling me that the .kill() is hanging.
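In other words, the shutdown path boils down to something like this (a rough paraphrase of the logic described above, not hupper's actual code):

```python
import subprocess


def stop_worker(process: subprocess.Popen, shutdown_interval: float) -> None:
    try:
        # Part 1: give the child shutdown_interval seconds to exit after SIGINT.
        process.wait(timeout=shutdown_interval)
    except subprocess.TimeoutExpired:
        pass
    if process.poll() is None:
        # Part 2: the child still looks alive, so force-kill it.
        print("Server did not exit, forcefully killing.")
        process.kill()
        process.wait()
```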
Does this sound right? I'm at a loss as to why, if the PID is really gone, wait() would think the child is still alive in the first place. And then, since it thinks the child is still alive, it tries a kill - I'm also at a loss as to why that kill would hang. But first, why does it think the child is alive at all?
The main interesting thing that comes to mind is that you're running a whole process tree under the reloader, not just a single Python process. I wonder if that causes issues with the process group in some way.
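If you want to poke at that, something like this (just a suggestion; the PIDs are taken from your ps -efH output, run it while the tree is still up) would show whether the monitor and the worker tree share a process group:

```python
import os

# 441 = python3 -m temboardagent (monitor), 445 = web, 459 = scheduler, 460 = worker pool
for pid in (441, 445, 459, 460):
    try:
        print(pid, "pgid:", os.getpgid(pid))
    except ProcessLookupError:
        print(pid, "is gone")
```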