Ctrl-C hangs with "Server did not exit, forcefully killing."
Hi,
Thanks for hupper, it's awesome!
I'm hitting a weird behaviour, similar to #47. I'm using hupper in https://github.com/dalibo/temboard, through the hupper API: https://github.com/dalibo/temboard/blob/c35895bf79a65758270de6fa5ea73869f17cd82f/agent/temboardagent/cli/app.py#L196-L210
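For context, the wiring is roughly this (a simplified sketch of the linked code; the dotted entry-point path and the DEBUG check are placeholders, the real logic lives in cli/app.py):

```python
import os
import sys

import hupper


def main(argv=sys.argv[1:]):
    if os.environ.get("DEBUG") == "y" and not hupper.is_active():
        # In the parent this call starts the monitor, which re-executes the
        # entry point in a watched child process.
        hupper.start_reloader("temboardagent.__main__.main")
    # ...normal agent startup (web, scheduler, worker pool) continues here...
```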
In a Debian container, with Python 3.9, I enable hupper like this:
root@d303b92213a0:/var/lib/temboard-agent# sudo -Eu postgres DEBUG=y python3 -m temboardagent
13:29:05 temboardagent[1239] INFO: app: Starting temboard-agent 9.0.dev0.
13:29:05 temboardagent[1239] DEBUG: taskmanager: Register worker discover
...
Starting monitor for PID 1243.
...
13:29:05 temboardagent[1243] INFO: app: Starting temboard-agent 9.0.dev0.
...
Hot reloading works well. Here are the logs when I hit Ctrl-C:
^CReceived SIGINT, waiting for server to exit ...
13:29:08 temboardagent[1243] INFO: services: Interrupted.
13:29:08 temboardagent[1257] INFO: services: Interrupted.
13:29:08 temboardagent[1258] INFO: services: Interrupted.
13:29:08 temboardagent[1258] INFO: taskmanager: Aborting jobs.
13:29:08 temboardagent[1258] DEBUG: services: Done. service=worker pool
13:29:08 temboardagent[1243] DEBUG: services: Terminating background service. service=scheduler pid=1257
13:29:08 temboardagent[1243] DEBUG: services: Terminating background service. service=worker pool pid=1258
13:29:09 temboardagent[1243] DEBUG: services: Waiting background services.
13:29:09 temboardagent[1257] DEBUG: services: Done. service=scheduler
Server did not exit, forcefully killing.
Then hupper hangs forever. I need to type Ctrl-C a second time.
I tried to call graceful_shutdown() from a sigint_handler without success.
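Roughly what I tried (the handler name is mine; graceful_shutdown() is called on the reloader proxy returned by hupper.get_reloader()):

```python
import signal

import hupper


def sigint_handler(signum, frame):
    # Ask the hupper monitor not to restart us, then let the usual
    # KeyboardInterrupt path tear the services down.
    if hupper.is_active():
        hupper.get_reloader().graceful_shutdown()
    raise KeyboardInterrupt()


signal.signal(signal.SIGINT, sigint_handler)
```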
Do you have a clue about what I might be doing wrong?
Regards, Étienne
It appears that your child process is receiving the SIGINT, shutting down, but not actually exiting. Thus hupper is then killing it after shutdown_interval time. Can you tell if the child's PID is gone after the "services: Done. service=scheduler" message, prior to the "Server did not exit, forcefully killing." message?
I'm pretty sure the child's PID is gone by then. Before SIGINT:
$ ps -efH
root 60 0 0 19:55 pts/0 00:00:00 /bin/bash
root 440 60 0 19:57 pts/0 00:00:00 sudo -Eu postgres DEBUG=y python3 -m temboardagent
postgres 441 440 6 19:57 pts/0 00:00:00 python3 -m temboardagent
postgres 445 441 6 19:57 pts/0 00:00:00 temboard-agent: postgres1: web
postgres 459 445 0 19:57 pts/0 00:00:00 temboard-agent: postgres1: scheduler
postgres 460 445 0 19:57 pts/0 00:00:00 temboard-agent: postgres1: worker pool
postgres 489 460 0 19:57 pts/0 00:00:00 temboard-agent: postgres1: task temboardagent.plugins.dashboard.dashboard_collect
temboard-agent: postgres1: web (PID 445) is the main process of the project.
After Ctrl-C in terminal:
$ ps -efH
root 440 60 0 19:57 pts/0 00:00:00 sudo -Eu postgres DEBUG=y python3 -m temboardagent
postgres 441 440 0 19:57 pts/0 00:00:00 python3 -m temboardagent
Regards, Étienne
That's curious, there's no magic to what hupper is doing there.
So to unpack, there are two potential issues:
- The graceful shutdown isn't finishing successfully, so you're ending up with a forceful kill.
- And then the forceful kill is hanging.
For part 1, hupper is basically hanging onto a subprocess.Popen object and invoking .wait() and .poll() to determine whether the child is dead. If it isn't dead after the shutdown_interval, it invokes .kill() on that Popen object.
For part 2, it sounds like you are telling me that the .kill() is hanging.
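In other words, the shutdown path boils down to something like this (a rough paraphrase of the logic described above, not hupper's actual code):

```python
import subprocess


def stop_worker(process: subprocess.Popen, shutdown_interval: float) -> None:
    try:
        # Part 1: give the child shutdown_interval seconds to exit after SIGINT.
        process.wait(timeout=shutdown_interval)
    except subprocess.TimeoutExpired:
        pass
    if process.poll() is None:
        # Part 2: the child still looks alive, so force-kill it.
        print("Server did not exit, forcefully killing.")
        process.kill()
        process.wait()
```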
Does this sound right? I'm at a loss as to why, if the PID is really gone, wait() would think the child is still alive in the first place. And then, since it thinks the child is still alive, it tries a kill - I'm also at a loss as to why that kill would hang. But first, why does it think the child is alive at all?
The main interesting thing that comes to mind is that you're running a whole process tree under the reloader, not just a single Python process. I wonder if that causes issues with the process group in some way.
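If you want to poke at that, something like this (just a suggestion; the PIDs are taken from your ps -efH output, run it while the tree is still up) would show whether the monitor and the worker tree share a process group:

```python
import os

# 441 = python3 -m temboardagent (monitor), 445 = web, 459 = scheduler, 460 = worker pool
for pid in (441, 445, 459, 460):
    try:
        print(pid, "pgid:", os.getpgid(pid))
    except ProcessLookupError:
        print(pid, "is gone")
```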