Overmind does not detect crashed process

Open magJ opened this issue 1 year ago • 3 comments

I'm using overmind to run three processes. One of them, "api", a Node.js process, ran out of memory and crashed, but overmind still thinks it's running.

app-user@machine:/app$ overmind ps
PROCESS   PID       STATUS
nginx     341       running
worker    343       running
api       346       running
app-user@machine:/app$ ps aux|grep 346
app-user   346  0.0  0.0      0     0 ?        Zs   May09   0:00 [sh] <defunct>
app-user  1092  0.0  0.1   3328  1608 pts/2    S+   01:26   0:00 grep 346

It looks like the "api" process (PID 346) has become a zombie, but overmind has not detected it.

Overmind version: 2.4.0
Operating system: Debian bookworm, based off the docker image node:20.11.1-bookworm-slim, and running on fly.io

This issue happened on two different machines, but I'm really struggling to reproduce it. It might be a tmux issue; it sounds similar to https://github.com/tmux/tmux/issues/311, but I really don't know.

magJ avatar May 10 '24 02:05 magJ

I run into the same issue from time to time. It happened on an earlier version of overmind, and it is still happening after I recently upgraded to the latest, 2.5.1. I think the zombie process is the shell process, which in turn runs the app process.

zhangcheng avatar May 24 '24 05:05 zhangcheng

I spent a day trying to debug this issue without much success. I suspect that it's actually a tmux bug, but I haven't been able to figure out a reliable way to reproduce it.

magJ avatar May 24 '24 05:05 magJ

Hey there,

This is definitely a bug in tmux: it does not handle SIGCHLD properly.

From Overmind's point of view, the process is still running, since Overmind can still send signals to it. The only ways to check whether a process is in the zombie state are to read its state file or to use the ps command. Neither is a good fit for polling at short intervals. And I believe it's not Overmind's duty to kill zombies.
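
For anyone who wants to check this by hand, here is a minimal sketch of the "read its state file" approach. It assumes Linux, where /proc/PID/status contains a line such as "State: Z (zombie)". The function name is_zombie is hypothetical and is not part of Overmind or tmux.

```shell
#!/bin/sh
# is_zombie PID: exit status 0 if PID is currently a zombie, 1 otherwise.
# Hypothetical helper; assumes a Linux /proc filesystem.
is_zombie() {
    # A missing or unreadable status file means the process is gone
    # (or /proc is unavailable), so report "not a zombie".
    [ -r "/proc/$1/status" ] || return 1
    # The State line looks like "State:<TAB>Z (zombie)"; take the one-letter code.
    state=$(awk '/^State:/ { print $2; exit }' "/proc/$1/status")
    [ "$state" = "Z" ]
}
```

With the PID from the report above, the check would be: is_zombie 346 && echo "api shell is a zombie". This is meant as a one-off check, not something to poll in a tight loop.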

The workaround proposed in https://github.com/tmux/tmux/issues/311 should theoretically work: prepend your commands with trap 'pkill -CHLD tmux' 0; or trap 'pkill -CHLD tmux' EXIT;.
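
The workaround depends on the shell running its EXIT trap (0 is another spelling of EXIT) when the wrapping shell exits, whatever happened to the command it ran. The sketch below lets you confirm that behavior locally. The helper name with_exit_trap, the "node server.js" command in the comment, and the echo standing in for pkill -CHLD tmux are all assumptions made for illustration.

```shell
#!/bin/sh
# In a Procfile the workaround would look roughly like this
# ("node server.js" is a hypothetical command):
#   api: trap 'pkill -CHLD tmux' EXIT; node server.js

# with_exit_trap CMD: run CMD in a child shell that has an EXIT trap set.
# "echo notify-tmux" is a harmless stand-in for "pkill -CHLD tmux",
# so this is safe to run outside tmux.
with_exit_trap() {
    sh -c 'trap "echo notify-tmux" EXIT; eval "$1"' sh "$1"
}

# The command fails: the trap still fires as the wrapping shell exits.
with_exit_trap 'false' || true

# The command is killed by a signal: the wrapping shell survives, then
# exits, and the trap fires.
with_exit_trap 'sh -c "kill -KILL \$\$"' || true
```

Note that the trap cannot run if the wrapping shell itself is killed with SIGKILL, since that signal cannot be caught.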

To be honest, Overmind was never meant to run in production; it was developed mostly as a dev tool.

DarthSim avatar May 28 '24 16:05 DarthSim