ocf:heartbeat:nginx cannot restart after SIGKILL sent to nginx master

Open SpitchAG opened this issue 4 years ago • 1 comments

I think something is not right in this agent: if you send a SIGKILL to the nginx master process, the worker processes stay around and one of them ends up listening on the configured listen port, which prevents a new nginx master from starting (bind error, address already in use).

Using reuseport allows the new master to start, but then the old workers are leaked.

A quick workaround would be to fence the host on nginx start failure, but hey, if this can be avoided ...

SpitchAG • Sep 15 '20 19:09

In stop_nginx there is some code that tries to kill remaining processes, but the pgrep -f call is a bit awkward: it doesn't seem to match anything, because the workers are not started with the full command-line arguments.
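To illustrate the pgrep -f point: nginx rewrites its process titles, so a worker shows up as `nginx: worker process`, without the binary path or config arguments that the master's command line carries. Here is a minimal sketch of matching workers by title instead. `list_nginx_workers` is a hypothetical helper, not something from the agent, and it assumes input in the format printed by `ps -eo pid=,args=`.

```shell
# Hypothetical helper (not part of the resource agent).
# Reads "PID ARGS..." lines on stdin, as printed by `ps -eo pid=,args=`,
# and prints the PID of every process titled "nginx: worker process".
list_nginx_workers() {
    awk '$2 == "nginx:" && $3 == "worker" && $4 == "process" { print $1 }'
}

# Possible usage on a live system:
#   ps -eo pid=,args= | list_nginx_workers
```

Note that this matches the workers of every nginx instance on the host, so it would only be safe where a single instance runs.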

I did a quick workaround in stop_nginx: look for any nginx process listening on PORT (if it is provided in the crm resource config). If such a process exists, I kill it and look again, until no worker is listening on the port. If the loop never exits, the stop operation will time out and the node will eventually be fenced; maybe that's acceptable. This also assumes netstat is installed.

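if [ -n "$PORT" ]; then
    while true; do
        # PID(s) of nginx processes still listening on $PORT (needs netstat).
        # Double quotes so $PORT expands; trailing space avoids matching e.g. 8080 for 80.
        pid=$(netstat -pnlt | grep ":$PORT " | grep nginx | awk '{ print $7 }' | awk -F/ '{ print $1 }')
        if [ -n "$pid" ]; then
            ocf_log warn "killing WORKER PID $pid"
            kill $pid    # unquoted on purpose: $pid may hold several PIDs
            sleep 1
        else
            break
        fi
    done
fi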

This seems to work fine in my setup; I don't know whether there are use cases where it won't work.
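One case that might need care is the port match itself: a plain grep on the port can also hit other ports that share the same digits, and the same PID can appear once per listening socket. As a sketch only, the lookup could match the local-address column exactly and de-duplicate the PIDs. `nginx_listener_pids` is a hypothetical helper name, and the column positions assume the usual Linux net-tools `netstat -pnlt` layout (local address in column 4, state in column 6, PID/program in column 7).

```shell
# Hypothetical helper (not part of the resource agent).
# Usage: netstat -pnlt | nginx_listener_pids PORT
# Prints the unique PIDs of nginx processes listening on exactly PORT.
nginx_listener_pids() {
    awk -v port="$1" '
        # Keep LISTEN rows whose local address ends in ":PORT"
        # and whose PID/program column names nginx.
        $6 == "LISTEN" && $4 ~ (":" port "$") && $7 ~ /^[0-9]+\/nginx/ {
            split($7, parts, "/")   # "1234/nginx:" -> parts[1] = "1234"
            print parts[1]
        }' | sort -u
}
```

In the loop, `pid=$(netstat -pnlt | nginx_listener_pids "$PORT")` would then take the place of the grep/awk pipeline.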

SpitchAG • Sep 16 '20 13:09