NGF Pod cannot recover if NGINX master process fails without cleaning up
Describe the bug
When the NGINX master process fails without cleaning up (`kill -9 <nginx-master-pid>`), the NGF Pod cannot recover because the new NGINX container cannot start.
To Reproduce
Steps to reproduce the behavior:
- Change `runAsNonRoot` from `true` to `false` in `deploy/manifests/nginx-gateway.yaml`
- Deploy and expose NGF
- Insert an ephemeral container into the NGF Pod using this command:
  `kubectl debug -it -n nginx-gateway <NGF_POD> --image=busybox:1.28 --target=nginx-gateway`
- Run `kill -9 <nginx-master-PID>` in the ephemeral container (a sketch for locating the master PID follows these steps)
- Check the logs of the nginx container in a different terminal by running:
  `kubectl logs -f -n nginx-gateway <NGF_POD> -c nginx`
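A minimal sketch of locating the master PID from inside the ephemeral busybox container, assuming the debug container can see the nginx processes (it shares a process namespace with the target set via `--target` above):

```shell
# Inside the ephemeral busybox container:
ps | grep 'nginx: master'    # note the PID of the "nginx: master process" entry
kill -9 <nginx-master-PID>   # simulate the master process dying without cleaning up
```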
Expected behavior
The NGINX container should restart and the NGF Pod should recover.
Your environment
- Version of the NGINX Gateway Fabric - `"version":"edge","commit":"72b6c6ef8915c697626eeab88fdb6a3ce15b8da0"`
- Version of Kubernetes - 1.27
- Kubernetes platform (e.g. Minikube or GCP) - GKE
- Details on how you expose the NGINX Gateway Fabric Pod - LoadBalancer
Additional context
Log file of the nginx container showing the error:
Another way to reproduce a similar error (most of the time) is to:
- Have NGF deployed on a Kind cluster
- Run `docker restart kind-control-plane` or manually restart the container in the Docker dashboard (see the sketch below)

`docker restart` sends a SIGTERM signal to the processes in the container, but forcibly shuts them down after a short timeout. I believe this forced restart is equivalent to the NGINX master process failing without cleaning up.
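A rough sketch of this alternative path, assuming the default Kind node container name `kind-control-plane` and the same NGF namespace and Pod placeholder as above:

```shell
# Restart the Kind control-plane node container; docker restart sends
# SIGTERM and then force-kills remaining processes after the stop timeout.
docker restart kind-control-plane

# Once the node is back, watch the nginx container logs to see whether it recovers:
kubectl logs -f -n nginx-gateway <NGF_POD> -c nginx
```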
Is this being worked on? Would it help if you took the fix from #1532 into the Helm chart temporarily, with a flag in the values.yaml, until it's resolved in a better way?
Hey @AlexEndris, we actually haven't prioritized this because the only way we could cause it to happen is to kill the nginx process in the pod, which would be a highly unusual case.
I assume you are running into this issue yourself? Can you describe under what circumstances this occurs for you?
Yes. But I realise it might be a sort of edge-case scenario that might not even happen. Essentially, we package a small k8s cluster using k3s. We shut it down and ship it. Upon restarting, nginx doesn't recover, and we would need to manually kill the pod or restart the deployment to get it running again. The issue is, I don't have access to that once it's shipped, and they want an out-of-the-box experience.
Thanks for the details @AlexEndris. I added this issue to our community triage meeting agenda scheduled for Monday. We will discuss it then. If you'd like to join, the meeting info is here.
@AlexEndris We discussed this during our community meeting and we think we can take a look at it in our next release. We'd like to first look at the fix in #1532 to see if we can solve the problem in the code, which shouldn't be too bad.
Once we do fix it, you can pull the edge release to get the fix before the next full release, if you're looking for something soon.
Thanks for letting us know!
Thank you very much! It's highly appreciated!
When completed, should remove the Skip() of It("recovers when nginx container is restarted"... added in https://github.com/nginxinc/nginx-gateway-fabric/pull/1832.