nginx-gateway-fabric icon indicating copy to clipboard operation
nginx-gateway-fabric copied to clipboard

NGF Pod cannot recover if NGINX master process fails without cleaning up

Open bjee19 opened this issue 1 year ago • 8 comments

Describe the bug When the NGINX master process fails without cleaning up (kill -9 <nginx-master-pid>), the NGF Pod cannot recover because the new NGINX container cannot start.

To Reproduce Steps to reproduce the behavior:

  1. Change runAsNonRoot from true to false in deploy/manifests/nginx-gateway.yaml
  2. Deploy and expose NGF
  3. Insert an ephemeral container into the NGF Pod using this command: kubectl debug -it -n nginx-gateway <NGF_POD> --image=busybox:1.28 --target=nginx-gateway
  4. Run kill -9 <nginx-master-PID> in the ephemeral container
  5. Check the logs of the nginx container in a different terminal by running: kubectl logs -f -n nginx-gateway <NGF_POD> -c nginx

Expected behavior The NGINX container should restart and the NGF Pod should recover.

Your environment

  • Version of the NGINX Gateway Fabric - "version":"edge","commit":"72b6c6ef8915c697626eeab88fdb6a3ce15b8da0"
  • Version of Kubernetes - 1.27
  • Kubernetes platform (e.g. Mini-kube or GCP) - GKE
  • Details on how you expose the NGINX Gateway Fabric Pod - Loadbalancer

Additional context Log file of nginx container showing error: image

bjee19 avatar Oct 02 '23 22:10 bjee19

Another way to reproduce a similar error (majority of the time) is to

  1. Have NGF deployed on a Kind cluster
  2. Run docker restart kind-control-plane or manually restart the container in the Docker dashboard.

docker restart will send a SIGTERM signal to the processes in the container, but will forcibly shutdown after a short time. I believe that this forced restart is the same as the NGINX master process fails without cleaning up.

bjee19 avatar Oct 05 '23 20:10 bjee19

Is this being worked on? Would it help if you'd take the fix from #1532 into the helm chart temporarily, with a flag in the values.yaml, until it's resolved in a better way?

AlexEndris avatar Mar 06 '24 14:03 AlexEndris

Hey @AlexEndris, we actually haven't prioritized this because the only way we could cause it to happen is to kill the nginx process in the pod, which would be a highly unusual case.

I assume you are running into this issue yourself? Can you describe under what circumstances this occurs for you?

mpstefan avatar Mar 06 '24 20:03 mpstefan

Yes. But I realise it might be a sort of edge case scenario that might not even happen. Essentially, we package a small k8s cluster using k3s. We shut it down and ship it. Upon restarting, nginx doesn't recover and we would need to manually kill the pod/restart the deployment to get it running again. The issue is, I don't have access to that, when it's shipped, and they want an out of the box experience.

AlexEndris avatar Mar 06 '24 20:03 AlexEndris

Yes. But I realise it might be a sort of edge case scenario that might not even happen. Essentially, we package a small k8s cluster using k3s. We shut it down and ship it. Upon restarting, nginx doesn't recover and we would need to manually kill the pod/restart the deployment to get it running again. The issue is, I don't have access to that, when it's shipped, and they want an out of the box experience.

Thanks for the details @AlexEndris. I added this issue to our community triage meeting agenda scheduled for Monday. We will discuss it then. If you'd like to join, the meeting info is here.

kate-osborn avatar Mar 06 '24 22:03 kate-osborn

@AlexEndris We discussed this during our community meeting and we think we can take a look at it in our next release. We'd like to first look at the fix in #1532 to see if we can solve the problem in the code, which shouldn't be too bad.

Once we do fix it, you can pull the edge release so you can get the fix before we do another full release if you're looking for something soon.

Thanks for letting us know!

mpstefan avatar Mar 11 '24 16:03 mpstefan

Thank you very much! It's highly appreciated!

AlexEndris avatar Mar 12 '24 12:03 AlexEndris

When completed, should remove the Skip() of It("recovers when nginx container is restarted"... added in https://github.com/nginxinc/nginx-gateway-fabric/pull/1832

bjee19 avatar May 01 '24 15:05 bjee19