ingress-nginx
Possible race condition in startup and readiness probe
What happened:
In the sample bare-metal deploy it is suggested to use a Deployment with the following readiness probe:
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
while updating some ingress-nginx settings with:
replicas: 2
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
I accidentally managed to catch the following:
➜ curl -v -k https://<redacted ip>
* Trying <redacted ip>:443...
* TCP_NODELAY set
* connect to <redacted ip> port 443 failed: Connection refused
* Failed to connect to <redacted ip> port 443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to <redacted ip> port 443: Connection refused
which led me to think that, possibly, just possibly, /healthz responded before nginx had bound to port 443 and started listening on it.
Do you think there is a chance of that?
If so, maybe the readiness probe should be switched to a TCP port check (443)?
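For illustration, something along these lines is what I have in mind (only a sketch; the probe timings are copied from above, and port 443 assumes the controller container actually terminates HTTPS on 443 in this bare-metal setup):
readinessProbe:
  tcpSocket:
    port: 443   # assumption: 443 is the container's HTTPS port in my deployment
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1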
What you expected to happen:
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
NGINX Ingress controller
Release: v1.3.0
Build: 2b7b74854d90ad9b4b96a5011b9e8b67d20bfb8f
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.10
Kubernetes version (use kubectl version): v1.23.8
Environment:
- Cloud provider or hardware configuration: bare metal, with MetalLB
- OS (e.g. from /etc/os-release): Ubuntu 20.04
- Kernel (e.g. uname -a): 5.4.0-120-generic #136-Ubuntu SMP Fri Jun 10 13:40:48 UTC 2022 x86_64 Linux
Install tools:
- Please mention how/where the cluster was created, e.g. kubeadm/kops/minikube/kind etc.
Basic cluster related info:
- kubectl version
- kubectl get nodes -o wide
How was the ingress-nginx-controller installed:
- If helm was used then please show output of helm ls -A | grep -i ingress
- If helm was used then please show output of helm -n <ingresscontrollernamespace> get values <helmreleasename>
- If helm was not used, then copy/paste the complete precise command used to install the controller, along with the flags and options used
- If you have more than one instance of the ingress-nginx-controller installed in the same cluster, please provide details for all the instances
Current State of the controller:
- kubectl describe ingressclasses
- kubectl -n <ingresscontrollernamespace> get all -A -o wide
- kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
- kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
Current state of ingress object, if applicable:
- kubectl -n <appnamespace> get all,ing -o wide
- kubectl -n <appnamespace> describe ing <ingressname>
- If applicable, then your complete and exact curl/grpcurl command (redacted if required) and the response to the curl/grpcurl command with the -v flag
Others:
- Any other related information, e.g. copy/paste of the snippet (if applicable)
- kubectl describe ... of any custom configmap(s) created and in use
- Any other related information that may help
How to reproduce this issue:
Deploy ingress-nginx with replicas=2 and maxUnavailable=1, change the deployment configuration, and while it rolls out the new pods, bombard the pods with requests.
Anything else we need to know:
@zerkms: This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-kind bug
I am unable to reproduce this.
@longwuyuan that is not surprising:
- Your demo ingress-nginx is empty, so it takes much less time for nginx to start
- You emit requests very infrequently, once every 1 or 2 seconds; if there is a race condition, its window would be milliseconds
Does the code guarantee that the healthz port only starts serving after nginx has fully initialised and is listening?
In source code I found this:
- https://github.com/kubernetes/ingress-nginx/blob/a581a7bebc1f4ff028f1e57dca0ce95abef78c62/cmd/dataplane/main.go#L92-L96
- https://github.com/kubernetes/ingress-nginx/blob/a581a7bebc1f4ff028f1e57dca0ce95abef78c62/cmd/nginx/main.go#L152-L163
which starts the healthz handler concurrently with nginx, with no synchronisation whatsoever. Hence the question: how can healthz be used as a readiness check?
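If it cannot, then as a workaround (purely a sketch; it assumes /bin/sh, curl and nc are actually present in the controller image, which I have not checked) the readiness probe could require both the healthz endpoint and the TLS port:
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # assumption: curl and nc exist in the image; 10254 and 443 are the ports from my deployment
      - curl -sf http://127.0.0.1:10254/healthz && nc -z 127.0.0.1 443
  initialDelaySeconds: 10
  periodSeconds: 10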
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I get the same behavior when using Tilt to deploy my applications locally, one of which uses ingress-nginx. I used a resource dependency in Tilt to make the Ingress resource creation wait for the ingress-nginx-controller Deployment to be ready. Unfortunately, the Ingress creation always fails with the same connection refused error mentioned above.
Update: setting failurePolicy: 'Ignore' in the ValidatingWebhookConfiguration seems to have helped.
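For reference, the change was roughly this (an excerpt only; the object and webhook names are the defaults from the standard manifests in my install and may differ in yours):
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: ingress-nginx-admission   # assumption: default name from the standard manifests
webhooks:
  - name: validate.nginx.ingress.kubernetes.io
    failurePolicy: Ignore   # was Fail; Ignore lets Ingress objects be admitted even while the webhook is unreachable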
I had the same symptoms of some requests failing with Connection refused during rollouts, and it looks like the reason I was getting these was not the new pods mistakenly being marked as Ready, but old pods that were about to be terminated still having traffic routed to them. Setting the Helm value controller.extraArgs.shutdown-grace-period as low as 10 seemed to help in my case, tested with a request interval of 0.5s during rollouts. @zerkms do you think you could give it a try also?
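In case it is useful, the values override I tested looked roughly like this (a sketch; as far as I understand, the chart renders controller.extraArgs entries as --key=value controller flags):
controller:
  extraArgs:
    shutdown-grace-period: 10   # should be rendered as --shutdown-grace-period=10 on the controller, per my reading of the chart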