
v4.11.1 unexpected error obtaining nginx status info

Open · Kampe opened this issue 1 year ago • 3 comments

Seeing issues during NGINX startup; the logs don't say much about why the healthcheck requests are failing.

I0727 00:08:28.342380       7 nginx.go:317] "Starting NGINX process"
I0727 00:08:28.342455       7 leaderelection.go:250] attempting to acquire leader lease ingress-nginx/ingress-nginx-internal-leader...
I0727 00:08:28.342749       7 nginx.go:337] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"
I0727 00:08:28.345201       7 controller.go:193] "Configuration changes detected, backend reload required"
I0727 00:08:28.358021       7 status.go:85] "New leader elected" identity="ingress-nginx-internal-controller-67bfb7fd4b-nzkdt"
2024/07/27 00:08:35 Get "http://127.0.0.1:10246/nginx_status": dial tcp 127.0.0.1:10246: connect: connection refused
W0727 00:08:35.677958       7 nginx_status.go:171] unexpected error obtaining nginx status info: Get "http://127.0.0.1:10246/nginx_status": dial tcp 127.0.0.1:10246: connect: connection refused
2024/07/27 00:09:05 Get "http://127.0.0.1:10246/nginx_status": dial tcp 127.0.0.1:10246: connect: connection refused
W0727 00:09:05.683341       7 nginx_status.go:171] unexpected error obtaining nginx status info: Get "http://127.0.0.1:10246/nginx_status": dial tcp 127.0.0.1:10246: connect: connection refused
I0727 00:09:07.380630       7 controller.go:213] "Backend successfully reloaded"
I0727 00:09:07.380716       7 controller.go:224] "Initial sync, sleeping for 1 second"
I0727 00:09:07.380802       7 event.go:377] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"ingress-nginx-internal-controller-dbcc4dc9c-29mpv", UID:"4ee6bf1d-df1f-4bb4-8e37-04d6978dfd6d", APIVersion:"v1", ResourceVersion:"214163955", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0727 00:09:08.382382       7 controller.go:244] Dynamic reconfiguration failed (retrying; 15 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
W0727 00:09:09.394353       7 controller.go:244] Dynamic reconfiguration failed (retrying; 14 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
W0727 00:09:10.797697       7 controller.go:244] Dynamic reconfiguration failed (retrying; 13 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
W0727 00:09:12.616922       7 controller.go:244] Dynamic reconfiguration failed (retrying; 12 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
W0727 00:09:14.913299       7 controller.go:244] Dynamic reconfiguration failed (retrying; 11 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
I0727 00:09:16.276657       7 sigterm.go:36] "Received SIGTERM, shutting down"
I0727 00:09:16.276928       7 nginx.go:393] "Shutting down controller queues"
I0727 00:09:16.289355       7 nginx.go:401] "Stopping admission controller"
E0727 00:09:16.289652       7 nginx.go:340] "Error listening for TLS connections" err="http: Server closed"
I0727 00:09:16.289815       7 nginx.go:409] "Stopping NGINX process"
W0727 00:09:17.931239       7 controller.go:244] Dynamic reconfiguration failed (retrying; 10 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
W0727 00:09:21.837363       7 controller.go:244] Dynamic reconfiguration failed (retrying; 9 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
W0727 00:09:26.847362       7 controller.go:244] Dynamic reconfiguration failed (retrying; 8 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
W0727 00:09:33.648965       7 controller.go:244] Dynamic reconfiguration failed (retrying; 7 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
2024/07/27 00:09:16 [notice] 2486#2486: ModSecurity-nginx v1.0.3 (rules loaded inline/local/remote: 0/14418/0)
2024/07/27 00:09:16 [notice] 2486#2486: signal process started
W0727 00:09:41.869474       7 controller.go:244] Dynamic reconfiguration failed (retrying; 6 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
W0727 00:09:53.470106       7 controller.go:244] Dynamic reconfiguration failed (retrying; 5 retries left): Post "http://127.0.0.1:10246/configuration/backends": dial tcp 127.0.0.1:10246: connect: connection refused
I0727 00:09:59.244212       7 nginx.go:422] "NGINX process has stopped"
I0727 00:09:59.244234       7 sigterm.go:44] Handled quit, delaying controller exit for 10 seconds
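
For reference, one way to check whether the internal status endpoint from the errors above is reachable at all is to query it from inside the controller pod (pod name taken from the log above; this assumes curl is present in the controller image, otherwise busybox wget -qO- may work):

  kubectl -n ingress-nginx exec -it ingress-nginx-internal-controller-dbcc4dc9c-29mpv -- \
    curl -sS http://127.0.0.1:10246/nginx_status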

What happened:

Upgraded my helm chart from v4.10.0 to v4.11.1
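
(We deploy via Argo CD, but an equivalent chart upgrade with the Helm CLI would look roughly like this; the release name, namespace, and values file below are placeholders for our actual setup.)

  helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
  helm repo update
  helm upgrade ingress-nginx-internal ingress-nginx/ingress-nginx \
    --namespace ingress-nginx \
    --version 4.11.1 \
    -f values.yaml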

What you expected to happen:

All pods are replaced and working without issue.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

NGINX Ingress controller
  Release:       v1.11.1
  Build:         7c44f992012555ff7f4e47c08d7c542ca9b4b1f7
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.25.5
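
Obtained by exec-ing into a controller pod as the template suggests, e.g. (pod name is a placeholder):

  kubectl -n ingress-nginx exec -it <controller-pod> -- nginx-ingress-controller --version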

Kubernetes version (use kubectl version):

Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-eks-036c24b

Environment: AWS EKS

  • How was the ingress-nginx-controller installed:
values: |
        fullnameOverride: ingress-nginx-internal
        controller:
          replicaCount: 3
          autoscaling:
            enabled: true
            minReplicas: 3
            targetCPUUtilizationPercentage: 80
            targetMemoryUtilizationPercentage: 80
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
          ingressClassResource:
            name: "nginx-internal"
            controllerValue: "k8s.io/ingress-nginx-internal"
            enabled: true
            default: true
          opentelemetry:
            enabled: true
          admissionWebhooks:
            timeoutSeconds: 30

          config:
            allow-snippet-annotations: "true"
            otlp-collector-host: "opentelemetry-collector.monitoring.svc"
            otlp-collector-port: "4317"
            enable-opentelemetry: "true"
            otel-sampler: "AlwaysOn"
            otel-sampler-ratio: "1.0"
            enable-underscores-in-headers: "true"
            opentelemetry-config: "/etc/nginx/opentelemetry.toml"
            opentelemetry-operation-name: "HTTP $request_method $service_name $uri"
            opentelemetry-trust-incoming-span: "false"
            otel-sampler-parent-based: "false"
            otel-max-queuesize: "2048"
            otel-schedule-delay-millis: "5000"
            otel-max-export-batch-size: "512"
            server-snippet: |
              opentelemetry_attribute "ingress.namespace" "$namespace";
              opentelemetry_attribute "ingress.service_name" "$service_name";
              opentelemetry_attribute "ingress.name" "$ingress_name";
              opentelemetry_attribute "ingress.upstream" "$proxy_upstream_name";

          metrics:
            enabled: true
            serviceMonitor:
              enabled: true
          service:
            public: false
            subdomain: "ingress-internal"
            external:
              enabled: false
            internal:
              enabled: true
              annotations: 
                service.beta.kubernetes.io/aws-load-balancer-type: nlb-ip
                service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
                service.beta.kubernetes.io/aws-load-balancer-scheme: internal
                service.beta.kubernetes.io/aws-load-balancer-internal: "true"
                service.beta.kubernetes.io/aws-load-balancer-attributes: deletion_protection.enabled=true
  • Current State of the controller:
Name:         nginx-internal
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx-internal
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.11.1
              argocd.argoproj.io/instance=ingress-nginx-internal
              helm.sh/chart=ingress-nginx-4.11.1
Annotations:  argocd.argoproj.io/tracking-id: ingress-nginx-internal:networking.k8s.io/IngressClass:ingress-nginx/nginx-internal
              ingressclass.kubernetes.io/is-default-class: true
Controller:   k8s.io/ingress-nginx-internal
Events:       <none>
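
(The state above was presumably gathered with something like the following; the resource name matches the ingressClassResource configured in the values:)

  kubectl describe ingressclass nginx-internal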

— Kampe, Jul 27 '24

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and providing further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

— k8s-ci-robot, Jul 27 '24

/remove-kind bug
/kind support

Please try adding the AWS-documented annotation related to security groups. It could be that you are blocking required ports, so check that the required ports are open inside the cluster (look at the port fields on the controller pod for the port numbers).
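
For example (a sketch; the pod name is illustrative, taken from the log above), the container ports declared on the controller pod can be listed with:

  kubectl -n ingress-nginx get pod ingress-nginx-internal-controller-dbcc4dc9c-29mpv \
    -o jsonpath='{range .spec.containers[*].ports[*]}{.name}{"\t"}{.containerPort}{"\n"}{end}'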

You have not answered the questions asked in the new-issue template, so there is nothing to debug and analyze here. Answering those questions would help.

/triage needs-information

— longwuyuan, Jul 27 '24

This is stale, but we won't close it automatically; just bear in mind the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any question or request to prioritize this, please reach out on #ingress-nginx-dev on Kubernetes Slack.

— github-actions[bot], Aug 26 '24

We are currently experiencing the same issue after upgrading from 4.10 to 4.11.3. What kind of information could we provide to help debug this @longwuyuan? We have rolled back to 4.10.5 for the time being.

We are on AWS, using EKS (same as OP), running Kubernetes version 1.30.4 on most of our worker nodes (in contrast to OP's 1.29).

We have a staging cluster where we can reproduce the issue so I can provide any information that might be useful without impacting our day-to-day operations.

— naanselmo, Oct 29 '24

I updated the docs with some AWS-related annotations specific to healthchecks: https://kubernetes.github.io/ingress-nginx/deploy/ . @naanselmo, you can check whether it relates.
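
For reference, the kind of healthcheck-related Service annotations meant here would look roughly like the following (annotation names are from the AWS load balancer documentation and the values are purely illustrative; verify both against the linked deploy guide before using):

  service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "http"
  service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/healthz"
  service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "10254"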

From history, one thing is clear: the error message in the controller log is precise in indicating the root cause. Interpreting that root cause as a blocked port or as a temporary failure to establish the healthcheck connection depends on data from the cluster user.

— longwuyuan, Oct 29 '24

This is stale, but we won't close it automatically; just bear in mind the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any question or request to prioritize this, please reach out on #ingress-nginx-dev on Kubernetes Slack.

— github-actions[bot], Dec 06 '24

Seeing a similar issue; still investigating, but it seems to be linked to having nginx.ingress.kubernetes.io/enable-modsecurity: "true" on many Ingresses: https://github.com/kubernetes/ingress-nginx/issues/12927
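
That annotation is set per Ingress; an illustrative snippet (resource name is made up):

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: example
    annotations:
      nginx.ingress.kubernetes.io/enable-modsecurity: "true"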

— champtar, Feb 11 '25