
Status is reporting the wrong number of connections?

nrobert13 opened this issue 3 years ago • 9 comments

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

bash-5.1$ /nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.1.0
  Build:         cacbee86b6ccc45bde8ffc184521bed3022e7dee
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.9

-------------------------------------------------------------------------------

Kubernetes version (use kubectl version):

kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.15-gke.1000", GitCommit:"d71f5620130949cf5f74de04e6ae8f3a96e4b718", GitTreeState:"clean", BuildDate:"2022-02-02T09:21:18Z", GoVersion:"go1.15.15b5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: GCP

  • OS (e.g. from /etc/os-release): COS

  • Kernel (e.g. uname -a): Linux ingress-nginx-machines-controller-7f54d7c564-zv5fp 5.4.144+ #1 SMP Sat Sep 25 09:56:01 PDT 2021 x86_64 Linux

  • How was the ingress-nginx-controller installed:

# helm ls -A | grep ingress

ingress-nginx-1             	ingress-nginx               	12      	2022-02-02 12:01:24.840632114 +0100 CET 	deployed	ingress-nginx-4.0.11        	1.1.0
ingress-nginx-2             	ingress-nginx               	2       	2022-02-02 11:59:48.330927655 +0100 CET 	deployed	ingress-nginx-4.0.11        	1.1.0
ingress-nginx-machines      	ingress-nginx               	9       	2022-03-29 11:34:23.273927 +0200 CEST  deployed	ingress-nginx-4.0.11        	1.1.0

# helm -n ingress-nginx get values ingress-nginx-machines
USER-SUPPLIED VALUES:
controller:
  admissionWebhooks:
    enabled: false
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-nodepool
            operator: In
            values:
            - gke-pool-2-prod-2
  autoscaling:
    enabled: true
    maxReplicas: 18
    minReplicas: 2
    targetCPUUtilizationPercentage: 50
    targetMemoryUtilizationPercentage: 70
  config:
    http2-max-requests: "1000000000"
    log-format-escape-json: "true"
    log-format-upstream: '{"body_bytes_sent": $body_bytes_sent, "http_referer": "$http_referer",
      "proxy_upstream_name": "$proxy_upstream_name", "upstream_addr": "$upstream_addr",
      "upstream_response_length": $upstream_response_length, "upstream_response_time":
      $upstream_response_time, "upstream_status": $upstream_status, "upstream_connect_time":
      $upstream_connect_time, "upstream_header_time": $upstream_header_time, "time_iso8601":
      "$time_iso8601", "proxy_add_x_forwarded_for": "$proxy_add_x_forwarded_for",
      "remote_user": "$remote_user", "bytes_sent": $bytes_sent, "request_time": $request_time,
      "status": $status, "host": "$host", "request_length": $request_length, "http_referer":
      "$http_referer", "http_user_agent": "$http_user_agent", "remote_addr": "$remote_addr",
      "request_time": $request_time, "request": "$request"}'
    proxy-next-upstream: "off"
    proxy-next-upstream-tries: "1"
    ssl-ciphers: ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-ECDSA-RC4-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-RSA-RC4-SHA
    ssl-protocols: TLSv1 TLSv1.1 TLSv1.2 TLSv1.3
    update-status: false
  electionID: ingress-controller-leader-machines
  extraArgs:
    update-status: false
  ingressClassResource:
    controllerValue: k8s.io/ingress-nginx-machines
    default: false
    name: nginx-machines
  metrics:
    enabled: true
    service:
      annotations: {}
      omitClusterIP: true
      prometheus.io/port: "10254"
      prometheus.io/scrape: "true"
      servicePort: 9913
      type: ClusterIP
    serviceMonitor:
      additionalLabels:
        release: prometheus
      enabled: true
      metricRelabelings:
      - action: drop
        regex: nginx_ingress_controller_(response_size_bucket|bytes_sent_bucket|request_size_bucket|request_duration_seconds_bucket|response_duration_seconds_bucket)
        sourceLabels:
        - __name__
      - regex: redirect-service
        replacement: "false"
        sourceLabels:
        - exported_namespace
        targetLabel: __tmp_keep_me
      - regex: nginx_ingress_controller_requests
        replacement: "true"
        sourceLabels:
        - __name__
        targetLabel: __tmp_keep_me
      - action: drop
        regex: "false"
        sourceLabels:
        - __tmp_keep_me
      - sourceLabels:
        - exported_namespace
        targetLabel: namespace
  minAvailable: 1
  minReadySeconds: 5
  replicaCount: 2
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 1000m
      memory: 1Gi
  service:
    externalTrafficPolicy: Local
    loadBalancerIP: 35.189.221.132
  updateStrategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  watchIngressWithoutClass: false
defaultBackend:
  enabled: true
  resources:
    limits:
      cpu: 10m
      memory: 20Mi
    requests:
      cpu: 10m
      memory: 20Mi
fullnameOverride: ingress-nginx-machines
nameOverride: ingress-nginx-machines
tcp:
  "8443": ingress-nginx/ingress-nginx-machines-controller:443

  • Current state of ingress object, if applicable:
  • kubectl -n <appnamespace> get all,ing -o wide
    • kubectl -n <appnamespace> describe ing <ingressname>
  • If applicable, your complete and exact curl/grpcurl command (redacted if required) and the response to the curl/grpcurl command with the -v flag
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: shortchain-letsencrypt
    meta.helm.sh/release-name: device-config-prod
    meta.helm.sh/release-namespace: device-config-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "14400"

What happened:

The Prometheus metrics report more connections than netstat actually shows. In this case roughly 45% of the reported connections are excess: netstat shows ~7,700 established connections while nginx reports ~13,400 active ones.

bash-5.1$ curl -s localhost:10254/metrics | grep -E "^nginx_ingress_controller_nginx_process_connections.*active"
nginx_ingress_controller_nginx_process_connections{controller_class="k8s.io/ingress-nginx-machines",controller_namespace="ingress-nginx",controller_pod="ingress-nginx-machines-controller-7f54d7c564-zv5fp",state="active"} 13406
bash-5.1$ curl -s localhost/nginx_status
Active connections: 13419
server accepts handled requests
 18528703 18528703 93039607
Reading: 0 Writing: 6893 Waiting: 6519
bash-5.1$ netstat -na | grep ESTABLISHED | grep :443 | wc -l
7699

What you expected to happen:

Expected the active connection count reported by nginx and the netstat output to be roughly the same.

How to reproduce it:

Anything else we need to know:

We use persistent connections (~18k) for SSE, and under some circumstances the number of connections reported by nginx jumps by roughly that same amount (~18k), although no such increase is visible in the netstat output. A restart of the deployment solves the problem, after which the metrics line up with what netstat reports.
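For completeness, the restart workaround is just a rolling restart of the controller deployment. A minimal sketch, assuming the deployment is named ingress-nginx-machines-controller in the ingress-nginx namespace (inferred from the pod name above; adjust to your release):

# deployment name inferred from the pod name above -- adjust to your release
kubectl -n ingress-nginx rollout restart deployment/ingress-nginx-machines-controller
# wait for the replacement pods to become ready
kubectl -n ingress-nginx rollout status deployment/ingress-nginx-machines-controller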

nrobert13 avatar Apr 06 '22 13:04 nrobert13

@nrobert13: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and providing further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 06 '22 13:04 k8s-ci-robot

Hi @nrobert13 ,

If I use this doc https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/ on a kind cluster, will I be able to reproduce this problem ?

/remove-kind bug

longwuyuan avatar Apr 07 '22 04:04 longwuyuan

/kind support

longwuyuan avatar Apr 07 '22 05:04 longwuyuan

@longwuyuan, thanks for the reply. You don't need to set up the Prometheus/Grafana stack for this. If you shell into the nginx controller pod you can pull the metrics directly; see my snippet in the "What happened" section.
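For example, a minimal sketch of pulling the connection metric without Prometheus/Grafana, using kubectl exec against a controller pod (the pod name is a placeholder; the label is the default one set by the Helm chart, adjust if you changed it):

# find a controller pod
kubectl -n ingress-nginx get pods -l app.kubernetes.io/component=controller
# scrape the built-in metrics endpoint inside the pod (port 10254, as above)
kubectl -n ingress-nginx exec <controller-pod> -- curl -s localhost:10254/metrics | grep nginx_process_connections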

nrobert13 avatar Apr 07 '22 10:04 nrobert13

Hi @nrobert13 ,

I think it's caused by resource utilisation or locks/delays in your environment. I am unable to reproduce the problem:

[screenshot omitted]

longwuyuan avatar Apr 07 '22 17:04 longwuyuan

@longwuyuan thanks for looking into this. I suspect the behaviour is related to the persistent connections (see my ingress resource snippet). The clients open keep-alive connections to nginx, and nginx keeps them alive towards the upstream (backend) with nginx.ingress.kubernetes.io/proxy-read-timeout: "14400". At the time the connection count jumps, we see a large number of RESETs in the upstream (backend) service, which makes me think these reset connections are not subtracted from the nginx metrics and the newly created connections are simply added on top.
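A minimal sketch of how the two counts could be logged side by side inside the controller pod, to catch the moment they diverge and correlate it with the RESET bursts (same port and metric name as in the snippets above; this is just an ad-hoc loop, not anything the controller provides):

# inside the controller pod: log the active-connections metric and the socket count once a minute
while true; do
  metric=$(curl -s localhost:10254/metrics \
    | grep 'nginx_process_connections' | grep 'state="active"' | awk '{print $2}')
  sockets=$(netstat -na | grep ESTABLISHED | grep -c ':443')
  echo "$(date +%FT%T) metric=$metric established=$sockets"
  sleep 60
done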

nrobert13 avatar Apr 08 '22 05:04 nrobert13

I agree. What you describe (keepalives) and several other use cases have probably not been instrumented into the metrics. It looks like a deep dive will be required, and appropriate instrumentation for handling such custom configurations will need to be developed.

The ingress-nginx project does not have enough resources to do this kind of development right now. Would you be interested in submitting a PR for this? I am not a developer, so it is hard for me to dive deep into this.

longwuyuan avatar Apr 08 '22 05:04 longwuyuan

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 07 '22 06:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 06 '22 06:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Sep 05 '22 06:09 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 05 '22 06:09 k8s-ci-robot