ingress-nginx
Status is reporting the wrong number of connections?
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
bash-5.1$ /nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v1.1.0
Build: cacbee86b6ccc45bde8ffc184521bed3022e7dee
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.9
-------------------------------------------------------------------------------
Kubernetes version (use kubectl version):
kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.15-gke.1000", GitCommit:"d71f5620130949cf5f74de04e6ae8f3a96e4b718", GitTreeState:"clean", BuildDate:"2022-02-02T09:21:18Z", GoVersion:"go1.15.15b5", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Cloud provider or hardware configuration: GCP
- OS (e.g. from /etc/os-release): COS
- Kernel (e.g. uname -a): Linux ingress-nginx-machines-controller-7f54d7c564-zv5fp 5.4.144+ #1 SMP Sat Sep 25 09:56:01 PDT 2021 x86_64 Linux
- How was the ingress-nginx-controller installed:
# helm ls -A | grep ingress
ingress-nginx-1         ingress-nginx   12   2022-02-02 12:01:24.840632114 +0100 CET   deployed   ingress-nginx-4.0.11   1.1.0
ingress-nginx-2         ingress-nginx   2    2022-02-02 11:59:48.330927655 +0100 CET   deployed   ingress-nginx-4.0.11   1.1.0
ingress-nginx-machines  ingress-nginx   9    2022-03-29 11:34:23.273927 +0200 CEST     deployed   ingress-nginx-4.0.11   1.1.0
# helm -n ingress-nginx get values ingress-nginx-machines
USER-SUPPLIED VALUES:
controller:
admissionWebhooks:
enabled: false
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-nodepool
operator: In
values:
- gke-pool-2-prod-2
autoscaling:
enabled: true
maxReplicas: 18
minReplicas: 2
targetCPUUtilizationPercentage: 50
targetMemoryUtilizationPercentage: 70
config:
http2-max-requests: "1000000000"
log-format-escape-json: "true"
log-format-upstream: '{"body_bytes_sent": $body_bytes_sent, "http_referer": "$http_referer",
"proxy_upstream_name": "$proxy_upstream_name", "upstream_addr": "$upstream_addr",
"upstream_response_length": $upstream_response_length, "upstream_response_time":
$upstream_response_time, "upstream_status": $upstream_status, "upstream_connect_time":
$upstream_connect_time, "upstream_header_time": $upstream_header_time, "time_iso8601":
"$time_iso8601", "proxy_add_x_forwarded_for": "$proxy_add_x_forwarded_for",
"remote_user": "$remote_user", "bytes_sent": $bytes_sent, "request_time": $request_time,
"status": $status, "host": "$host", "request_length": $request_length, "http_referer":
"$http_referer", "http_user_agent": "$http_user_agent", "remote_addr": "$remote_addr",
"request_time": $request_time, "request": "$request"}'
proxy-next-upstream: "off"
proxy-next-upstream-tries: "1"
ssl-ciphers: ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-ECDSA-RC4-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-RSA-RC4-SHA
ssl-protocols: TLSv1 TLSv1.1 TLSv1.2 TLSv1.3
update-status: false
electionID: ingress-controller-leader-machines
extraArgs:
update-status: false
ingressClassResource:
controllerValue: k8s.io/ingress-nginx-machines
default: false
name: nginx-machines
metrics:
enabled: true
service:
annotations: {}
omitClusterIP: true
prometheus.io/port: "10254"
prometheus.io/scrape: "true"
servicePort: 9913
type: ClusterIP
serviceMonitor:
additionalLabels:
release: prometheus
enabled: true
metricRelabelings:
- action: drop
regex: nginx_ingress_controller_(response_size_bucket|bytes_sent_bucket|request_size_bucket|request_duration_seconds_bucket|response_duration_seconds_bucket)
sourceLabels:
- __name__
- regex: redirect-service
replacement: "false"
sourceLabels:
- exported_namespace
targetLabel: __tmp_keep_me
- regex: nginx_ingress_controller_requests
replacement: "true"
sourceLabels:
- __name__
targetLabel: __tmp_keep_me
- action: drop
regex: "false"
sourceLabels:
- __tmp_keep_me
- sourceLabels:
- exported_namespace
targetLabel: namespace
minAvailable: 1
minReadySeconds: 5
replicaCount: 2
resources:
limits:
cpu: 1000m
memory: 1Gi
requests:
cpu: 1000m
memory: 1Gi
service:
externalTrafficPolicy: Local
loadBalancerIP: 35.189.221.132
updateStrategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate
watchIngressWithoutClass: false
defaultBackend:
enabled: true
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
fullnameOverride: ingress-nginx-machines
nameOverride: ingress-nginx-machines
tcp:
"8443": ingress-nginx/ingress-nginx-machines-controller:443
- Current state of ingress object, if applicable:
  kubectl -n <appnamespace> get all,ing -o wide
  kubectl -n <appnamespace> describe ing <ingressname>
- If applicable, then, your complete and exact curl/grpcurl command (redacted if required) and the response to the curl/grpcurl command with the -v flag:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
annotations:
cert-manager.io/cluster-issuer: shortchain-letsencrypt
meta.helm.sh/release-name: device-config-prod
meta.helm.sh/release-namespace: device-config-prod
nginx.ingress.kubernetes.io/proxy-read-timeout: "14400"
What happened:
The Prometheus metrics report more connections than netstat actually shows. In this case the discrepancy is roughly 45%: netstat counts ~7700 established connections while nginx reports ~13400 active ones.
bash-5.1$ curl -s localhost:10254/metrics | grep -E "^nginx_ingress_controller_nginx_process_connections.*active"
nginx_ingress_controller_nginx_process_connections{controller_class="k8s.io/ingress-nginx-machines",controller_namespace="ingress-nginx",controller_pod="ingress-nginx-machines-controller-7f54d7c564-zv5fp",state="active"} 13406
bash-5.1$ curl -s localhost/nginx_status
Active connections: 13419
server accepts handled requests
18528703 18528703 93039607
Reading: 0 Writing: 6893 Waiting: 6519
bash-5.1$ netstat -na | grep ESTABLISHED | grep :443 | wc -l
7699
What you expected to happen:
Expected the active-connections metric and netstat to report roughly the same numbers.
How to reproduce it:
Anything else we need to know:
We use persistent connections (~18k) for SSE, and under some circumstances the number of connections reported by nginx is bumped by roughly the same amount (~18k), even though no such increase is visible in the netstat output. Restarting the deployment fixes it, and the metrics then line up with what netstat reports.
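A minimal sketch of how the two counters can be watched side by side from inside the controller pod, assuming the same metrics port (10254) and :443 listener as in the snippets above (the sampling interval is arbitrary):
# sample both counters once a minute; a sudden divergence can then be correlated with backend events
while true; do
  metric=$(curl -s localhost:10254/metrics \
    | awk '/^nginx_ingress_controller_nginx_process_connections.*state="active"/ {print $NF}')
  sockets=$(netstat -na | grep ESTABLISHED | grep -c :443)
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) metric_active=${metric} netstat_established=${sockets}"
  sleep 60
done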
@nrobert13: This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @nrobert13,
If I use this doc https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/ on a kind cluster, will I be able to reproduce this problem?
/remove-kind bug
/kind support
@longwuyuan, thanks for the reply. You don't need to set up the Prometheus/Grafana stack for this: if you shell into the nginx controller pod you can pull the metrics directly; see my snippet in the "What happened" section, or the example below.
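For instance, something like this from outside the pod should be enough (the pod name is just the one from this report; adjust it to your own deployment):
# read the connection metrics straight from the controller's metrics endpoint
kubectl -n ingress-nginx exec ingress-nginx-machines-controller-7f54d7c564-zv5fp -- \
  curl -s localhost:10254/metrics | grep nginx_process_connections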
Hi @nrobert13,
I think it's caused by resource utilisation or locks/delays in your environment. I am unable to reproduce the problem.
@longwuyuan thanks for looking into this. I suspect the behaviour is related to the persistent connections (see my ingress resource snippet). The clients open keep-alive connections to nginx, and nginx keeps them alive towards the upstream (backend) thanks to nginx.ingress.kubernetes.io/proxy-read-timeout: "14400". At the time the connection count bumps, we see a large number of resets (RSTs) at the upstream (backend) service, which makes me think these resets are not counted by the nginx metrics and the newly created connections are simply added on top.
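A possible way to sanity-check this (a sketch, assuming netstat/net-tools is available in the backend container): watch the kernel's TCP reset counters on a backend pod and see whether they jump at the same time the nginx metric bumps.
# kernel-wide TCP counters; "connection resets received" / "resets sent" growing sharply
# around the time of the metric bump would support the RST theory
netstat -s | grep -i reset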
I agree. What you describe (keepalives) and multiple other use-cases have probably not been instrumented into the metrics. It looks like a deep dive will be required, and appropriate instrumentation for handling such custom configs will need to be developed.
The ingress-nginx project does not have enough resources to do this kind of development now. Would you be interested in submitting a PR on this? I am not a developer, so it is hard for me to deep-dive into this.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.