
nginx_ingress_controller_orphan_ingress accumulates very many series over time

Open horihel opened this issue 2 years ago • 19 comments

What happened:

[screenshot: Prometheus memory usage growing steadily over time]

We've observed Prometheus gradually use more and more memory over time. After some inspection, we found that nginx_ingress_controller_orphan_ingress constantly exports a really large number of label sets, even for namespaces that haven't existed for quite a while.

This cluster might be a bit of a special case, as it constantly creates and destroys namespaces with 10-20 ingresses each to run tests.

It's easy to see how this adds up, and the number of series does not go down (unless the nginx pods are killed):

[screenshot: series count for nginx_ingress_controller_orphan_ingress only ever increasing]
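
To see the problem on a running controller, you can count the exported series directly. A minimal check, assuming access to the controller's metrics port (10254, per the pod spec below) and using one of our pod names:

```shell
# Forward the controller's metrics port (10254 in the pod spec below)
kubectl -n kube-system port-forward pod/rke2-ingress-nginx-controller-tkrnv 10254:10254 &

# Count how many nginx_ingress_controller_orphan_ingress series are exported;
# on this cluster the number only grows as test namespaces come and go.
curl -s http://localhost:10254/metrics | grep -c '^nginx_ingress_controller_orphan_ingress'
```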

What you expected to happen:

If I understand correctly, labels are usually kept on /metrics once exported, but in this case (and maybe others) it might be worth considering no longer exporting the metric once the ingress has been deleted.
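
As a stopgap, the only thing that seems to clear the stale series is the restart mentioned above. A hedged sketch, using the DaemonSet name from the pod description below:

```shell
# Workaround only: restarting the controller pods drops the accumulated
# series (as noted above, the number resets when the nginx pods are killed);
# they immediately start accumulating again afterwards.
kubectl -n kube-system rollout restart daemonset/rke2-ingress-nginx-controller
```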

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): This is rke2-ingress-nginx as shipped with rke2 v1.25.11+rke2r1.

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       nginx-1.6.4-hardened4
  Build:         git-90e1717ce
  Repository:    https://github.com/rancher/ingress-nginx.git
  nginx version: nginx/1.21.4

-------------------------------------------------------------------------------

I'm not sure whether this version is a fork or just vendored by Rancher, but glancing at the code, orphans don't appear to be removed in current mainline 1.8.1 either. I haven't tested that yet, though (sorry).

Kubernetes version (use kubectl version): v1.25.11+rke2r1

Environment: rke2 managed by rancher on vSphere

  • Cloud provider or hardware configuration: vSphere

  • OS (e.g. from /etc/os-release): Ubuntu 22.04

  • Kernel (e.g. uname -a): Linux rke2-ingress-nginx-controller-f7c4b 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: rancher v2.7.4, rke2 all defaults, ServiceMonitors enabled via helmChartConfig.

    • Please mention how/where was the cluster created like kubeadm/kops/minikube/kind etc.
  • Basic cluster related info:

    • kubectl version:
      Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"windows/amd64"}
      Kustomize Version: v5.0.1
      Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.11+rke2r1", GitCommit:"8cfcba0b15c343a8dc48567a74c29ec4844e0b9e", GitTreeState:"clean", BuildDate:"2023-06-14T21:31:34Z", GoVersion:"go1.19.10 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

  • kubectl get nodes -o wide:

    NAME                                           STATUS   ROLES                       AGE   VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
     k8s-development-mgmt-a532ef00-447zr            Ready    control-plane,etcd,master   12d   v1.25.11+rke2r1   10.240.180.85   10.240.180.85   Ubuntu 22.04.1 LTS   5.15.0-76-generic   containerd://1.7.1-k3s1
     k8s-development-mgmt-a532ef00-n9rqx            Ready    control-plane,etcd,master   12d   v1.25.11+rke2r1   10.240.180.84   10.240.180.84   Ubuntu 22.04.1 LTS   5.15.0-76-generic   containerd://1.7.1-k3s1
     k8s-development-mgmt-a532ef00-wpxb5            Ready    control-plane,etcd,master   12d   v1.25.11+rke2r1   10.240.180.83   10.240.180.83   Ubuntu 22.04.1 LTS   5.15.0-76-generic   containerd://1.7.1-k3s1
     k8s-development-workers-6c24g-a8ecb429-kdphr   Ready    worker                      38d   v1.25.11+rke2r1   10.240.180.77   10.240.180.77   Ubuntu 22.04.2 LTS   5.15.0-76-generic   containerd://1.7.1-k3s1
     k8s-development-workers-6c24g-a8ecb429-kfmlp   Ready    worker                      95d   v1.25.11+rke2r1   10.240.180.69   10.240.180.69   Ubuntu 22.04.1 LTS   5.15.0-76-generic   containerd://1.7.1-k3s1
     k8s-development-workers-6c24g-a8ecb429-mh9hx   Ready    worker                      25d   v1.25.11+rke2r1   10.240.180.81   10.240.180.81   Ubuntu 22.04.1 LTS   5.15.0-76-generic   containerd://1.7.1-k3s1
     k8s-development-workers-6c24g-a8ecb429-t54ww   Ready    worker                      32d   v1.25.11+rke2r1   10.240.180.78   10.240.180.78   Ubuntu 22.04.1 LTS   5.15.0-76-generic   containerd://1.7.1-k3s1
     k8s-development-workers-6c24g-a8ecb429-w5xc5   Ready    worker                      23d   v1.25.11+rke2r1   10.240.180.82   10.240.180.82   Ubuntu 22.04.1 LTS   5.15.0-76-generic   containerd://1.7.1-k3s1
     k8s-development-workers-6c24g-a8ecb429-zhm6p   Ready    worker                      95d   v1.25.11+rke2r1   10.240.180.70   10.240.180.70   Ubuntu 22.04.1 LTS   5.15.0-76-generic   containerd://1.7.1-k3s1
    
  • How was the ingress-nginx-controller installed:

    • If helm was used then please show output of helm ls -A | grep -i ingress
      rke2-ingress-nginx    kube-system    12    2023-07-24 08:42:36.043793919 +0000 UTC    deployed    rke2-ingress-nginx-4.5.201    1.6.4
      
    • If helm was used then please show output of helm -n <ingresscontrollernamepspace> get values <helmreleasename>
controller:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
global:
  clusterCIDR: 10.42.0.0/16
  clusterCIDRv4: 10.42.0.0/16
  clusterDNS: 10.43.0.10
  clusterDomain: cluster.local
  rke2DataDir: /var/lib/rancher/rke2
  serviceCIDR: 10.43.0.0/16
  • Current State of the controller:
    • kubectl describe ingressclasses
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=rke2-ingress-nginx
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=rke2-ingress-nginx
              app.kubernetes.io/part-of=rke2-ingress-nginx
              app.kubernetes.io/version=1.6.4
              helm.sh/chart=rke2-ingress-nginx-4.5.201
Annotations:  meta.helm.sh/release-name: rke2-ingress-nginx
              meta.helm.sh/release-namespace: kube-system
Controller:   k8s.io/ingress-nginx
Events:       <none>
  • kubectl -n <ingresscontrollernamespace> get all -A -o wide
    • this is a bit large, as this command will list all cluster resources (-A)...
  • kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
    • rke2 installs ingress-nginx as a DaemonSet, so this produces too much output; here's a single pod:
Name:             rke2-ingress-nginx-controller-tkrnv
Namespace:        kube-system
Priority:         0
Service Account:  rke2-ingress-nginx
Node:             k8s-development-workers-6c24g-a8ecb429-zhm6p/10.240.180.70
Start Time:       Mon, 08 May 2023 14:56:51 +0200
Labels:           app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=rke2-ingress-nginx
                  app.kubernetes.io/name=rke2-ingress-nginx
                  controller-revision-hash=6844f6f4b8
                  pod-template-generation=5
Annotations:      cni.projectcalico.org/containerID: 2ba3ae616e3360a86663711c9a643fe810c02d5ebb92278eea5f146be969974b
                  cni.projectcalico.org/podIP: 10.42.166.28/32
                  cni.projectcalico.org/podIPs: 10.42.166.28/32
Status:           Running
IP:               10.42.166.28
IPs:
  IP:           10.42.166.28
Controlled By:  DaemonSet/rke2-ingress-nginx-controller
Containers:
  rke2-ingress-nginx-controller:
    Container ID:  containerd://90ae21ac8a1c87a5387f45fde869bf498af033157df3b3f28c507767ec5cc38b
    Image:         rancher/nginx-ingress-controller:nginx-1.6.4-hardened4
    Image ID:      docker.io/rancher/nginx-ingress-controller@sha256:7804101a5cb8de407b1192e42ea0d6153ac2a71eb1765f63ca4af60a1dbe46f3
    Ports:         80/TCP, 443/TCP, 10254/TCP, 8443/TCP
    Host Ports:    80/TCP, 443/TCP, 0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --election-id=rke2-ingress-nginx-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/rke2-ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
      --watch-ingress-without-class=true
    State:          Running
      Started:      Fri, 30 Jun 2023 10:31:24 +0200
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Sat, 17 Jun 2023 09:25:31 +0200
      Finished:     Fri, 30 Jun 2023 10:30:32 +0200
    Ready:          True
    Restart Count:  5
    Requests:
      cpu:      100m
      memory:   90Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       rke2-ingress-nginx-controller-tkrnv (v1:metadata.name)
      POD_NAMESPACE:  kube-system (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lhqr4 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rke2-ingress-nginx-admission
    Optional:    false
  kube-api-access-lhqr4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type    Reason  Age                   From                      Message
  ----    ------  ----                  ----                      -------
  Normal  RELOAD  16m (x3612 over 24d)  nginx-ingress-controller  NGINX reload triggered due to a change in configuration
  • kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>

  • Current state of ingress object, if applicable:

    • kubectl -n <appnnamespace> get all,ing -o wide
    • kubectl -n <appnamespace> describe ing <ingressname>
    • If applicable, then, your complete and exact curl/grpcurl command (redacted if required) and the response to the curl/grpcurl command with the -v flag
  • Others:

    • Any other related information like:
      • copy/paste of the snippet (if applicable)
      • kubectl describe ... of any custom configmap(s) created and in use
      • Any other related information that may help

How to reproduce this issue:

1. Create a namespace with a few ingresses (it doesn't matter whether they are orphaned or not).
2. Delete the namespace.
3. Observe that the metrics for those ingresses stay in /metrics, and the orphaned status stays in there too (a scripted version of these steps follows below).
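
The names below (namespace orphan-test, ingress orphan-demo, the host, and the backend service) are made up for illustration, and the final curl assumes the metrics port-forward from earlier is still running:

```shell
# Create a throwaway namespace with an ingress (the backend service
# doesn't need to exist for the series to appear)
kubectl create namespace orphan-test
kubectl -n orphan-test create ingress orphan-demo \
  --class=nginx --rule="orphan.example.com/*=missing-svc:80"

# Delete the namespace again
kubectl delete namespace orphan-test

# The series for the deleted namespace are still exported by the controller
curl -s http://localhost:10254/metrics | grep 'orphan-test'
```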

Anything else we need to know:

horihel · Jul 24 '23 09:07