
Ingress controller memory keeps increasing when a backend reload is triggered

pdefreitas opened this issue on Mar 21 '22 · 7 comments

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.1.2
Kubernetes version (use kubectl version): 1.21.9, 1.22.6

Environment:

  • Cloud provider or hardware configuration: Azure Kubernetes Service (AKS)

  • OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS (Bionic Beaver)

  • Kernel (e.g. uname -a): Linux 5.4.0-1070-azure #73~18.04.1-Ubuntu SMP Wed Feb 9 15:36:45 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: Azure managed

  • Basic cluster related info: Versions mentioned above + cluster autoscaler.

  • How was the ingress-nginx-controller installed:

    • If helm was used then please show output of helm ls -A | grep -i ingress
helm ls -A | grep -i ingress
nginx-ingress-z                         x           26              2022-03-07 00:00:00.000000000 +0000 UTC         deployed        ingress-nginx-4.0.18                    1.1.2

nginx-ingress-y                         y           7               2022-03-17 00:00:00.000000000 +0000 UTC         deployed        ingress-nginx-4.0.18                    1.1.2

nginx-ingress-x                         x           26              2022-03-07 00:00:00.000000000 +0000 UTC         deployed        ingress-nginx-4.0.18                    1.1.2
  • If helm was used then please show output of helm -n <ingresscontrollernamespace> get values <helmreleasename>

nginx-ingress-x

USER-SUPPLIED VALUES:
controller:
  admissionWebhooks:
    timeoutSeconds: 30
  config:
    enable-modsecurity: true
    hsts: true
    proxy-body-size: 50m
    ssl-protocols: TLSv1.2 TLSv1.3
    ssl-session-cache: false
  electionID: nginx-custom-x
  ingressClass: nginx-custom-x
  ingressClassByName: true
  ingressClassResource:
    controllerValue: k8s.io/nginx-custom-x
    name: nginx-custom-x
  metrics:
    enabled: true
    service:
      annotations:
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
  podAnnotations:
    prometheus.io/port: "10254"
    prometheus.io/scrape: "true"
  publishService:
    enabled: true
  rbac:
    create: true
  resources:
    limits:
      memory: 1200Mi
    requests:
      cpu: 100m
      memory: 1000Mi
  scope:
    enabled: true
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-resource-group: xxx
    externalTrafficPolicy: Local
    loadBalancerIP: x.x.x.x
  startupProbe:
    failureThreshold: 5
    httpGet:
      path: /healthz
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 2

nginx-ingress-z

USER-SUPPLIED VALUES:
controller:
  admissionWebhooks:
    timeoutSeconds: 30
  config:
    enable-modsecurity: true
    enable-real-ip: "true"
    hsts: true
    proxy-body-size: 50m
    ssl-protocols: TLSv1.2 TLSv1.3
    ssl-session-cache: false
    use-proxy-protocol: "true"
  electionID: nginx-custom-z
  ingressClass: nginx-custom-z
  ingressClassByName: true
  ingressClassResource:
    controllerValue: k8s.io/nginx-custom-z
    name: nginx-custom-z
  metrics:
    enabled: true
    service:
      annotations:
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
  podAnnotations:
    prometheus.io/port: "10254"
    prometheus.io/scrape: "true"
  publishService:
    enabled: true
  rbac:
    create: true
  resources:
    limits:
      memory: 800Mi
    requests:
      cpu: 100m
      memory: 500Mi
  scope:
    enabled: true
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: true
      service.beta.kubernetes.io/azure-load-balancer-resource-group: xxx
    loadBalancerIP: x.x.x.x
  startupProbe:
    failureThreshold: 5
    httpGet:
      path: /healthz
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 2
  • if you have more than one instance of the ingress-nginx-controller installed in the same cluster, please provide details for all the instances

    • Ingress controller nginx-ingress-y, in namespace y, is not leaking memory.
  • Current State of the controller:

    • All ingress controllers work properly until they get killed. nginx-ingress-y, which is alone in its own namespace, doesn't have any issue; it has a configuration similar to nginx-ingress-x. nginx-ingress-z eventually runs out of memory (less frequently, because it doesn't have as many ingress rules). nginx-ingress-x is the most problematic.
  • Current state of ingress object, if applicable:

    • Ingress changes are properly applied to both controllers. Sometimes the ingress controllers exceed their memory limits (OOM) and behave similarly to #8325 and #7086; a quick way to confirm the OOM kills is sketched right after this list.
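
For reference, a minimal way to confirm the OOM kills and watch memory growth (illustrative only; the pod name is a placeholder and the label selector assumes the chart's default labels):

# list controller pods and their restart counts
kubectl -n x get pods -l app.kubernetes.io/name=ingress-nginx

# check why the previous container instance died; expect "Reason: OOMKilled"
kubectl -n x describe pod <controller-pod> | grep -A 3 "Last State"

# watch current memory usage (requires metrics-server)
kubectl -n x top pod <controller-pod> --containers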

What happened:

Ingress controllers nginx-ingress-z and nginx-ingress-x are leaking memory over time. We noticed that memory increases whenever backend reload operations happen.

What you expected to happen:

I would expect memory to be released after each backend reload and to stay roughly constant between reloads. Issues #8166, #8336 and #8357 exhibit similar behavior in a similar setup.

How to reproduce it:

  • Install two ingress controllers in the same namespace with the user-supplied values from above.
  • Add multiple ingress rules to each ingress controller.
    • nginx-ingress-x has ~10 ingress resources with ModSecurity + OWASP ModSecurity Core Rule Set.
    • nginx-ingress-z has ~7 ingress resources.
  • Force the backend to reload; you will notice that memory increases on each reload, eventually causing OOM. A sketch for triggering reloads and watching memory follows this list.
  • Pods then get stuck in a CrashLoopBackOff due to #8325 and #7086. You end up having to scale the deployment to zero and back up again to launch a new pod.
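
As an illustration only (not the exact commands used here; the ingress name demo-app and namespace x are placeholders), a backend reload can be forced by changing a configuration-affecting annotation on any Ingress, and memory growth can be watched via kubectl or the controller's Prometheus metrics:

# toggling an nginx annotation changes the generated config and triggers a reload
kubectl -n x annotate ingress demo-app \
  nginx.ingress.kubernetes.io/proxy-body-size=60m --overwrite

# watch resident memory grow reload after reload (requires metrics-server)
kubectl -n x top pod -l app.kubernetes.io/name=ingress-nginx --containers

# with metrics enabled, the same trend should show up in the
# nginx_ingress_controller_nginx_process_resident_memory_bytes gauge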

Anything else we need to know:

Ingress rules on nginx-ingress-x have ModSecurity + OWASP ModSecurity Core Rule Set annotations. nginx-ingress-z handles internal traffic (virtual network level) and uses the proxy protocol. This setup was working fine, without any kind of memory increase, prior to 0.48.x. We had to upgrade to 1.x.x due to a Kubernetes upgrade plus security patches. The same issue happens without Prometheus metrics enabled (we enabled them for troubleshooting purposes).
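
For context, the per-Ingress ModSecurity annotations referred to above are the documented ingress-nginx ones; the ingress name demo-app is a placeholder, not one of the actual resources:

# enable ModSecurity with the OWASP Core Rule Set on a single Ingress
kubectl -n x annotate ingress demo-app \
  nginx.ingress.kubernetes.io/enable-modsecurity="true" \
  nginx.ingress.kubernetes.io/enable-owasp-core-rules="true" \
  --overwrite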

pdefreitas · Mar 21 '22

@pdefreitas: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Mar 21 '22

/remove-kind bug
/kind feature

Install each instance of the ingress-nginx controller in its own namespace. It's documented. The issues you have listed are not the same problem when compared in all related aspects. When higher-priority issues are resolved, the developers will get time to work on namespace-related functionality. For now, install each instance of the ingress-nginx controller in its own dedicated namespace and do not install another instance of the controller in the same namespace (see the sketch below).
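
For illustration only (release and namespace names are placeholders, and the chart repo is assumed to be added locally as ingress-nginx), a second controller in its own dedicated namespace with a distinct class and election ID looks roughly like this:

helm upgrade --install nginx-internal ingress-nginx/ingress-nginx \
  --namespace ingress-internal --create-namespace \
  --set controller.ingressClassResource.name=nginx-internal \
  --set controller.ingressClassResource.controllerValue=k8s.io/nginx-internal \
  --set controller.electionID=nginx-internal \
  --set controller.ingressClassByName=true \
  --set controller.scope.enabled=true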

longwuyuan · Mar 21 '22

@longwuyuan thanks for the prompt reply but there are multiple problems to address:

  • There are multiple tickets open and comments reporting an unusual amount of memory being consumed by the latest ingress controller releases. Wouldn't it be worth understanding what causes it?
  • Regarding the comment about namespace-related functionality: do you believe it would cause high memory usage? Because with the configuration above, the setup works perfectly. The current documentation does not mention such a limitation:
    • https://kubernetes.github.io/ingress-nginx/user-guide/multiple-ingress/
    • https://kubernetes.github.io/ingress-nginx/#how-to-easily-install-multiple-instances-of-the-ingress-nginx-controller-in-the-same-cluster (does not mention it as a hard requirement).
  • PRs #8325 and/or #7086 have been stuck awaiting release for a long time. In our setup we're able to reproduce this bug when the controller container gets OOM'd.

pdefreitas · Mar 21 '22

From my limited visibility, I can state that:

  • multiple distinct problems are likely being experienced by one user, but not by a large set of real users in production
  • "memory allocated and then not released" is a very precise short description of a problem, but no user has provided a step-by-step procedure that someone else can copy/paste and reproduce. Some of the generic descriptions of memory usage spiralling out of control are invalid (for example, an infinite for loop in bash creating ingress objects at the speed of a multicore server-class CPU; a sketch of what that looks like follows this list)
  • There is a shortage of developers, so once triage is completed it will result in a usable definition of the problem and a reproducible sequence of steps that anyone can use to recreate the problem on their kind/minikube cluster. If the triage results in a relatively clear action item, the developers can set priority accordingly. It seems unfair to ask anyone to repeat the tasks of gathering data to reproduce a problem
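
For clarity, the kind of synthetic loop dismissed above as an invalid reproduction might look something like this (illustrative only; demo-$i, demo-svc and example.com are made-up names):

# create ingress objects as fast as the API server will accept them
i=0
while true; do
  i=$((i+1))
  kubectl create ingress demo-$i --class=nginx \
    --rule="demo-$i.example.com/*=demo-svc:80"
done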

longwuyuan · Mar 21 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jun 19 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Jul 19 '22

Having exactly the same issue with a very similar config.

Jojoooo1 · Aug 08 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot · Sep 07 '22

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Sep 07 '22