
A rollout restart or scale down of haproxy causes 503 connection timeout errors

Open pnmatich opened this issue 3 years ago • 11 comments

I'm using helm to run HAProxy ingress with autoscaling enabled (chart version 1.21.1). Whenever an HAProxy pod terminates (because of a scale-down event or a rollout restart), I start seeing 503 backend connection timeout errors for a few seconds.

(screenshot: haproxy-termination-errors)

I tried adding the following example config for graceful shutdown, but that did not resolve the issue:

## Example preStop for graceful shutdown
lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "kill -USR1 $(pidof haproxy); while killall -0 haproxy; do sleep 1; done"]

pnmatich avatar Jun 16 '22 20:06 pnmatich

Here are some logs from an HAProxy pod that is being scaled down:

Date,Message
"2022-06-17T00:41:48.795Z","[s6-init] making user provided files available at /var/run/s6/etc...exited 0."
"2022-06-17T00:41:48.795Z","[s6-init] ensuring user provided files have correct perms...exited 0."
"2022-06-17T00:41:48.795Z","[fix-attrs.d] applying ownership & permissions fixes..."
"2022-06-17T00:41:48.795Z","[fix-attrs.d] done."
"2022-06-17T00:41:48.795Z","[cont-init.d] executing container initialization scripts..."
"2022-06-17T00:41:48.795Z","[cont-init.d] 01-aux-cfg: executing..."
"2022-06-17T00:41:48.795Z","[cont-init.d] 01-aux-cfg: exited 0."
"2022-06-17T00:41:48.795Z","[cont-init.d] done."
"2022-06-17T00:41:48.795Z","[services.d] starting services"
"2022-06-17T00:41:48.795Z","[services.d] done."
"2022-06-17T00:41:48.795Z","[WARNING] (212) : config : missing timeouts for frontend 'https'."
"2022-06-17T00:41:48.795Z","| While not properly invalid
"2022-06-17T00:41:48.795Z","| with such a configuration. To fix this
"2022-06-17T00:41:48.795Z","| timeouts are set to a non-zero value: 'client'
"2022-06-17T00:41:48.795Z","[WARNING] (212) : config : missing timeouts for frontend 'http'."
"2022-06-17T00:41:48.795Z","| While not properly invalid
"2022-06-17T00:41:48.795Z","| with such a configuration. To fix this
"2022-06-17T00:41:48.795Z","| timeouts are set to a non-zero value: 'client'
"2022-06-17T00:41:48.795Z","[WARNING] (212) : config : missing timeouts for frontend 'healthz'."
"2022-06-17T00:41:48.795Z","| While not properly invalid
"2022-06-17T00:41:48.795Z","| with such a configuration. To fix this
"2022-06-17T00:41:48.795Z","| timeouts are set to a non-zero value: 'client'
"2022-06-17T00:41:48.795Z","[WARNING] (212) : config : missing timeouts for frontend 'stats'."
"2022-06-17T00:41:48.795Z","| While not properly invalid
"2022-06-17T00:41:48.795Z","| with such a configuration. To fix this
"2022-06-17T00:41:48.795Z","| timeouts are set to a non-zero value: 'client'
"2022-06-17T00:41:48.795Z","[WARNING] (212) : Removing incomplete section 'peers localinstance' (no peer named 'haproxy-kubernetes-ingress-79987ccbf5-qs4zv')."
"2022-06-17T00:41:48.795Z","2022/06/17 00:41:41"
"2022-06-17T00:41:48.795Z","_ _ _ ____"
"2022-06-17T00:41:48.795Z","| | | | / \ | _ \ _ __ _____ ___ _"
"2022-06-17T00:41:48.795Z","| |_| | / _ \ | |_) | '__/ _ \ \/ / | | |"
"2022-06-17T00:41:48.795Z","| _ |/ ___ \| __/| | | (_) > <| |_| |"
"2022-06-17T00:41:48.796Z","|_| |_/_/ \_\_| |_| \___/_/\_\\__
"2022-06-17T00:41:48.796Z","_ __ _ |___/ ___ ____"
"2022-06-17T00:41:48.796Z","| |/ / _| |__ ___ _ __ _ __ ___| |_ ___ ___ |_ _/ ___|"
"2022-06-17T00:41:48.796Z","| ' / | | | '_ \ / _ \ '__| '_ \ / _ \ __/ _ \/ __| | | |"
"2022-06-17T00:41:48.796Z","| . \ |_| | |_) | __/ | | | | | __/ || __/\__ \ | | |___"
"2022-06-17T00:41:48.796Z","|_|\_\__
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 HAProxy Ingress Controller v1.7.9 6462c78"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Build from: https://github.com/haproxytech/kubernetes-ingress"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Build date: 2022-04-12T09:39:37"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 ConfigMap: haproxy/haproxy-kubernetes-ingress"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Ingress class: haproxy"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Empty Ingress class: false"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Publish service:"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Default backend service: haproxy/haproxy-kubernetes-ingress-default-backend"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Default ssl certificate: haproxy/eu-west-1-honeydew-epcloudops-com-tls"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Frontend HTTP listening on: 0.0.0.0:80"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Frontend HTTPS listening on: 0.0.0.0:443"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Controller sync period: 5s"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 Running on haproxy-kubernetes-ingress-79987ccbf5-qs4zv"
"2022-06-17T00:41:48.796Z","[NOTICE] (212) : New worker #1 (241) forked"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 haproxy.go:36 Running with HAProxy version 2.4.15-7782e23 2022/03/14 - https://haproxy.org/"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 haproxy.go:50 Starting HAProxy with /etc/haproxy/haproxy.cfg"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 controller.go:116 Running on Kubernetes version: v1.21.12-eks-a64ea69 linux/amd64"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 INFO crmanager.go:75 Global CR defined in API core.haproxy.org"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 INFO crmanager.go:75 Defaults CR defined in API core.haproxy.org"
"2022-06-17T00:41:48.796Z","2022/06/17 00:41:41 INFO crmanager.go:75 Backend CR defined in API core.haproxy.org"
"2022-06-17T00:41:50.797Z","2022/06/17 00:41:49 INFO monitor.go:260 Auxiliary HAProxy config '/etc/haproxy/haproxy-aux.cfg' updated"
"2022-06-17T00:41:51.797Z","[WARNING] (212) : Exiting Master process..."
"2022-06-17T00:41:51.797Z","2022/06/17 00:41:51 INFO controller.go:202 HAProxy restarted"
"2022-06-17T00:41:51.797Z","[NOTICE] (212) : haproxy version is 2.4.15-7782e23"
"2022-06-17T00:41:51.797Z","[ALERT] (212) : Current worker #1 (241) exited with code 143 (Terminated)"
"2022-06-17T00:41:51.798Z","[WARNING] (212) : All workers exited. Exiting... (0)"
"2022-06-17T00:41:51.798Z","[WARNING] (264) : config: Can't get version of the global server state file '/var/state/haproxy/global'."
"2022-06-17T00:41:52.798Z","[NOTICE] (264) : New worker #1 (267) forked"

pnmatich avatar Jun 17 '22 18:06 pnmatich

I have the same issue with the HAProxy ingress controller.

Falc0nreeper avatar Jun 21 '22 10:06 Falc0nreeper

Here's some more info on how I'm installing the HAProxy kubernetes-ingress chart. I'm wondering if there's something I could configure to allow pod termination without downtime.

chart: "kubernetes-ingress" chart version: "1.21.1" repository: https://haproxytech.github.io/helm-charts namespace: haproxy values:

# Copyright 2019 HAProxy Technologies LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Default values for kubernetes-ingress Chart for HAProxy Ingress Controller
## ref: https://github.com/haproxytech/kubernetes-ingress/tree/master/documentation

podSecurityPolicy:
  annotations: {}
  enabled: false

## Enable RBAC Authorization
## ref: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
rbac:
  create: true


## Configure Service Account
## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
serviceAccount:
  create: true
  name:


## Controller default values
controller:
  name: controller
  image:
    repository: haproxytech/kubernetes-ingress    # can be changed to use CE or EE Controller images
    tag: "{{ .Chart.AppVersion }}"
    pullPolicy: IfNotPresent

  ## Deployment or DaemonSet pod mode
  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
  kind: Deployment    # can be 'Deployment' or 'DaemonSet'
  replicaCount: null

  ## Running container without root privileges
  unprivileged: false

  ## Pod termination grace period
  terminationGracePeriodSeconds: 60

  ## Private Registry configuration
  imageCredentials:
    registry: null
    username: null
    password: null
  existingImagePullSecret: null

  ## Controller Container listener port configuration
  containerPort:
    http: 80
    https: 443
    stat: 1024

  ## Ingress Class used for ingress.class annotation in multi-ingress environments
  ingressClass: haproxy   # typically "haproxy" or null to receive all events

  ## Additional labels to add to the deployment or daemonset metadata
  # extraLabels: {}

  ## Additional labels to add to the pod container metadata
  # podLabels: {}

  ## Additional annotations to add to the pod container metadata
  podAnnotations:
    # Setting the source: haproxy configures Datadog to parse the logs
    # The exclude_success_calls regex prevents 2xx and 3xx traffic logs from being sent to Datadog
    ad.datadoghq.com/kubernetes-ingress-controller.logs: |-
      [{
        "source": "haproxy",
        "service": "ingress",
        "log_processing_rules": [{
          "type": "exclude_at_match",
          "name": "exclude_success_calls",
          "pattern" : "\\d+\\/\\d+\\/\\d+\\/\\d+\\/\\d+ [23]\\d\\d"
        }]
      }]
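    # For reference: the HAProxy traffic-log fragment this pattern keys on
    # looks roughly like "0/0/0/1/1 200" -- the five slash-separated request
    # timers followed by a 2xx/3xx status code.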

  ## Ingress TLS secret: if enabled and secret is null, the controller uses an auto-generated
  ## secret; otherwise secret must contain the name of a manually created Secret object
  defaultTLSSecret:
    enabled: true
    secretNamespace: "haproxy"
    secret: "haproxy-tls"

  ## Compute Resources for controller container
  resources:
    limits:
      cpu: 200m
      memory: 384Mi
    requests:
      cpu: 100m
      memory: 192Mi

  ## Horizontal Pod Scaler
  ## Only to be used with Deployment kind
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 50
    # targetMemoryUtilizationPercentage: 80

  ## Pod Disruption Budget
  ## Only to be used with Deployment kind
  PodDisruptionBudget:
    enable: true
    maxUnavailable: 50%
    # minAvailable: 1

  ## Pod Node assignment
  # nodeSelector: {}

  ## Node Taints and Tolerations for pod-node scheduling through attraction/repelling
  # tolerations: []

  ## Node Affinity for pod-node scheduling constraints
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - kubernetes-ingress
        topologyKey: kubernetes.io/hostname

  ## Topology spread constraints (only used in kind: Deployment)
  # topologySpreadConstraints: []

  ## Pod DNS Config
  # dnsConfig: {}

  ## Pod DNS Policy
  ## Change this to ClusterFirstWithHostNet in case you have useHostNetwork set to true
  dnsPolicy: ClusterFirst

  ## Additional command line arguments to pass to Controller
  # extraArgs: []

  ## Custom configuration for Controller
  config:
    rate-limit: "ON"
    ssl-redirect: "true"

  ## Controller Logging configuration
  logging:
    ## Controller logging level
    ## This is only relevant to Controller logs
    level: info

    ## HAProxy traffic logs
    traffic:
      address:  "stdout"
      format:   "raw"
      facility: "daemon"
      level:    "info"

  ## Mirrors the address of the service's endpoints to the
  ## load-balancer status of all Ingress objects it satisfies.
  publishService:
    enabled: false
    ##
    ## Override of the publish service
    ## Must be <namespace>/<service_name>
    pathOverride: ""

  ## Controller Service configuration
  ## ref: https://kubernetes.io/docs/concepts/services-networking/service/
  service:
    enabled: true     # set to false when controller.kind is 'DaemonSet' and controller.daemonset.useHostPorts is true
    type: LoadBalancer   # can be 'NodePort' or 'LoadBalancer'

    ## Service annotations
    ## ref: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"

    ## Service labels
    # labels: {}

    ## Health check node port
    healthCheckNodePort: 0

    ## Service nodePorts to use for http, https and stat
    ## ref: https://kubernetes.io/docs/concepts/services-networking/service/
    ## If empty, random ports will be used
    nodePorts: {}
    #  http: 31080
    #  https: 31443
    #  stat: 31024

    ## Service ports to use for http, https and stat
    ## ref: https://kubernetes.io/docs/concepts/services-networking/service/
    ports:
      http: 80
      https: 443
      stat: 1024

    ## The controller service ports for http, https and stat can be disabled by
    ## setting the values below to false - this can be useful when deploying
    ## haproxy purely as a TCP load balancer
    ## Note: At least one port (http, https, stat or from tcpPorts) has to be enabled
    enablePorts:
      http: true
      https: true
      stat: true

    ## Target port mappings for http, https and stat
    targetPorts:
      http: http
      https: https
      stat: stat

    ## Additional tcp ports to expose
    ## This is especially useful for TCP services:
    # tcpPorts: []

    ## Set external traffic policy
    ## Default is "Cluster", setting it to "Local" preserves source IP
    externalTrafficPolicy: "Local"

    ## Expose service via external IPs that route to one or more cluster nodes
    externalIPs: []

    ## LoadBalancer IP
    ## ref: https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer
    loadBalancerIP: ""

    ## Source IP ranges permitted to access Network Load Balancer
    # ref: https://kubernetes.io/docs/tasks/access-application-cluster/configure-cloud-provider-firewall/
    loadBalancerSourceRanges: [1.2.3.4/32]

    ## Service ClusterIP
    # clusterIP: ""

    ## Service session affinity
    # sessionAffinity: ""

  ## Controller DaemonSet configuration
  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
  daemonset:
    useHostNetwork: false  # also modify dnsPolicy accordingly
    useHostPort: false
    hostPorts:
      http: 80
      https: 443
      stat: 1024

  ## Controller deployment strategy definition
  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
  strategy: {}
  #  rollingUpdate:
  #    maxSurge: 25%
  #    maxUnavailable: 25%
  #  type: RollingUpdate

  ## Controller Pod PriorityClass
  ## ref: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass
  priorityClassName: ""

  ## Controller container lifecycle handlers
  # lifecycle: {}
    ## Example preStop for graceful shutdown
    # preStop:
    #   exec:
    #     command: ["/bin/sh", "-c", "kill -USR1 $(pidof haproxy); while killall -0 haproxy; do sleep 1; done"]

  ## Set additional environment variables
  # extraEnvs: []

  ## Add additional containers
  # extraContainers: []

  ## Additional volumeMounts to the controller main container
  # extraVolumeMounts: []

  ## Additional volumes to the controller pod
  # extraVolumes: []

  ## ServiceMonitor
  ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/getting-started.md
  serviceMonitor:
    ## Toggle the ServiceMonitor, true if you have Prometheus Operator installed and configured
    enabled: false

    ## Specify the labels to add to the ServiceMonitors to be selected for target discovery
    extraLabels: {}

    ## Specify the endpoints
    ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/design.md#servicemonitor
    endpoints:
    - port: stat
      path: /metrics
      scheme: http

## Default 404 backend
defaultBackend:
  enabled: true
  name: default-backend
  replicaCount: 2

  image:
    repository: k8s.gcr.io/defaultbackend-amd64
    tag: 1.5
    pullPolicy: IfNotPresent
    runAsUser: 65534

  ## Compute Resources
  resources:
  #  limits:
  #    cpu: 10m
  #    memory: 16Mi
    requests:
      cpu: 10m
      memory: 16Mi

  ## Horizontal Pod Scaler
  ## Only to be used with Deployment kind
  autoscaling:
    enabled: false
    minReplicas: 1
    maxReplicas: 2
    targetCPUUtilizationPercentage: 80
    # targetMemoryUtilizationPercentage: 80

  ## Listener port configuration
  containerPort: 8080

  ## Pod Node assignment
  # nodeSelector: {}

  ## Node Taints and Tolerations for pod-node scheduling through attraction/repelling
  # tolerations: []

  ## Node Affinity for pod-node scheduling constraints
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - kubernetes-ingress
        topologyKey: kubernetes.io/hostname

  ## Topology spread constraints
  # topologySpreadConstraints: []

  ## Additional labels to add to the pod container metadata
  # podLabels: {}

  ## Additional annotations to add to the pod container metadata
  # podAnnotations: {}

  service:
    ## Service ports
    port: 8080

  ## Configure Service Account
  serviceAccount:
    create: true

  ## Pod PriorityClass
  priorityClassName: ""

  ## Set additional environment variables
  # extraEnvs: []
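
For reference, a values file like the one above is typically applied with something along these lines (a sketch; the release name and file path here are placeholders):

helm repo add haproxytech https://haproxytech.github.io/helm-charts
helm upgrade --install haproxy haproxytech/kubernetes-ingress \
  --namespace haproxy --version 1.21.1 -f values.yaml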

pnmatich avatar Jun 21 '22 21:06 pnmatich

@pnmatich

You could try this command. We are using it for our HAProxy (though not as an ingress controller):

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh","-c","sleep 10; kill -SIGUSR1 $(pidof haproxy)"]

Tested with Fortio; requests have a 100% success rate.

Make sure:

  • The sleep time is at least as long as your http-keep-alive timeout
  • terminationGracePeriodSeconds is >= sleep time + HAProxy's SIGUSR1 termination time (see the combined sketch below)
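
Putting both constraints together, a values snippet might look like this (a sketch only; the 10s sleep assumes your keep-alive timeout is at most 10s, and 60s is the chart's default grace period):

controller:
  ## Must cover the preStop sleep plus however long HAProxy needs to
  ## drain connections after SIGUSR1 (here: 10s sleep + up to ~50s drain).
  terminationGracePeriodSeconds: 60
  lifecycle:
    preStop:
      exec:
        ## Sleep first so the endpoint removal propagates and keep-alive
        ## connections go idle, then ask HAProxy for a soft stop.
        command: ["/bin/sh", "-c", "sleep 10; kill -SIGUSR1 $(pidof haproxy)"]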

edit: sorry, forgot "-c" argument in command. Fixed it...

dschuldt avatar Jun 24 '22 05:06 dschuldt

I tried this myself with our ingress controller setup; I experience the same issues.

Neither

command: ["/bin/sh","-c","sleep 10; kill -SIGUSR1 $(pidof haproxy)"]

nor

command: ["/bin/sh","-c","s6-svc -1 /var/run/s6/services/haproxy"]

works to enable restarts without connection drops.
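
For what it's worth, s6-svc -1 /var/run/s6/services/haproxy sends SIGUSR1 to the process supervised in that service directory, so the second variant is the same soft-stop request as the first, just routed through the s6 supervisor instead of signalling haproxy directly.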

dschuldt avatar Jun 24 '22 10:06 dschuldt

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 24 '22 12:07 stale[bot]

Please do not close this issue, stale.

dschuldt avatar Jul 26 '22 05:07 dschuldt

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 30 '22 21:08 stale[bot]

Bumping since it's still an issue and I'm experiencing it in my environment, too.

We're seeing the haproxy process exiting from a SIGKILL instead of gracefully terminating from a SIGTERM, as expected, during a rollout.

Best guess is that it's an issue related to processes not forwarding signals to child processes correctly, so /usr/local/sbin/haproxy never has a chance to react to the SIGTERM 🤷‍♂️

❯ kubectl exec -n haproxy deploy/haproxy-kubernetes-ingress -- ps -ef
PID   USER     TIME  COMMAND
    1 haproxy   0:00 s6-svscan -t0 /var/run/s6/services
   39 haproxy   0:00 s6-supervise s6-fdholderd
  208 haproxy   0:00 s6-supervise haproxy
  209 haproxy   0:00 s6-supervise ingress-controller
  212 haproxy   0:01 /haproxy-ingress-controller --with-s6-overlay --default-ss
  261 haproxy   0:00 /usr/local/sbin/haproxy -x /var/run/haproxy-runtime-api.so
  267 haproxy   0:05 /usr/local/sbin/haproxy -W -db -m 10364 -f /etc/haproxy/ha
  287 haproxy   0:00 ps -ef
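
One way to sanity-check that guess, if you can afford to sacrifice a pod (a rough sketch; it assumes the busybox kill present in these images and simply reproduces what the kubelet does at termination):

# terminal 1: watch the master's logs
kubectl logs -n haproxy deploy/haproxy-kubernetes-ingress -f

# terminal 2: send SIGTERM to PID 1 (s6-svscan), as the kubelet would
kubectl exec -n haproxy deploy/haproxy-kubernetes-ingress -- kill -TERM 1

# If signals are forwarded, the master should log "Exiting Master process..."
# and the worker should exit with code 143 (SIGTERM). If nothing happens
# until the pod is force-killed, the signal never reached haproxy.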

evandam avatar Aug 30 '22 22:08 evandam

Same thing here; it doesn't seem to be shutting down gracefully.

scalp42 avatar Aug 31 '22 19:08 scalp42

I can confirm. We have identified the culprit and the fix is in the queue, being reviewed.

dkorunic avatar Sep 15 '22 19:09 dkorunic

Hey @dkorunic just wanted to bump this one more time since we're eagerly awaiting this release. Do you have an ETA for when we could expect this? Thanks! 🙌

evandam avatar Oct 12 '22 19:10 evandam

@evandam Fix has been committed in https://github.com/haproxytech/kubernetes-ingress/commit/6afd804b0410154daf601fcf3ca5969623aeef89 and the release is incoming; I'll check the exact time frame in which we expect it to be published.

dkorunic avatar Oct 12 '22 21:10 dkorunic

@evandam It will happen in the next hour or so (it's already in progress), and as soon as the IC binary and IC image have been released, I'll update the Helm chart accordingly.

dkorunic avatar Oct 13 '22 07:10 dkorunic

Incoming in Helm Chart 1.23.2.
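
Once 1.23.2 is published, picking up the fix should just be a chart upgrade along these lines (the release name and namespace here are the ones used earlier in this thread):

helm repo update
helm upgrade haproxy haproxytech/kubernetes-ingress \
  --namespace haproxy --version 1.23.2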

dkorunic avatar Oct 13 '22 09:10 dkorunic

Thanks for pushing this one over the finish line @dkorunic!

evandam avatar Oct 13 '22 17:10 evandam