application-gateway-kubernetes-ingress

Health probe misconfiguration.

raypettersen opened this issue 3 years ago • 6 comments

Describe the bug
The sites are identical, yet around half of them fall back to defaultprobe-Http instead of their own health probes.

To Reproduce
Create multiple sites (identical except for the site name) with annotations along these lines:

    "appgw.ingress.kubernetes.io/health-probe-port"               = "${var.service_port}"
    "appgw.ingress.kubernetes.io/health-probe-hostname"     = "${var.sitename}.${var.environment}.<redacted>.com"
    "appgw.ingress.kubernetes.io/health-probe-path"               = "/api/if/GetSystemInfo"
    "appgw.ingress.kubernetes.io/health-probe-status-codes" = "200-399"

Ingress Controller details

Name:         ingress-appgw-deployment-f9cc497bd-8n84t
Namespace:    kube-system
Priority:     0
Node:         aks-dev002sys-15100766-vmss000000/10.6.8.4
Start Time:   Tue, 20 Sep 2022 13:37:36 +0200
Labels:       app=ingress-appgw
              kubernetes.azure.com/managedby=aks
              pod-template-hash=f9cc497bd
Annotations:  checksum/config: 42fe2e26b1ca8ff7fe6ec8f0d4697d44d670cd0f7745ead73529ffe5730d3bcf
              cluster-autoscaler.kubernetes.io/safe-to-evict: true
              kubernetes.azure.com/metrics-scrape: true
              prometheus.io/path: /metrics
              prometheus.io/port: 8123
              prometheus.io/scrape: true
              resource-id:
                <redacted>
Status:       Running
IP:           10.6.8.68
IPs:
  IP:           10.6.8.68
Controlled By:  ReplicaSet/ingress-appgw-deployment-f9cc497bd
Containers:
  ingress-appgw-container:
    Container ID:   containerd://83c417a73884e9acce56eb17541de27a2bd80a4a467256bb5258cec1018c8bf1
    Image:          mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.5.2
    Image ID:       sha256:5fcab52d0c1da1185d50520ba2703723684331c436c11f990753901c9ce4ce14
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 20 Sep 2022 13:37:37 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     700m
      memory:  300Mi
    Requests:
      cpu:      100m
      memory:   20Mi
    Liveness:   http-get http://:8123/health/alive delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8123/health/ready delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      ingress-appgw-cm  ConfigMap  Optional: false
    Environment:
      AGIC_POD_NAME:                  ingress-appgw-deployment-f9cc497bd-8n84t (v1:metadata.name)
      AGIC_POD_NAMESPACE:             kube-system (v1:metadata.namespace)
      KUBERNETES_PORT_443_TCP_ADDR:   <redacted>
      KUBERNETES_PORT:                <redacted>
      KUBERNETES_PORT_443_TCP:        <redacted>
      KUBERNETES_SERVICE_HOST:        <redacted>
      AZURE_CLOUD_PROVIDER_LOCATION:  /etc/kubernetes/azure.json
    Mounts:
      /etc/kubernetes/azure.json from cloud-provider-config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zw4dm (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  cloud-provider-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/azure.json
    HostPathType:  File
  kube-api-access-zw4dm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 :NoExecute op=Exists
                             :NoSchedule op=Exists
                             CriticalAddonsOnly op=Exists
Events:                      <none>

Log showing that sites are falling back on the default HTTP probe: (screenshot omitted)

Portal showing some sites connected properly to their own probe, while others end up on the default HTTP probe: (screenshot omitted)

Needless to say, this is creating sporadic issues for us.

raypettersen avatar Nov 21 '22 13:11 raypettersen

Has no one experienced something similar? Currently about half of our sites are not using their custom probe, while the rest are. Sites are configured by code, so they are 100% identical except for the hostname and a few env parameters that should be irrelevant here. Refreshing the probe list shows sites moving sporadically between their custom probe and defaultprobe-Http. This makes the Kubernetes ingress completely unreliable. Logs in kube-system show ongoing sync operations, but nothing obviously wrong error-wise.

raypettersen avatar Nov 23 '22 18:11 raypettersen

I am running into a similar situation with configuration behaving oddly. Whenever I deleted the appgw.ingress.kubernetes.io/health-probe-port annotation, the change was not removed on the Azure side. It continued to reconcile with the previously deleted value.

LockedThread avatar Dec 30 '22 20:12 LockedThread

We are getting weirdly similar behavior.

We set up the following annotations:

    appgw.ingress.kubernetes.io/health-probe-hostname: "example.domain.com"
    appgw.ingress.kubernetes.io/health-probe-port: "80"
    appgw.ingress.kubernetes.io/health-probe-path: "/"
    appgw.ingress.kubernetes.io/backend-protocol: http

It creates the health probe, but the backend is connected to the default health probe instead.

Why does it do this? Does the livenessProbe on the pod take precedence over these annotations?
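
(For reference, this is roughly what our pod's readinessProbe looks like, in case AGIC is falling back to it; container name, port, and path are placeholders:)

    # Hypothetical excerpt from the Deployment's pod spec
    containers:
      - name: site
        ports:
          - containerPort: 80
        readinessProbe:
          httpGet:
            path: /
            port: 80
          periodSeconds: 10

As far as I understand from the AGIC docs, the health-probe annotations should take precedence over this probe whenever they are actually applied.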

This creates two problems:

  • The health check fails and backends are blocked, leading to 502 Bad Gateway errors.
  • The custom error page is also not shown (not configurable at the moment), giving a really bad customer experience.

joelharkes avatar Jun 09 '23 10:06 joelharkes

I'm going to be honest: I gave up on using this operator and had to set things up manually in the Azure portal instead. Before giving up, this was my annotation config:

    appgw.ingress.kubernetes.io/health-probe-path: "/"
    appgw.ingress.kubernetes.io/health-probe-port: "80"
    appgw.ingress.kubernetes.io/appgw-ssl-certificate: "origin"
    appgw.ingress.kubernetes.io/backend-protocol: "http"

LockedThread avatar Jun 10 '23 19:06 LockedThread

@raypettersen did you find a solution?

joelharkes avatar Jun 12 '23 07:06 joelharkes

Yes. The problem occurs when you use a default backend in your deployment. The result is a health check going to the appgw default backend.

Remove the default backend from your deployment and use explicitly defined backends instead.
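
For anyone else hitting this, a minimal sketch of what that looks like (the probe annotations are the same as above; host, service name, and port are placeholders), with explicit rules and no spec.defaultBackend:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: site-a
      annotations:
        kubernetes.io/ingress.class: azure/application-gateway
        appgw.ingress.kubernetes.io/health-probe-path: "/api/if/GetSystemInfo"
        appgw.ingress.kubernetes.io/health-probe-status-codes: "200-399"
    spec:
      # No spec.defaultBackend: every request is routed to an explicitly defined
      # backend, so the generated health probe targets that backend rather than
      # falling back to defaultprobe-Http.
      rules:
        - host: site-a.dev.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: site-a
                    port:
                      number: 8080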

raypettersen avatar Jun 12 '23 07:06 raypettersen