application-gateway-kubernetes-ingress
Health probe misconfiguration.
Describe the bug: We deploy identical sites, and around half of them fall back to defaultprobe-Http instead of their own custom health probes.
To Reproduce: Create multiple sites (identical except for site-name) with annotations like the following (a rendered example follows the list):
"appgw.ingress.kubernetes.io/health-probe-port" = "${var.service_port}"
"appgw.ingress.kubernetes.io/health-probe-hostname" = "${var.sitename}.${var.environment}.<redacted>.com"
"appgw.ingress.kubernetes.io/health-probe-path" = "/api/if/GetSystemInfo"
"appgw.ingress.kubernetes.io/health-probe-status-codes" = "200-399"
Ingress Controller details
Name: ingress-appgw-deployment-f9cc497bd-8n84t
Namespace: kube-system
Priority: 0
Node: aks-dev002sys-15100766-vmss000000/10.6.8.4
Start Time: Tue, 20 Sep 2022 13:37:36 +0200
Labels: app=ingress-appgw
kubernetes.azure.com/managedby=aks
pod-template-hash=f9cc497bd
Annotations: checksum/config: 42fe2e26b1ca8ff7fe6ec8f0d4697d44d670cd0f7745ead73529ffe5730d3bcf
cluster-autoscaler.kubernetes.io/safe-to-evict: true
kubernetes.azure.com/metrics-scrape: true
prometheus.io/path: /metrics
prometheus.io/port: 8123
prometheus.io/scrape: true
resource-id:
<redacted>
Status: Running
IP: 10.6.8.68
IPs:
IP: 10.6.8.68
Controlled By: ReplicaSet/ingress-appgw-deployment-f9cc497bd
Containers:
ingress-appgw-container:
Container ID: containerd://83c417a73884e9acce56eb17541de27a2bd80a4a467256bb5258cec1018c8bf1
Image: mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.5.2
Image ID: sha256:5fcab52d0c1da1185d50520ba2703723684331c436c11f990753901c9ce4ce14
Port: <none>
Host Port: <none>
State: Running
Started: Tue, 20 Sep 2022 13:37:37 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 700m
memory: 300Mi
Requests:
cpu: 100m
memory: 20Mi
Liveness: http-get http://:8123/health/alive delay=15s timeout=1s period=20s #success=1 #failure=3
Readiness: http-get http://:8123/health/ready delay=5s timeout=1s period=10s #success=1 #failure=3
Environment Variables from:
ingress-appgw-cm ConfigMap Optional: false
Environment:
AGIC_POD_NAME: ingress-appgw-deployment-f9cc497bd-8n84t (v1:metadata.name)
AGIC_POD_NAMESPACE: kube-system (v1:metadata.namespace)
KUBERNETES_PORT_443_TCP_ADDR: <redacted>
KUBERNETES_PORT: <redacted>
KUBERNETES_PORT_443_TCP: <redacted>
KUBERNETES_SERVICE_HOST: <redacted>
AZURE_CLOUD_PROVIDER_LOCATION: /etc/kubernetes/azure.json
Mounts:
/etc/kubernetes/azure.json from cloud-provider-config (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zw4dm (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
cloud-provider-config:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/azure.json
HostPathType: File
kube-api-access-zw4dm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
:NoSchedule op=Exists
CriticalAddonsOnly op=Exists
Events: <none>
Log showing that sites are falling back to the default HTTP probe:

Portal showing some sites correctly connected to their custom probe, while others end up on defaultprobe-Http.

Needless to say, this is causing sporadic issues for us.
Has no one experienced something similar? Currently about half of our sites are not using their custom probe, while the rest are. The sites are configured by code, so they are 100% identical except for the hostname and a few environment parameters that should be irrelevant here. Refreshing the probe list shows sporadic movement between the custom probe and defaultprobe-Http, which makes the Kubernetes ingress completely unreliable. The logs in kube-system show ongoing sync operations but nothing obvious error-wise.
I am having a similar situation with the configuration behaving oddly. When I deleted the appgw.ingress.kubernetes.io/health-probe-port annotation, the change was not removed on the Azure side; AGIC continued to reconcile with the previously deleted value.
We are getting similarly weird behavior.
We set up the following annotations:
appgw.ingress.kubernetes.io/health-probe-hostname: "example.domain.com"
appgw.ingress.kubernetes.io/health-probe-port: "80"
appgw.ingress.kubernetes.io/health-probe-path: "/"
appgw.ingress.kubernetes.io/backend-protocol: http
AGIC creates the health probe, but the backend still ends up wired to the default health probe.
Why does it do this? Does the livenessProbe on the pod take precedence over these annotations?
This creates two problems:
- The health check fails and the backends are blocked, leading to 502 Bad Gateway responses.
- The custom error page is also not shown (not configurable at the moment), giving a really bad customer experience.
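For context, my understanding from the AGIC docs is that the health-probe-* annotations should take precedence, and the pod's readinessProbe/livenessProbe is only used as a fallback when they are absent (with defaultprobe-Http as the last resort). Below is a minimal sketch of the kind of pod probe AGIC can fall back to, with illustrative names and image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-site            # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-site
  template:
    metadata:
      labels:
        app: example-site
    spec:
      containers:
        - name: web
          image: nginx:1.25     # placeholder image
          ports:
            - containerPort: 80
          readinessProbe:       # the path/port here can end up mirrored into the App Gateway probe
            httpGet:
              path: /
              port: 80
            periodSeconds: 10

So if the probes in the portal don't match the annotations, it may be worth comparing them against the pod probes to see which source AGIC actually picked up.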
I'm going to be honest: I gave up on using this operator and had to set it up manually in the Azure console. Before giving up, this was my annotation config:
appgw.ingress.kubernetes.io/health-probe-path: "/"
appgw.ingress.kubernetes.io/health-probe-port: "80"
appgw.ingress.kubernetes.io/appgw-ssl-certificate: "origin"
appgw.ingress.kubernetes.io/backend-protocol: "http"
@raypettersen did you find a solution?
Yes. The problem occurs when you use a default backend in your deployment: the health check then goes to the appgw default backend.
Remove the default backend from your deployment and use the explicitly defined backends instead.
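If I read this right, here is a minimal sketch of the two shapes (the name example-site and host example.domain.com are illustrative placeholders): the first relies on spec.defaultBackend, which is what can end up probed by the appgw default health check; the second replaces it with an explicit host/path rule.

# Shape that caused the problem: default backend only
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-site
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
spec:
  defaultBackend:
    service:
      name: example-site
      port:
        number: 80
---
# Shape that worked: explicit host/path rule, no default backend
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-site
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/health-probe-path: "/"
spec:
  rules:
    - host: example.domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-site
                port:
                  number: 80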