application-gateway-kubernetes-ingress
HTTPS listener gets deleted after agic restart
Describe the bug
An existing HTTPS setup gets deleted when AGIC is restarted after a short downtime. The ingress definition was valid and working before the restart; afterwards, the HTTPS listener configuration is removed from the AppGW. Deleting and redeploying the ingress object then recreates the full setup, including HTTPS. This bug occurred with v1.5.2 after a Kubernetes upgrade to v1.23.5 and a migration to the networking.k8s.io/v1 Ingress resource version. It takes our service completely offline, because existing URLs pointing to HTTPS are no longer answered by the AppGW.
To Reproduce
Steps to reproduce the behavior:
- Create the TLS secret in the service namespace
- Create the ingress resource using the definition below
ingress definition:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    appgw.ingress.kubernetes.io/cookie-based-affinity: "true"
    appgw.ingress.kubernetes.io/health-probe-path: /foo
    appgw.ingress.kubernetes.io/request-timeout: "300"
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
  generation: 1
  labels:
    app: appgw-ingress
  name: appgw-ingress
  namespace: xxx-dev
spec:
  ingressClassName: azure-application-gateway
  rules:
    - host: xxx.dev.apps.xxx.com
      http:
        paths:
          - backend:
              service:
                name: service2
                port:
                  number: 80
            path: /foo/bar/*
            pathType: ImplementationSpecific
          - backend:
              service:
                name: app
                port:
                  number: 80
            path: /*
            pathType: ImplementationSpecific
  tls:
    - secretName: xxx.apps.xxx.com
- Scale the AGIC deployment to 0 and wait one minute (kubectl commands for these scale steps are sketched after this list)
- Scale the AGIC deployment back to 1 => AGIC deletes the HTTPS listener configuration
- Delete the ingress object
- Create the ingress object again => the correct configuration gets created again
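A rough sketch of those scale steps as kubectl commands; the deployment name is inferred from the ReplicaSet shown in the pod details below and may differ per installation:

kubectl -n kube-system scale deployment app-gw-ingress-controller-ingress-azure --replicas=0
sleep 60
kubectl -n kube-system scale deployment app-gw-ingress-controller-ingress-azure --replicas=1
# afterwards, the HTTPS listener for the host is missing from the AppGW configuration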
Ingress Controller details
- Output of kubectl describe pod <ingress controller>:
Name:         app-gw-ingress-controller-ingress-azure-f8b974cd7-96xq6
Namespace:    kube-system
Priority:     0
Node:         aks-application-15814207-vmss000002/10.250.200.25
Start Time:   Tue, 31 May 2022 13:42:11 +0200
Labels:       app=ingress-azure
              pod-template-hash=f8b974cd7
              release=app-gw-ingress-controller
Annotations:  checksum/config: 68b1f1769142045fe18ea93367c9e93c96adb69db9db8917fc4a0a80aa31fa6c
              cni.projectcalico.org/containerID: bb60f4fd5327e557fddf2a78f01b9649a0dd714dcdca17522ff7768463c8e91b
              cni.projectcalico.org/podIP: 10.245.5.24/32
              cni.projectcalico.org/podIPs: 10.245.5.24/32
              kubectl.kubernetes.io/restartedAt: 2022-05-31T13:40:52+02:00
              prometheus.io/port: 8123
              prometheus.io/scrape: true
Status:       Running
IP:           10.245.5.24
IPs:
  IP:  10.245.5.24
Controlled By:  ReplicaSet/app-gw-ingress-controller-ingress-azure-f8b974cd7
Containers:
  ingress-azure:
    Container ID:   containerd://3856914334ad02c5fea8522f9b1d337271617360875392f02c4faaa741f0121d
    Image:          mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.5.2
    Image ID:       mcr.microsoft.com/azure-application-gateway/kubernetes-ingress@sha256:69a8f8ea51e71e67041323668ca3b250f4316147d8872c26e6bd12d032b2fa06
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 31 May 2022 13:42:12 +0200
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:8123/health/alive delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:      http-get http://:8123/health/ready delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      app-gw-ingress-controller-cm-ingress-azure  ConfigMap  Optional: false
    Environment:
      AZURE_CLOUD_PROVIDER_LOCATION:  /etc/appgw/azure.json
      AGIC_POD_NAME:                  app-gw-ingress-controller-ingress-azure-f8b974cd7-96xq6 (v1:metadata.name)
      AGIC_POD_NAMESPACE:             kube-system (v1:metadata.namespace)
      AZURE_AUTH_LOCATION:            /etc/Azure/Networking-AppGW/auth/armAuth.json
      KUBERNETES_PORT_443_TCP_ADDR:   xxxx-dev-xxxx.hcp.westeurope.azmk8s.io
      KUBERNETES_PORT:                tcp://xxxx-dev-xxxx.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:        tcp://xxxx-dev-xxxx.hcp.westeurope.azmk8s.io:443
      KUBERNETES_SERVICE_HOST:        xxxx-dev-xxxx.hcp.westeurope.azmk8s.io
    Mounts:
      /etc/Azure/Networking-AppGW/auth from networking-appgw-k8s-azure-service-principal-mount (ro)
      /etc/appgw/ from azure (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fmzz8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  azure:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/
    HostPathType:  Directory
  networking-appgw-k8s-azure-service-principal-mount:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  networking-appgw-k8s-azure-service-principal
    Optional:    false
  kube-api-access-fmzz8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  48m   default-scheduler  Successfully assigned kube-system/app-gw-ingress-controller-ingress-azure-f8b974cd7-96xq6 to aks-application-15814207-vmss000002
  Normal  Pulling    48m   kubelet            Pulling image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.5.2"
  Normal  Pulled     48m   kubelet            Successfully pulled image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.5.2" in 197.58012ms
  Normal  Created    48m   kubelet            Created container ingress-azure
  Normal  Started    48m   kubelet            Started container ingress-azure
- Any Azure support tickets associated with this issue: 2205310050001841
Removing the appgw.ingress.kubernetes.io/ssl-redirect: "true" annotation seems to have reduced the likelihood of this error significantly.
After removing it, I was able to trigger the issue only once by killing the AGIC pod, and even that may have been an artifact of the recent change.
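For reference, one way to drop that annotation in place, assuming the ingress name and namespace from the definition above; the trailing dash tells kubectl to remove the annotation:

kubectl -n xxx-dev annotate ingress appgw-ingress appgw.ingress.kubernetes.io/ssl-redirect-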
It seems the culprit is setting spec.ingressClassName: azure-application-gateway.
If I use the annotation kubernetes.io/ingress.class: azure/application-gateway instead, I can't reproduce the behavior.
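A minimal sketch of that switch, assuming the ingress name and namespace from the definition above: remove spec.ingressClassName with a JSON patch, then set the class annotation instead:

kubectl -n xxx-dev patch ingress appgw-ingress --type=json \
  -p='[{"op": "remove", "path": "/spec/ingressClassName"}]'
kubectl -n xxx-dev annotate ingress appgw-ingress \
  kubernetes.io/ingress.class=azure/application-gateway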
I was able to repro this issue by following the steps above. It happens when using ingressClassName and not when using the annotation.
I think the bug is related to how AGIC filters Ingress resources using the IngressClass resource. When the controller starts, it warms up the informer caches, i.e. it waits to receive the Ingress, IngressClass, Secret, and other resources. When an Ingress is received, AGIC processes it to find the Secrets to listen for. For filtering the ingresses, AGIC uses the IngressClass from the informer cache instead of getting it from the cluster. So, when the IngressClass is not yet synced and an Ingress resource is received, AGIC doesn't process the associated Secrets and removes the listener.
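As a side note on that cache-versus-cluster distinction: fetching the IngressClass directly from the cluster is what a plain kubectl lookup does, and (assuming the default AGIC class name used in the ingress spec above) it should report the AGIC controller value:

kubectl get ingressclass azure-application-gateway -o jsonpath='{.spec.controller}'
# expected output, per AGIC's documentation: azure/application-gateway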
I am preparing a fix for this.