application-gateway-kubernetes-ingress

HTTPS listener gets deleted after agic restart

Open itd-fsc opened this issue 3 years ago • 3 comments

Describe the bug
An existing HTTPS setup gets deleted when AGIC is restarted with a short downtime. The ingress definition was valid and working before. After the restart, the HTTPS listener configuration is removed from the AppGW. Deleting and redeploying the ingress object then recreates the full setup, including HTTPS. This bug occurred with v1.5.2 after a Kubernetes upgrade to v1.23.5 and a migration to the Ingress v1 resource version. It takes our service completely offline, because existing URLs pointing to https are no longer answered by the AppGW.

To Reproduce
Steps to reproduce the behavior:

  1. Create tls secret in the service namespace
  2. Create ingress resource
ingress definition
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    appgw.ingress.kubernetes.io/cookie-based-affinity: "true"
    appgw.ingress.kubernetes.io/health-probe-path: /foo
    appgw.ingress.kubernetes.io/request-timeout: "300"
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
  generation: 1
  labels:
    app: appgw-ingress
  name: appgw-ingress
  namespace: xxx-dev
spec:
  ingressClassName: azure-application-gateway
  rules:
  - host: xxx.dev.apps.xxx.com
    http:
      paths:
      - backend:
          service:
            name: service2
            port:
              number: 80
        path: /foo/bar/*
        pathType: ImplementationSpecific
      - backend:
          service:
            name: app
            port:
              number: 80
        path: /*
        pathType: ImplementationSpecific
  tls:
  - secretName: xxx.apps.xxx.com
=> Correct AppGW configuration gets created, including the HTTPS listener
  3. Scale the agic deployment to 0 and wait one minute

  4. Scale the agic deployment back to 1 => agic deletes the HTTPS listener configuration

  5. Delete the ingress object

  6. Create ingress object again => Correct configuration gets created again

Ingress Controller details

  • Output of kubectl describe pod <ingress controller>
agic describe
Name:         app-gw-ingress-controller-ingress-azure-f8b974cd7-96xq6
Namespace:    kube-system
Priority:     0
Node:         aks-application-15814207-vmss000002/10.250.200.25
Start Time:   Tue, 31 May 2022 13:42:11 +0200
Labels:       app=ingress-azure
              pod-template-hash=f8b974cd7
              release=app-gw-ingress-controller
Annotations:  checksum/config: 68b1f1769142045fe18ea93367c9e93c96adb69db9db8917fc4a0a80aa31fa6c
              cni.projectcalico.org/containerID: bb60f4fd5327e557fddf2a78f01b9649a0dd714dcdca17522ff7768463c8e91b
              cni.projectcalico.org/podIP: 10.245.5.24/32
              cni.projectcalico.org/podIPs: 10.245.5.24/32
              kubectl.kubernetes.io/restartedAt: 2022-05-31T13:40:52+02:00
              prometheus.io/port: 8123
              prometheus.io/scrape: true
Status:       Running
IP:           10.245.5.24
IPs:
  IP:           10.245.5.24
Controlled By:  ReplicaSet/app-gw-ingress-controller-ingress-azure-f8b974cd7
Containers:
  ingress-azure:
    Container ID:   containerd://3856914334ad02c5fea8522f9b1d337271617360875392f02c4faaa741f0121d
    Image:          mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.5.2
    Image ID:       mcr.microsoft.com/azure-application-gateway/kubernetes-ingress@sha256:69a8f8ea51e71e67041323668ca3b250f4316147d8872c26e6bd12d032b2fa06
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 31 May 2022 13:42:12 +0200
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:8123/health/alive delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:      http-get http://:8123/health/ready delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      app-gw-ingress-controller-cm-ingress-azure  ConfigMap  Optional: false
    Environment:
      AZURE_CLOUD_PROVIDER_LOCATION:  /etc/appgw/azure.json
      AGIC_POD_NAME:                  app-gw-ingress-controller-ingress-azure-f8b974cd7-96xq6 (v1:metadata.name)
      AGIC_POD_NAMESPACE:             kube-system (v1:metadata.namespace)
      AZURE_AUTH_LOCATION:            /etc/Azure/Networking-AppGW/auth/armAuth.json
      KUBERNETES_PORT_443_TCP_ADDR:   xxxx-dev-xxxx.hcp.westeurope.azmk8s.io
      KUBERNETES_PORT:                tcp://xxxx-dev-xxxx.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:        tcp://xxxx-dev-xxxx.hcp.westeurope.azmk8s.io:443
      KUBERNETES_SERVICE_HOST:        xxxx-dev-xxxx.hcp.westeurope.azmk8s.io
    Mounts:
      /etc/Azure/Networking-AppGW/auth from networking-appgw-k8s-azure-service-principal-mount (ro)
      /etc/appgw/ from azure (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fmzz8 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  azure:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/
    HostPathType:  Directory
  networking-appgw-k8s-azure-service-principal-mount:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  networking-appgw-k8s-azure-service-principal
    Optional:    false
  kube-api-access-fmzz8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:

  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  48m   default-scheduler  Successfully assigned kube-system/app-gw-ingress-controller-ingress-azure-f8b974cd7-96xq6 to aks-application-15814207-vmss000002
  Normal  Pulling    48m   kubelet            Pulling image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.5.2"
  Normal  Pulled     48m   kubelet            Successfully pulled image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.5.2" in 197.58012ms
  Normal  Created    48m   kubelet            Created container ingress-azure
  Normal  Started    48m   kubelet            Started container ingress-azure


  • Output of kubectl logs <ingress controller>

agic-log.txt

  • Any Azure support tickets associated with this issue. 2205310050001841

itd-fsc, May 31 '22 12:05

Removing the appgw.ingress.kubernetes.io/ssl-redirect: "true" annotation seems to have reduced the likelihood of this error significantly. After removing it, I was only able to trigger this issue once by killing the agic and that might have been an artifact of the recent change.

itd-fsc, Jun 01 '22 09:06

It seems the culprit is having spec.ingressClassName: azure-application-gateway set.

If I use the annotation kubernetes.io/ingress.class: azure/application-gateway instead, I can't reproduce the behavior.
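
A minimal sketch of that workaround, assuming the rest of the ingress from the original report stays unchanged and that spec.ingressClassName is removed in favour of the annotation (the thread does not say whether the field was removed or kept alongside it):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: appgw-ingress
  namespace: xxx-dev
  annotations:
    # Class selection via the annotation instead of spec.ingressClassName
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  # ingressClassName omitted here (assumption, see the note above)
  rules:
  - host: xxx.dev.apps.xxx.com
    http:
      paths:
      - path: /*
        pathType: ImplementationSpecific
        backend:
          service:
            name: app
            port:
              number: 80
  tls:
  - secretName: xxx.apps.xxx.com
```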

mracfa, Jun 08 '22 14:06

I was able to repro this issue by following the above steps. It happens when using ingressClassName and not when using the annotation. I think the bug is related to how AGIC filters Ingress resources using the IngressClass resource. When the controller starts, it warms up the informer caches, i.e. it waits to receive ingresses, ingress classes, secrets, etc. When an ingress is received, AGIC processes it to find the secrets to listen for. To filter the ingresses, AGIC uses the IngressClass resource from the informer cache instead of getting it from the cluster. So, when the ingress class is not yet synced and an ingress resource is received, AGIC doesn't process the associated secrets and removes the listener.
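
For reference, the IngressClass object that this filtering depends on would look roughly like the sketch below. The name matches the ingressClassName used in the ingress from the report; the controller value is an assumption based on the AGIC Helm chart defaults and may differ in a given installation.

```yaml
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  # Must match spec.ingressClassName on the Ingress
  name: azure-application-gateway
spec:
  # Assumed controller identifier (AGIC Helm chart default)
  controller: azure/application-gateway
```

Until this object is present in the informer cache, an ingress that sets spec.ingressClassName: azure-application-gateway cannot be matched to AGIC, which is consistent with the behavior described above. The annotation-based selection does not need this lookup, which would explain why it is unaffected.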

I am preparing a fix for this.

akshaysngupta, Jun 21 '22 22:06