application-gateway-kubernetes-ingress
AGIC /ready endpoint hangs forever, possibly due to inability to get AppGW configuration
Describe the bug
The /health/ready endpoint hangs. This is similar to #1393, but we have tried running with versions 1.5.2 and 1.6.0-rc1. The cluster is running Kubernetes v1.22.6.
To Reproduce
- Follow the instructions as closely as possible in: https://learn.microsoft.com/en-us/azure/developer/terraform/create-k8s-cluster-with-aks-applicationgateway-ingress. Peering between AKS vnet and appgw vnet looks correct.
- The helm chart deployment waits for the AGIC pod to become ready.
- Use kubectl get pod to find the AGIC pod, and get its pod IP address from kubectl describe pod.
- While the helm chart is waiting for AGIC to become ready, use kubectl exec to enter another pod in the cluster and ensure curl is available.
- Try: curl http://the-agic-IP:8123/health/alive. This will return 200.
- Try: curl http://the-agic-IP:8123/health/ready. This will hang.
- Eventually helm install times out.
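The curl steps above can be sketched as a small helper that bounds each request with a timeout, so a hanging endpoint is reported rather than blocking the shell. This is only a sketch: AGIC_IP is a placeholder for the pod IP, the 5-second limit is an arbitrary choice, and the function names are made up for illustration. It relies on curl exiting with code 28 when --max-time expires.

```shell
# Map a curl exit code to a short verdict.
# 0 means an HTTP response arrived; 28 is curl's "operation timed out".
classify_curl_exit() {
  case "$1" in
    0)  echo "responded" ;;
    28) echo "hung (timed out)" ;;
    *)  echo "failed (curl exit $1)" ;;
  esac
}

# Request a URL with a hard 5-second limit and print the verdict.
probe() {
  probe_rc=0
  curl --silent --output /dev/null --max-time 5 "$1" || probe_rc=$?
  echo "$1: $(classify_curl_exit "$probe_rc")"
}

# Usage from inside another pod in the cluster (AGIC_IP is a placeholder):
#   AGIC_IP=10.244.0.16
#   probe "http://${AGIC_IP}:8123/health/alive"
#   probe "http://${AGIC_IP}:8123/health/ready"
```

With the behavior described in this issue, the first probe would be expected to report a response and the second a timeout.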
Ingress Controller details
- Output of `kubectl describe pod <ingress controller>`:
Name: agic-ingress-azure-7fb55bc4f4-t4q9w
Namespace: default
Priority: 0
Node: aks-nodepool1-63215908-vmss000000/10.224.0.4
Start Time: Thu, 22 Sep 2022 16:46:10 -0500
Labels: aadpodidbinding=agic-ingress-azure
app=ingress-azure
pod-template-hash=7fb55bc4f4
release=agic
Annotations: checksum/config: 4bfe86c3c4d66766093484c236259205b1a382152ab1e2eeb85d99a0fcaed1a5
prometheus.io/port: 8123
prometheus.io/scrape: true
Status: Running
IP: 10.244.0.16
IPs:
IP: 10.244.0.16
Controlled By: ReplicaSet/agic-ingress-azure-7fb55bc4f4
Containers:
ingress-azure:
Container ID: containerd://5b32f4613d7d22099bed6e170ffa5617280b36a29885e7a35bb0c440f9c85e31
Image: mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.6.0-rc1
Image ID: mcr.microsoft.com/azure-application-gateway/kubernetes-ingress@sha256:7f82f8392e805d09940bb806a45e9391dec39615ed808bc6fc13ac671d34a081
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 22 Sep 2022 16:46:14 -0500
Ready: False
Restart Count: 0
Liveness: http-get http://:8123/health/alive delay=15s timeout=1s period=20s #success=1 #failure=3
Readiness: http-get http://:8123/health/ready delay=5s timeout=1s period=10s #success=1 #failure=3
Environment Variables from:
agic-cm-ingress-azure ConfigMap Optional: false
Environment:
AZURE_CLOUD_PROVIDER_LOCATION: /etc/appgw/azure.json
AGIC_POD_NAME: agic-ingress-azure-7fb55bc4f4-t4q9w (v1:metadata.name)
AGIC_POD_NAMESPACE: default (v1:metadata.namespace)
Mounts:
/etc/appgw/ from azure (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4qftp (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
azure:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/
HostPathType: Directory
kube-api-access-4qftp:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m24s default-scheduler Successfully assigned default/agic-ingress-azure-7fb55bc4f4-t4q9w to aks-nodepool1-63215908-vmss000000
Normal Pulling 8m24s kubelet Pulling image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.6.0-rc1"
Normal Pulled 8m22s kubelet Successfully pulled image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.6.0-rc1" in 2.543302298s
Normal Created 8m22s kubelet Created container ingress-azure
Normal Started 8m21s kubelet Started container ingress-azure
Warning Unhealthy 3m24s (x33 over 8m14s) kubelet Readiness probe failed: Get "http://10.244.0.16:8123/health/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
- Output of `kubectl logs <ingress controller>`. The last line in the log suggests that AGIC can't talk to the AppGW but is not timing out the request.
I0922 22:09:49.353696 1 utils.go:114] Using verbosity level 10 from environment variable APPGW_VERBOSITY_LEVEL
I0922 22:09:49.355757 1 round_trippers.go:424] curl -k -v -XGET -H "User-Agent: appgw-ingress/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Accept: application/json, */*" -H "Authorization: Bearer <masked>" 'https://10.0.0.1:443/version?timeout=32s'
I0922 22:09:49.391264 1 round_trippers.go:444] GET https://10.0.0.1:443/version?timeout=32s 200 OK in 35 milliseconds
I0922 22:09:49.391289 1 round_trippers.go:450] Response Headers:
I0922 22:09:49.391295 1 round_trippers.go:453] Cache-Control: no-cache, private
I0922 22:09:49.391300 1 round_trippers.go:453] Content-Type: application/json
I0922 22:09:49.391304 1 round_trippers.go:453] X-Kubernetes-Pf-Flowschema-Uid: f385a50e-7937-4463-b47a-0cbe53e15cfa
I0922 22:09:49.391309 1 round_trippers.go:453] X-Kubernetes-Pf-Prioritylevel-Uid: d492e718-21e6-432d-8598-1863adb59c66
I0922 22:09:49.391314 1 round_trippers.go:453] Content-Length: 264
I0922 22:09:49.391318 1 round_trippers.go:453] Date: Thu, 22 Sep 2022 22:09:49 GMT
I0922 22:09:49.391323 1 round_trippers.go:453] Audit-Id: 460e3133-3788-4ea4-8798-46b9c8aafaf0
I0922 22:09:49.391374 1 request.go:1097] Response Body: {
"major": "1",
"minor": "22",
"gitVersion": "v1.22.6",
"gitCommit": "ece9ecf2f9aecbd86d3eba31f0be62e4b6353a5a",
"gitTreeState": "clean",
"buildDate": "2022-07-28T23:33:17Z",
"goVersion": "go1.16.12",
"compiler": "gc",
"platform": "linux/amd64"
}
I0922 22:09:49.391471 1 supported_apiversion.go:70] server version is: 1.22.6
I0922 22:09:49.392302 1 round_trippers.go:424] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: appgw-ingress/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <masked>" 'https://10.0.0.1:443/api/v1/namespaces/default/pods/agic-ingress-azure-6b9b8b789c-trcsc'
I0922 22:09:49.397333 1 round_trippers.go:444] GET https://10.0.0.1:443/api/v1/namespaces/default/pods/agic-ingress-azure-6b9b8b789c-trcsc 200 OK in 5 milliseconds
I0922 22:09:49.397352 1 round_trippers.go:450] Response Headers:
I0922 22:09:49.397357 1 round_trippers.go:453] Audit-Id: cc967ae5-2004-4939-afbb-fd89ff71582a
I0922 22:09:49.397362 1 round_trippers.go:453] Cache-Control: no-cache, private
I0922 22:09:49.397367 1 round_trippers.go:453] Content-Type: application/json
I0922 22:09:49.397372 1 round_trippers.go:453] X-Kubernetes-Pf-Flowschema-Uid: f385a50e-7937-4463-b47a-0cbe53e15cfa
I0922 22:09:49.397378 1 round_trippers.go:453] X-Kubernetes-Pf-Prioritylevel-Uid: d492e718-21e6-432d-8598-1863adb59c66
I0922 22:09:49.397382 1 round_trippers.go:453] Date: Thu, 22 Sep 2022 22:09:49 GMT
I0922 22:09:49.397536 1 request.go:1097] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"agic-ingress-azure-6b9b8b789c-trcsc","generateName":"agic-ingress-azure-6b9b8b789c-","namespace":"default","uid":"931847c7-d0a7-49da-a93a-37e475180852","resourceVersion":"2388403","creationTimestamp":"2022-09-22T22:09:48Z","labels":{"aadpodidbinding":"agic-ingress-azure","app":"ingress-azure","pod-template-hash":"6b9b8b789c","release":"agic"},"annotations":{"checksum/config":"57e63b2b8011ff094073545be68ff72743ec19f6f516f4ed67ec9cedffad7de2","prometheus.io/port":"8123","prometheus.io/scrape":"true"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"agic-ingress-azure-6b9b8b789c","uid":"3ec0891e-0ef0-4309-8b7b-df073a209702","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"v1","time":"2022-09-22T22:09:48Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:checksum/config":{},"f:prometheus.io/port":{},"f:prometheus.io/scrape":{}},"f:generateName":{},"f:labels":{".":{},"f:aadpodidbinding":{},"f:app":{},"f:pod-template-hash":{},"f:release":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"3ec0891e-0ef0-4309-8b7b-df073a209702\"}":{}}},"f:spec":{"f:containers":{"k:{\"name\":\"ingress-azure\"}":{".":{},"f:env":{".":{},"k:{\"name\":\"AGIC_POD_NAME\"}":{".":{},"f:name":{},"f:valueFrom":{".":{},"f:fieldRef":{}}},"k:{\"name\":\"AGIC_POD_NAMESPACE\"}":{".":{},"f:name":{},"f:valueFrom":{".":{},"f:fieldRef":{}}},"k:{\"name\":\"AZURE_CLOUD_PROVIDER_LOCATION\"}":{".":{},"f:name":{},"f:value":{}}},"f:envFrom":{},"f:image":{},"f:imagePullPolicy":{},"f:livenessProbe":{".":{},"f:failureThreshold":{},"f:httpGet":{".":{},"f:path":{},"f:port":{},"f:scheme":{}},"f:initialDelaySeconds":{},"f:periodSeconds":{},"f:successThreshold":{},"f:timeoutSeconds":{}},"f:name":{},"f:readinessProbe":{".":{},"f:failureThreshold":{},"f:httpGet":{".":{},"f:path":{},"f:port":{
},"f:scheme":{}},"f:initialDelaySeconds":{},"f:periodSeconds":{},"f:successThreshold":{},"f:timeoutSeconds":{}},"f:resources":{},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{},"f:volumeMounts":{".":{},"k:{\"mountPath\":\"/etc/appgw/\"}":{".":{},"f:mountPath":{},"f:name":{},"f:readOnly":{}}}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{".":{},"f:runAsUser":{}},"f:serviceAccount":{},"f:serviceAccountName":{},"f:terminationGracePeriodSeconds":{},"f:volumes":{".":{},"k:{\"name\":\"azure\"}":{".":{},"f:hostPath":{".":{},"f:path":{},"f:type":{}},"f:name":{}}}}}},{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2022-09-22T22:09:48Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{"k:{\"type\":\"ContainersReady\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:message":{},"f:reason":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Initialized\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Ready\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:message":{},"f:reason":{},"f:status":{},"f:type":{}}},"f:containerStatuses":{},"f:hostIP":{},"f:startTime":{}}},"subresource":"status"}]},"spec":{"volumes":[{"name":"azure","hostPath":{"path":"/etc/kubernetes/","type":"Directory"}},{"name":"kube-api-access-xxnj5","projected":{"sources":[{"serviceAccountToken":{"expirationSeconds":3607,"path":"token"}},{"configMap":{"name":"kube-root-ca.crt","items":[{"key":"ca.crt","path":"ca.crt"}]}},{"downwardAPI":{"items":[{"path":"namespace","fieldRef":{"apiVersion":"v1","fieldPath":"metadata.namespace"}}]}}],"defaultMode":420}}],"containers":[{"name":"ingress-azure","image":"mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.6.0-rc1","envFrom":[{"configMapRef":{"name":"agic-cm-ingress-azure"}}],"env":[{"name":"AZURE_CLOUD_PROVIDER_LOCATION","value":"/etc/appgw/azure.json"},{"name":"AGIC_POD_NAME","v
alueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}},{"name":"AGIC_POD_NAMESPACE","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.namespace"}}}],"resources":{},"volumeMounts":[{"name":"azure","readOnly":true,"mountPath":"/etc/appgw/"},{"name":"kube-api-access-xxnj5","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"livenessProbe":{"httpGet":{"path":"/health/alive","port":8123,"scheme":"HTTP"},"initialDelaySeconds":15,"timeoutSeconds":1,"periodSeconds":20,"successThreshold":1,"failureThreshold":3},"readinessProbe":{"httpGet":{"path":"/health/ready","port":8123,"scheme":"HTTP"},"initialDelaySeconds":5,"timeoutSeconds":1,"periodSeconds":10,"successThreshold":1,"failureThreshold":3},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"Always"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","serviceAccountName":"agic-sa-ingress-azure","serviceAccount":"agic-sa-ingress-azure","nodeName":"aks-nodepool1-63215908-vmss000001","securityContext":{"runAsUser":0},"schedulerName":"default-scheduler","tolerations":[{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":300},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}],"priority":0,"enableServiceLinks":true,"preemptionPolicy":"PreemptLowerPriority"},"status":{"phase":"Pending","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2022-09-22T22:09:48Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2022-09-22T22:09:48Z","reason":"ContainersNotReady","message":"containers with unready status: [ingress-azure]"},{"type":"ContainersReady","status":"False","lastProbeTime":null,"lastTransitionTime":"2022-09-22T22:09:48Z","reason":"ContainersNotReady","message":"containers with unready status: 
[ingress-azure]"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2022-09-22T22:09:48Z"}],"hostIP":"10.224.0.5","startTime":"2022-09-22T22:09:48Z","containerStatuses":[{"name":"ingress-azure","state":{"waiting":{"reason":"ContainerCreating"}},"lastState":{},"ready":false,"restartCount":0,"image":"mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.6.0-rc1","imageID":"","started":false}],"qosClass":"BestEffort"}}
I0922 22:09:49.404718 1 environment.go:294] KUBERNETES_WATCHNAMESPACE is not set. Watching all available namespaces.
I0922 22:09:49.404744 1 main.go:118] Using User Agent Suffix='agic-ingress-azure-6b9b8b789c-trcsc' when communicating with ARM
I0922 22:09:49.404832 1 main.go:137] Application Gateway Details: Subscription="e7d1e563-b71b-41bc-bcf0-b2d07ae8b2f1" Resource Group="AthenaRG" Name="ApplicationGateway1"
I0922 22:09:49.404839 1 auth.go:57] Creating authorizer from Azure Managed Service Identity
I0922 22:09:49.404946 1 httpserver.go:57] Starting API Server on :8123
I0922 22:09:49.405785 1 client.go:129] Getting Application Gateway configuration.
- Any Azure support tickets associated with this issue.
After raising the verbosity level above 3, I saw this:
E0923 00:55:24.401136 1 client.go:186] Code="ErrorApplicationGatewayNotFound" Message="Unexpected status code '404' while performing a GET on Application Gateway." InnerError="azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/e7d1e563-b71b-41bc-bcf0-b2d07ae8b2f1/resourceGroups/AthenaRG/providers/Microsoft.Network/applicationGateways/ApplicationGateway1?api-version=2021-03-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: getting assigned identities for pod default/agic-ingress-azure-7c4b5998d9-4779v in ASSIGNED state failed after 20 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=341f9e84-b5c2-4b14-8668-713989321fcf&resource=https%!A(MISSING)%!F(MISSING)%!F(MISSING)management.azure.com%!F(MISSING)"
F0923 00:55:24.401236 1 main.go:175] Failed getting Application Gateway: Code="ErrorApplicationGatewayNotFound" Message="Unexpected status code '404' while performing a GET on Application Gateway." InnerError="azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/e7d1e563-b71b-41bc-bcf0-b2d07ae8b2f1/resourceGroups/AthenaRG/providers/Microsoft.Network/applicationGateways/ApplicationGateway1?api-version=2021-03-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: getting assigned identities for pod default/agic-ingress-azure-7c4b5998d9-4779v in ASSIGNED state failed after 20 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=341f9e84-b5c2-4b14-8668-713989321fcf&resource=https%3A%2F%2Fmanagement.azure.com%2F"
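The error points at AAD Pod Identity rather than at the Application Gateway itself, so a reasonable next step is to find which pod the identity lookup failed for and then inspect the identity resources and the MIC logs. A minimal sketch follows. The helper name failed_identity_pod is hypothetical, the Deployment name is inferred from the ReplicaSet name in the describe output, and the component=mic label selector is an assumption that depends on how aad-pod-identity was installed.

```shell
# Read AGIC log text on stdin and print each "namespace/pod" for which
# the identity assignment lookup failed (duplicates collapsed).
failed_identity_pod() {
  sed -n 's/.*getting assigned identities for pod \([^ ]*\) in ASSIGNED state.*/\1/p' | sort -u
}

# Usage against a live cluster (not executed here):
#   kubectl logs deploy/agic-ingress-azure | failed_identity_pod
#
# Then check whether an identity was ever bound and assigned:
#   kubectl get azureidentity,azureidentitybinding,azureassignedidentity --all-namespaces
#
# And read the MIC logs, as the error message suggests
# (label selector is an assumption; adjust to your install):
#   kubectl logs --selector component=mic --tail=100
```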
For what it's worth, I was subsequently able to make the AGIC work by using the Azure CLI command az aks addon enable --addon ingress-appgw, providing the ID of the app gateway as the --appgw-id. This was the same app gateway as I tried to use with the AGIC Helm chart.
I have noticed when comparing the environment variables that the AGIC helm chart sets up substantially different environment variables compared to the addon command, and I wonder if one combination of environment variables drives the ingress controller to behave very differently than the other.
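One way to check that suspicion is to dump both sets of variables as KEY=VALUE lines and list the keys that appear on only one side. A rough sketch, assuming the Helm install keeps its settings in the agic-cm-ingress-azure ConfigMap (as the describe output above shows) and that the addon keeps its settings in a ConfigMap named ingress-appgw-cm in kube-system. That second name is an assumption and should be verified on the cluster. keys_only_in_first is a hypothetical helper.

```shell
# Print the variable names that occur in the first KEY=VALUE file
# but not in the second one.
keys_only_in_first() {
  kof_a=$(mktemp)
  kof_b=$(mktemp)
  cut -d= -f1 "$1" | sort -u > "$kof_a"
  cut -d= -f1 "$2" | sort -u > "$kof_b"
  comm -23 "$kof_a" "$kof_b"   # lines unique to the first sorted file
  rm -f "$kof_a" "$kof_b"
}

# Usage against a live cluster (not executed here):
#   tmpl='{{range $k, $v := .data}}{{$k}}={{$v}}{{"\n"}}{{end}}'
#   kubectl get configmap agic-cm-ingress-azure -o go-template="$tmpl" > helm.env
#   kubectl -n kube-system get configmap ingress-appgw-cm -o go-template="$tmpl" > addon.env
#   keys_only_in_first helm.env addon.env    # set only by the Helm chart
#   keys_only_in_first addon.env helm.env    # set only by the addon
```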
Are there any new insights? I'm facing the exact same problem. I don't want to enable the addon because multiple clusters need to use the same Application Gateway, which the addon does not support.