application-gateway-kubernetes-ingress

Critical errors are not retried/terminating (especially in ApplicationGatewaysClient#CreateOrUpdate)

Open ohadschn opened this issue 3 years ago • 1 comments

Describe the bug

Looking at the worker code, it seems that errors are never retried, just logged: https://github.com/Azure/application-gateway-kubernetes-ingress/blob/b01f82a6acdd75e8b42e3b974bbd0489167f8468/pkg/worker/worker.go#L62

The problem is that some of these errors might be critical, specifically errors encountered in ApplicationGatewaysClient#CreateOrUpdate which basically mean the app gateway hasn't been updated. That in turn could easily mean that all the backend pools are down and essentially nothing works (which was the case for us). Specifically, we encountered the following error:

network.ApplicationGatewaysClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="AuthorizationFailed" Message="The client 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' does not have authorization to perform action 'Microsoft.Network/applicationGateways/write' over scope '/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/foo/providers/Microsoft.Network/applicationGateways/foo' or the scope is invalid. If access was recently granted, please refresh your credentials."

We later granted the proper permissions to the AGIC identity, but since the above was never retried, it didn't help and we had to manually kill the AGIC pod.

In fact, IMHO critical errors like this should crash the container:

  1. The bad state will be emphasized - the pod would be in a crash loop backoff, so it will show in dashboards, metrics, etc.
  2. You would get the retry for free (the pod will restart the container upon each crash)
  3. The relevant error will be very easy to spot - it would simply be the last line in kubectl logs -p (since it caused the crash)

To Reproduce

Steps to reproduce the behavior:

  1. Remove all role assignments (specifically App Gateway Reader & Contributor) from the AGIC identity, AKA aksCluster.properties.addonProfiles.ingressApplicationGateway.identity
  2. Make some update to the ingress and wait for AGIC to pick up the event (handling will fail on AuthorizationFailed)
  3. Restore role assignments for the AGIC identity

At this point the AGIC identity has the proper permissions to modify the gateway, but it never will (unless some relevant resource that is monitored by the informers changes).

Ingress Controller details

Output of kubectl logs <ingress controller>:

E0502 15:44:20.509333       1 controller.go:141] network.ApplicationGatewaysClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="AuthorizationFailed" Message="The client 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' does not have authorization to perform action 'Microsoft.Network/applicationGateways/write' over scope '/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/foo/providers/Microsoft.Network/applicationGateways/foo' or the scope is invalid. If access was recently granted, please refresh your credentials."
E0502 15:44:20.509357       1 worker.go:62] Error processing event.network.ApplicationGatewaysClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="AuthorizationFailed" Message="The client 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' does not have authorization to perform action 'Microsoft.Network/applicationGateways/write' over scope '/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/foo/providers/Microsoft.Network/applicationGateways/foo' or the scope is invalid. If access was recently granted, please refresh your credentials."
I0502 15:44:20.509529       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"ingress-appgw-deployment-6899f8459b-h6ddz", UID:"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", APIVersion:"v1", ResourceVersion:"xxxxxxxx", FieldPath:""}): type: 'Warning' reason: 'FailedApplyingAppGwConfig' network.ApplicationGatewaysClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="AuthorizationFailed" Message="The client 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' with object id 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' does not have authorization to perform action 'Microsoft.Network/applicationGateways/write' over scope '/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/foo/providers/Microsoft.Network/applicationGateways/foo' or the scope is invalid. If access was recently granted, please refresh your credentials."

ohadschn, May 03 '22 17:05

We have the same problem (with the newly required "Network Contributor" role for the AGIC managed ID). We use Bicep, but there is an inevitable delay of a couple of minutes between the app gateway deployment and the permission assignment. During that window, app gw tries to set up the listeners etc., fails, and does not try again. Consequently, our TLS does not get set up and our apps are unreachable. We have "solved" this by deleting the ingress pod at the end of the deployment, which causes it to re-spawn and go through the setup again. It works, but it is not nice. It would be better if the ingress controller continued to retry (or re-spawned as suggested above).

bhavenst, Jul 05 '23 22:07