application-gateway-kubernetes-ingress httpGet-inferred health probes inspect irrelevant pods (completed/terminated)

httpGet-inferred health probes inspect irrelevant pods (completed/terminated)

Open ohads-MSFT opened this issue 2 years ago • 1 comments

Describe the bug Health probe inference looks at the "first pod", namely whatever pod is first in the namespace & service-selector filtered pod cache list: https://github.com/Azure/application-gateway-kubernetes-ingress/blob/80971d9a2654219c7ad40e506a7825289a607f97/pkg/k8scontext/context.go#L471

Unfortunately, this pod can be an irrelevant leftover from past deployments, typically a failed pod that was not garbage-collected (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection

To Reproduce

Manually fail some pods matching a service, and keep them around (don't delete the failed/completed pods)
Change some probe-relevant property in the ingress of one of the relevant, running (non-failed) pods
Depending on the order of the cached pod list mentioned above, health probes will sometimes take the property from a failed pod

Proposed Solution

Filter out pods that are Completed/Terminated: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle
Prioritize pods that are Running, then Pending, then Unknown
Then Prioritize pods whose containers are all Running
Then prioritize pods by creation timestamp (newest first)

So basically something like: podList.Where(pod => pod.State not in ["Completed", "Terminated"]).SortBy(pod => pod.State).SortBy(pod => pod.AllContainersRunning).SortBy(pod => pod.StartTime)[0]

Ingress Controller details This is behavior present in the current code: https://github.com/Azure/application-gateway-kubernetes-ingress/blob/80971d9a2654219c7ad40e506a7825289a607f97/pkg/appgw/health_probes.go#L253

Mar 08 '23 15:03 ohads-MSFT

Seems to be the same issue.

You update the health probe of a deployment and you have some evicted pods from the old deployment. The ingress controller will update the ingress with the old health probes. This took some hours of me before examining the AGIC code to see where it takes the probe information from.

Our issue has been solved by removing the evicted or completed pods. But this should be fixed in the code.

Jun 02 '23 12:06 DeadDuck

application-gateway-kubernetes-ingress application-gateway-kubernetes-ingress copied to clipboard

httpGet-inferred health probes inspect irrelevant pods (completed/terminated)

application-gateway-kubernetes-ingress
application-gateway-kubernetes-ingress copied to clipboard