cloud-provider-aws Update health checks to use Kube-Proxy when no other configuration is provided

Open JoelSpeed opened this issue 1 year ago • 17 comments

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a service uses the default configuration (i.e. none of port, path or protocol has been changed via annotations), this updates the health check to check the kube-proxy health endpoint rather than the traffic port.

This is deliberately done so that if a user has specified any change to the port/path/protocol, we do not change the behaviour of their health checks. But if they have no opinion at all, we give them an improved health check.
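As a rough illustration of that rule (not the provider's actual Go code), the selection could be sketched as below. The annotation keys are the existing `service.beta.kubernetes.io/aws-load-balancer-healthcheck-*` annotations, and 10256 with `/healthz` is kube-proxy's default health endpoint. The `HealthCheck` type, the `choose_health_check` helper and the fallback values used when only some annotations are set are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Mapping

# Health check override annotations recognised by the AWS provider.
ANNOTATION_PREFIX = "service.beta.kubernetes.io/aws-load-balancer-healthcheck-"
PORT_KEY = ANNOTATION_PREFIX + "port"
PATH_KEY = ANNOTATION_PREFIX + "path"
PROTOCOL_KEY = ANNOTATION_PREFIX + "protocol"

# kube-proxy serves its health endpoint on port 10256 at /healthz by default.
KUBE_PROXY_HEALTH_PORT = "10256"
KUBE_PROXY_HEALTH_PATH = "/healthz"


@dataclass(frozen=True)
class HealthCheck:
    """Hypothetical container for the resulting health check settings."""
    protocol: str
    port: str
    path: str


def choose_health_check(annotations: Mapping[str, str]) -> HealthCheck:
    """Pick health check settings for a service (illustrative only)."""
    user_has_opinion = any(
        key in annotations for key in (PORT_KEY, PATH_KEY, PROTOCOL_KEY)
    )
    if user_has_opinion:
        # Any override means the existing behaviour is left untouched.
        # The fallbacks here are placeholders, not the provider's real defaults.
        return HealthCheck(
            protocol=annotations.get(PROTOCOL_KEY, "TCP"),
            port=annotations.get(PORT_KEY, "traffic-port"),
            path=annotations.get(PATH_KEY, "/"),
        )
    # No opinion at all: check kube-proxy instead of the traffic port.
    return HealthCheck("HTTP", KUBE_PROXY_HEALTH_PORT, KUBE_PROXY_HEALTH_PATH)
```

The point of the `any(...)` test is that a single override, even of only the path, is enough to opt the service out of the new default.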

Which issue(s) this PR fixes:

KEP 3458 was added to the AWS provider in release 1.27. This means that a node's presence in the load balancer backends is no longer determined by whether the instance is healthy or not.

We have observed long periods of disruption (15-20s) during upgrades with the AWS provider since this went in. The issue here is that the health check is checking the traffic port, and not the node's ability to route traffic to the backend.

This is specifically for externalTrafficPolicy: Cluster. In this topology, all nodes in the cluster accept connections for the service and then route them internally to a backend. So, unless all endpoints are down, the exposed traffic port will always return a healthy result. When a node is going away, its ability to route traffic from the traffic port to the backend depends on whether kube-proxy is running. It therefore makes more sense to health check kube-proxy itself.

When kube-proxy is being terminated, assuming you have configured graceful shutdown, it will fail health checks for a period before shutting down. This allows the AWS health check to observe that the node is going away before it loses the ability to route traffic to the backend. With the current behaviour, health checks only fail after the node can no longer route traffic.
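The timing argument can be made concrete with a deliberately simplified model. It assumes the load balancer marks a target unhealthy after roughly `interval * unhealthy_threshold` seconds of failing checks, and it ignores probe alignment and timeouts. The function name, its parameters and the numbers in the usage note are hypothetical and are not measurements from this PR.

```python
def disruption_seconds(interval: float, unhealthy_threshold: int,
                       fail_lead_time: float) -> float:
    """Approximate time the load balancer keeps sending traffic to a node
    that can no longer route it.

    interval:            seconds between health check probes
    unhealthy_threshold: consecutive failures before the target is removed
    fail_lead_time:      seconds the health check has already been failing
                         when the node loses the ability to route traffic
                         (0 for a traffic-port check, the graceful shutdown
                         period for a kube-proxy check)
    """
    detection_time = interval * unhealthy_threshold
    # Failures seen before routing stops count towards detection,
    # so they shrink the window in which traffic is dropped.
    return max(0.0, detection_time - fail_lead_time)
```

With an assumed 10 second interval and a threshold of 2, a traffic-port check (`fail_lead_time=0`) leaves a window of about 20 seconds in this model, while a kube-proxy check with a 30 second graceful shutdown period leaves none.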

I did a lot of research into this earlier in the year and wrote up kubernetes-sigs/cloud-provider-azure#3499, which explains the problem in a little more depth. Note as well that GCP uses kube-proxy based health checks in this manner, and we are working to move Azure over too.

There's also KEP-3836, which suggests this is the way forward as well.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Update health checks to use Kube-Proxy when no other configuration is provided

Jun 28 '23 11:06 JoelSpeed
