cloud-provider-azure
Health probes do not gracefully shut down traffic for externalTrafficPolicy Cluster services
What happened:
I have been investigating disruption of services behind load balancers with externalTrafficPolicy: Cluster (herein ETP Cluster). Currently, for an ETP Cluster service, the Azure health probes are configured to check the application's node port directly, using a blind TCP check. They can be configured to use an HTTP(S)-based check, but that only allows the path of the check to be specified; it doesn't allow you to specify a custom port.
When draining an application behind an ETP Local service, assuming you have configured graceful shutdown correctly, you can return a negative health probe during the graceful termination period while still accepting new connections. This allows you to signal to the load balancer that you are going away before you actually do, so it can route new connections to other nodes.
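For reference, here is a minimal sketch of the kind of backend I'm describing (the names, image, endpoint paths, and timings are illustrative, not the exact manifest from my tests): the readiness probe starts failing once the pod begins terminating, while a preStop sleep keeps the container serving for the remainder of the grace period.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-backend                     # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: test
  template:
    metadata:
      labels:
        k8s-app: test
    spec:
      terminationGracePeriodSeconds: 60  # illustrative value
      containers:
      - name: app
        image: example.com/test-app:latest   # placeholder image
        ports:
        - containerPort: 8080
        # On pod deletion the app keeps serving existing and new connections,
        # but starts failing its readiness endpoint, so in-cluster consumers
        # see it as "going away" before it actually stops serving.
        readinessProbe:
          httpGet:
            path: /readyz                # hypothetical readiness endpoint
            port: 8080
          periodSeconds: 1
          failureThreshold: 1
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "30"]   # assumes the image ships a sleep binary
```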
In an ETP Cluster service, this doesn't work. With the same backend, during the graceful shutdown period, the application will start showing a negative readiness probe within the cluster and will be removed from the service endpoints. To the node port, however, this is completely opaque; it has no idea that the pod has been removed from the node.
As soon as the application goes into an unready state, kube-proxy (or equivalent) on the node will start routing traffic to other endpoints within the service. The load balancer still believes that the instance is healthy, no matter how you configure the health checks. It is only once the node itself can no longer route traffic to these other endpoints that the health probes start to fail. Because we are, effectively, health checking another load balancer, this can only happen after the node can no longer accept new connections (typically this means kube-proxy has shut down).
What you expected to happen:
I should be able to configure a health probe that tells Azure that my instance will shortly no longer be able to serve traffic, allowing it to remove the instance from service before the instance actually stops accepting new connections.
How to reproduce it (as minimally and precisely as possible):
I've created a test you can run that observes how much a service is disrupted when we upgrade a cluster (we run this within CI as part of OpenShift's release process).
To run it, you will need:
- A copy of `oc` on your PATH (download here)
- The `disruption-test` branch checked out from github.com/JoelSpeed/origin
- An Azure cluster running, with your `KUBECONFIG` var pointing to the cluster

To execute, run `go test ./test/e2e/upgrade/service/disruption-test -v --timeout=60m` from the checked-out repo above.
Note: this test takes approximately 30 minutes to run and will fail if there is any disruption to the service. We typically see between 3 and 6 seconds of disruption, where we see 0 seconds with the equivalent test for an ETP Local service.
The test runs through each node in the cluster, drains the instance, reboots it, then uncordons it before moving on to the next instance.
Anything else we need to know?:
I don't think you necessarily need to reproduce this, as we can discuss it logically.
Since the service and kube-proxy handle the load balancing within the cluster, health checking the application itself is not helpful, from an Azure perspective, in deciding whether to route traffic to the instance. Instead, we need to ascertain whether the instance is able to route traffic to the backend. This relies on whatever is implementing the node ports; in most cases this is kube-proxy.
Based on that, we should logically be checking kube-proxy's health, and not the node port itself.
In fact, GCP came to the same conclusion: for an ETP Cluster service (both internal and external), they check port 10256 with path /healthz by default, which is exposed by kube-proxy.
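To see what that endpoint looks like on a node, here is a sketch of a host-network pod that curls the local proxy health endpoint (the pod name and image are placeholders, and it assumes kube-proxy, or ovn-k in my testing, is serving on its default 0.0.0.0:10256):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: proxy-healthz-check       # illustrative name
spec:
  hostNetwork: true               # so 127.0.0.1:10256 is this node's own proxy health endpoint
  restartPolicy: Never
  containers:
  - name: check
    image: curlimages/curl:latest   # placeholder image; any image with curl works
    # kube-proxy serves /healthz on 0.0.0.0:10256 by default; this is the
    # endpoint GCP probes target and the one proposed for Azure below.
    command: ["curl", "-f", "-v", "http://127.0.0.1:10256/healthz"]
```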
I've been testing a patch that is very much WIP right now, but it moves Azure's default ETP Cluster check to match GCP's and, in some runs of the tests above, I have achieved 0s of disruption on Azure as well. Because the Azure LB can now see when the instance is about to go unhealthy, it can remove the instance from service earlier, meaning fewer dropped connections.
I would like to propose that we make that patch a feature of the Azure CCM and make sure that, by default, when no other health probe configuration is specified, Azure ETP Cluster services health check port 10256 with path /healthz instead of their current behaviour.
Some other notes:
- I've been testing with ovn-kubernetes rather than kube-proxy; I'm working with the team there, who have made ovn-k behave in the same way as kube-proxy for my testing, so the results should be the same
- We also tried this with a patch that prevents Nodes from being removed from the load balancer when they are unready according to their status. We believe this is causing separate issues and I will raise a separate issue for it. Without the improved health checks, we were seeing 7-15s of disruption with this patch; with the improved health checks, we saw 0s.
- @damdo and @ricky-rav have been helping me to put together the various patches we needed to get to this point
- I was using the kubelet's graceful node shutdown feature to test this; without it, disruption is around 4s while not removing instances from the load balancer, which is still better than the 7-15 seconds we were seeing (see the configuration sketch below)
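For completeness, a minimal sketch of the kubelet configuration I mean by "graceful node shutdown" (the durations are illustrative, not the values from my tests):

```yaml
# KubeletConfiguration fragment enabling graceful node shutdown.
# The GracefulNodeShutdown feature gate is enabled by default on recent releases.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 120s              # total time the node waits before shutting down
shutdownGracePeriodCriticalPods: 20s   # portion of that reserved for critical pods
```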
Environment:
- Kubernetes version (use `kubectl version`): 1.26.2
- Cloud provider or hardware configuration: Azure
- OS (e.g: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
- Install tools: OpenShift 4.13 pre-release
- Network plugin and version (if this is a network-related bug): ovn-kubernetes (OpenShift 4.13 pre-release)
- Others:
I think this KEP, which talks about graceful draining of connections, might be related; I wonder if the Azure maintainers have seen it?
The following annotation was added to change the probe port number for a service, FYI: `service.beta.kubernetes.io/port_{port}_health-probe_port`
Hi @MartinForReal, thanks for the suggestion. Have you confirmed that this actually works with the Azure cloud provider?
I'm on PTO right now so don't have a laptop handy, but I did go through the code pretty thoroughly before submitting this issue and came to the conclusion that there was no way to override the port. I saw other annotations of a similar format, but this one didn't appear to be supported in Azure.
I think it is added. https://github.com/kubernetes-sigs/cloud-provider-azure/blob/b8b4145fdaf3578616979437c60acc90ca4ec249/pkg/provider/azure_loadbalancer.go#L2119
Reviewing that, it looks like we need to have the port in the service for it to be used as a health check, right? https://github.com/kubernetes-sigs/cloud-provider-azure/blob/b8b4145fdaf3578616979437c60acc90ca4ec249/pkg/provider/azure_loadbalancer.go#L2145
So in that case, you'd expect to add a dummy port 10256 to the service, and then set service.beta.kubernetes.io/port_10256_health-probe_only, have I understood correctly?
I've tried the above today with no luck. If I specify a port annotation to redirect the health check port on my service to port 10256, it errors because there is no port within the service with port 10256. If I were to create a port, that doesn't actually help me because I can't force the node port to be port 10256.
The service I was trying is as below:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: test-service
  annotations:
    service.beta.kubernetes.io/port_80_health-probe_port: "10256"
    service.beta.kubernetes.io/port_10256_no_probe_rule: "true"
    service.beta.kubernetes.io/port_10256_no_lb_rule: "true"
spec:
  selector:
    k8s-app: test
  ports:
  - name: app
    protocol: TCP
    port: 80
  - name: healthz
    protocol: TCP
    port: 10256
  type: LoadBalancer
  externalTrafficPolicy: Cluster
```
So how can I get the load balancer health probe port to be an arbitrary port that isn't within the bounds of the service?
Cc @alexanderConstantinescu, I know you've been working on improving disruption to services; could you check out this issue and see if you agree, or if there's anything more you can add?
Thanks for the discussion. We should fix this, and I have added it to the v1.28 milestone. We will discuss and add a new design for the enhancement.
/assign @nilo19
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".