cloud-provider-azure
Health probes do not gracefully shut down traffic for externalTrafficPolicy Cluster services
What happened:
I have been investigating disruption of services behind load balancers with externalTrafficPolicy: Cluster (herein ETP Cluster). Currently, for an ETP Cluster service, the Azure health probes are configured to check the application's node port directly, using a blind TCP check. They can be configured to use an HTTP(S)-based check, but that only allows the path of the check to be specified; it doesn't allow you to specify a custom port.
When draining an application behind an ETP Local service, assuming you have configured graceful shutdown correctly, you can return a negative health probe during the graceful termination period while still accepting new connections. This allows you to signal to the load balancer that you are going away before you actually do, so it can route new connections to other nodes.
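For reference, here is a minimal sketch of the kind of backend I'm describing (the names, image, endpoint paths, and timings are illustrative, not the exact manifest from my tests): the readiness probe starts failing once the pod begins terminating, while a preStop sleep keeps the container serving for the remainder of the grace period.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-backend                     # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: test
  template:
    metadata:
      labels:
        k8s-app: test
    spec:
      terminationGracePeriodSeconds: 60  # illustrative value
      containers:
      - name: app
        image: example.com/test-app:latest   # placeholder image
        ports:
        - containerPort: 8080
        # On pod deletion the app keeps serving existing and new connections,
        # but starts failing its readiness endpoint, so in-cluster consumers
        # see it as "going away" before it actually stops serving.
        readinessProbe:
          httpGet:
            path: /readyz                # hypothetical readiness endpoint
            port: 8080
          periodSeconds: 1
          failureThreshold: 1
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "30"]   # assumes the image ships a sleep binary
```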
In an ETP Cluster service, this doesn't work. With the same backend, during the graceful shutdown period, the application will start showing a negative readiness probe within the cluster and will be removed from the service endpoints. To the node port, however, this is completely opaque; it has no idea that the pod has been removed from the node.
As soon as the application goes into an unready state, kube-proxy (or equivalent) on the node will start routing traffic to other endpoints within the service. The load balancer still believes that the instance is healthy, no matter how you configure the health checks. It is only once the node itself can no longer route traffic to these other endpoints that the health probes start to fail. Because we are, effectively, health checking another load balancer, this can only happen after the node can no longer accept new connections (typically this means kube-proxy has shut down).
What you expected to happen:
I should be able to configure a health probe that tells Azure that my instance will shortly no longer be able to serve traffic, allowing it to remove the instance from service before the instance actually stops accepting new connections.
How to reproduce it (as minimally and precisely as possible):
I've created a test you can run that observes how much a service is disrupted when we upgrade a cluster (we run this within CI as part of OpenShift's release process).
To run it, you will need:
- A copy of `oc` on your PATH (download here)
- The `disruption-test` branch checked out from github.com/JoelSpeed/origin
- An Azure cluster running, with your `KUBECONFIG` var pointing to the cluster

To execute, run `go test ./test/e2e/upgrade/service/disruption-test -v --timeout=60m` from the checked-out repo above.
Note: this test takes approximately 30 minutes to run and will fail if there is any disruption to the service. We typically see between 3 and 6 seconds of disruption, where we see 0 seconds with the equivalent test for an ETP Local service.
The test runs through each node in the cluster, drains the instance, reboots it, then uncordons it before moving on to the next instance.
Anything else we need to know?:
I don't think you necessarily need to reproduce this, as we can discuss it logically.
Since the service and kube-proxy handle the load balancing within the cluster, health checking the application itself is not helpful, from an Azure perspective, in deciding whether to route traffic to the instance. Instead, we need to ascertain whether the instance is able to route traffic to the backend. This relies on whatever is implementing the node ports; in most cases this is kube-proxy.
Based on that, we should logically be checking kube-proxy's health, and not the node port itself.
In fact, GCP came to the same conclusion: for an ETP Cluster service (both internal and external), they check port 10256 with path /healthz by default, which is exposed by kube-proxy.
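To see what that endpoint looks like on a node, here is a sketch of a host-network pod that curls the local proxy health endpoint (the pod name and image are placeholders, and it assumes kube-proxy, or ovn-k in my testing, is serving on its default 0.0.0.0:10256):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: proxy-healthz-check       # illustrative name
spec:
  hostNetwork: true               # so 127.0.0.1:10256 is this node's own proxy health endpoint
  restartPolicy: Never
  containers:
  - name: check
    image: curlimages/curl:latest   # placeholder image; any image with curl works
    # kube-proxy serves /healthz on 0.0.0.0:10256 by default; this is the
    # endpoint GCP probes target and the one proposed for Azure below.
    command: ["curl", "-f", "-v", "http://127.0.0.1:10256/healthz"]
```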
I've been testing a patch that is very much WIP right now, but it moves Azure's default ETP Cluster check to match GCP's and, in some runs of the tests above, I have achieved 0s of disruption on Azure as well. Because the Azure LB can now see when the instance is about to go unhealthy, it can remove the instance from service earlier, meaning fewer dropped connections.
I would like to propose that we make that patch a feature of the Azure CCM and make sure that, by default, when no other health probe configuration is specified, Azure ETP Cluster services health check port 10256 with path /healthz instead of their current behaviour.
Some other notes:
- I've been testing with ovn-kubernetes rather than kube-proxy; I'm working with the team there, who have made ovn-k behave in the same way as kube-proxy for my testing, so the results should be the same
- We also tried this with a patch that prevents Nodes from being removed from the load balancer when they are unready according to their status. We believe this is causing separate issues and I will raise a separate issue for it. Without the improved health checks, we were seeing 7-15s of disruption with this patch; with the improved health checks, we saw 0s.
- @damdo and @ricky-rav have been helping me to put together the various patches we needed to get to this point
- I was using the kubelet's graceful node shutdown feature to test this; without it, disruption is around 4s while not removing instances from the load balancer, which is still better than the 7-15 seconds we were seeing (see the configuration sketch below)
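For completeness, a minimal sketch of the kubelet configuration I mean by "graceful node shutdown" (the durations are illustrative, not the values from my tests):

```yaml
# KubeletConfiguration fragment enabling graceful node shutdown.
# The GracefulNodeShutdown feature gate is enabled by default on recent releases.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 120s              # total time the node waits before shutting down
shutdownGracePeriodCriticalPods: 20s   # portion of that reserved for critical pods
```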
Environment:
- Kubernetes version (use `kubectl version`): 1.26.2
- Cloud provider or hardware configuration: Azure
- OS (e.g: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
- Install tools: OpenShift 4.13 pre-release
- Network plugin and version (if this is a network-related bug): ovn-kubernetes (OpenShift 4.13 pre-release)
- Others:
I think this KEP, which talks about graceful draining of connections, might be related; I wonder if the Azure maintainers have seen it?
The following annotation was added to change the probe port number for a service, FYI: `service.beta.kubernetes.io/port_{port}_health-probe_port`
Hi @MartinForReal, thanks for the suggestion. Have you confirmed that this actually works with the Azure cloud provider?
I'm on PTO right now so don't have a laptop handy, but I did go through the code pretty thoroughly before submitting this issue and came to the conclusion that there was no way to override the port. I saw other annotations of a similar format, but this one didn't appear to be supported in Azure.
I think it is added. https://github.com/kubernetes-sigs/cloud-provider-azure/blob/b8b4145fdaf3578616979437c60acc90ca4ec249/pkg/provider/azure_loadbalancer.go#L2119
Reviewing that, it looks like we need to have the port in the service for it to be used as a health check, right? https://github.com/kubernetes-sigs/cloud-provider-azure/blob/b8b4145fdaf3578616979437c60acc90ca4ec249/pkg/provider/azure_loadbalancer.go#L2145
So in that case, you'd expect to add a dummy port 10256 to the service, and then set service.beta.kubernetes.io/port_10256_health-probe_only, have I understood correctly?
I've tried the above today with no luck. If I specify a port annotation to redirect the health check port on my service to port 10256, it errors because there is no port within the service with port 10256. If I were to create a port, that doesn't actually help me because I can't force the node port to be port 10256.
The service I was trying is as below:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: test-service
  annotations:
    service.beta.kubernetes.io/port_80_health-probe_port: "10256"
    service.beta.kubernetes.io/port_10256_no_probe_rule: "true"
    service.beta.kubernetes.io/port_10256_no_lb_rule: "true"
spec:
  selector:
    k8s-app: test
  ports:
  - name: app
    protocol: TCP
    port: 80
  - name: healthz
    protocol: TCP
    port: 10256
  type: LoadBalancer
  externalTrafficPolicy: Cluster
```
So how can I get the load balancer health probe port to be an arbitrary port that isn't within the bounds of the service?
Cc @alexanderConstantinescu, I know you've been working on improving disruption to services; could you check out this issue and see if you agree, or if there's anything more you can add?
Thanks for the discussion. We should fix this, and I have added it to the v1.28 milestone. We will discuss and add a new design for the enhancement.
/assign @nilo19
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".