cloud-provider-openstack
[occm]: introduce readiness gates for pods
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
The current LoadBalancer service implementation has a flaw: if a node's network is broken during a deployment update, but kubelet still advertises the pod as up and ready, the rollout can cause an outage by shutting down the old, healthy pods.
What you expected to happen:
There should be a way to monitor a pod's network readiness from outside the k8s cluster using pod readiness gates. See this video for more details: https://www.youtube.com/watch?v=Vw9GmSeomFg&t=289s
Anything else we need to know?:
PR #1720 introduced support for externalTrafficPolicy: Local, which adds kube-proxy-based monitors to LB pool members. We can keep this logic for OCCM configured without a router.
However, for OCCM with a router configured, we can use the service's endpoints instead of node ports and patch the pods' readiness gates according to the load balancer's member healthchecks. This approach would increase deployment update time (because of the LB healthcheck latency), but on the other hand it improves overall deployment availability. Additional advantages: more even traffic distribution between pods and pod-based traffic affinity.
See also https://cloud.google.com/kubernetes-engine/docs/concepts/container-native-load-balancing#pod_readiness and https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.1/deploy/pod_readiness_gate/ for reference. Unlike the dedicated GCE and AWS ingress/LB controllers, I'd like to implement the readiness gates feature directly in the OCCM controller, so it can be toggled seamlessly, especially for existing deployments.
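To make the readiness-gate part concrete, here is a minimal sketch of the patching step, assuming a hypothetical condition type and that the controller already knows whether the pod's Octavia member is ONLINE (how that status is obtained is left open):

```go
package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical condition type that pods would list under spec.readinessGates.
const lbReadyCondition = "openstack.org/load-balancer-member-ready"

// syncReadinessGate patches the pod's status condition based on whether the
// corresponding Octavia pool member is reported as ONLINE by the health monitor.
func syncReadinessGate(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod, memberOnline bool) error {
	status := corev1.ConditionFalse
	if memberOnline {
		status = corev1.ConditionTrue
	}
	// Strategic merge patch merges status.conditions by "type", so only the
	// load-balancer condition is touched.
	patch := []byte(fmt.Sprintf(
		`{"status":{"conditions":[{"type":%q,"status":%q,"reason":"OctaviaMemberOperatingStatus"}]}}`,
		lbReadyCondition, status))
	_, err := cs.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}
```

A pod opts in by listing that condition type under spec.readinessGates; kubelet then keeps the pod out of Ready until the controller sets the condition to True (the controller needs RBAC to patch pods/status).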
Feel free to provide your suggestions or objections.
cc @databus23 @jichenjc @zetaab
Can the current OCCM load balancer implementation route traffic to service endpoints? I cannot see anything in the code, so I think it's not supported (yet?).
Another thing I am thinking about: when you create an Octavia Amphora load balancer, how are you going to add routes to it? The VRRP ports are visible under ports, but I'm not sure whether you can modify routes on those, or how this would work at all. Can you add a service endpoint (pod IP) as a target of the load balancer (given that a pod IP is not a port in OpenStack, unless you use something like Kuryr)? In that case you might also need to modify OpenStack port security policies.
To me this looks like you are not running an overlay network inside the Kubernetes cluster at all, but rather using Kuryr or similar, with all pods located in an OpenStack network?
Can the current OCCM load balancer implementation route traffic to service endpoints? I cannot see anything in the code, so I think it's not supported (yet?).
If routes are defined on the router configured for the private network, then traffic from the load balancer to a particular pod CIDR is routed through the router to the corresponding node, as sketched below.
Another thing I am thinking about: when you create an Octavia Amphora load balancer, how are you going to add routes to it?
routes are configured on the router, see above.
Can you add a service endpoint (pod IP) as a target of the load balancer (given that a pod IP is not a port in OpenStack, unless you use something like Kuryr)?
yes, see above.
To me this looks like you are not running an overlay network inside the Kubernetes cluster at all, but rather using Kuryr or similar, with all pods located in an OpenStack network?
Right, but this shouldn't be a requirement for direct pod routing.
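For reference, the "routes on the router" part is just Neutron extra routes. A rough sketch with gophercloud follows; the router ID, pod CIDR and node address are placeholders, and the exact UpdateOpts field shapes may differ between gophercloud versions:

```go
package sketch

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/layer3/routers"
)

// addPodCIDRRoute adds an extra route on the Neutron router so that traffic
// for a node's pod CIDR is forwarded to that node's address, which is what
// lets the load balancer reach pod IPs directly.
func addPodCIDRRoute(net *gophercloud.ServiceClient, routerID, podCIDR, nodeAddr string) error {
	routes := []routers.Route{{
		DestinationCIDR: podCIDR,  // e.g. "100.101.4.0/24"
		NextHop:         nodeAddr, // the node's address on the router's subnet
	}}
	// Note: this overwrites the router's route list; a real implementation
	// would merge with the existing routes first.
	_, err := routers.Update(net, routerID, routers.UpdateOpts{Routes: &routes}).Extract()
	return err
}
```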
However, for OCCM with a router configured, we can use the service's endpoints instead of node ports and patch the pods' readiness gates according to the load balancer's member healthchecks.
So this is the key change proposed, correct? Use the svc's endpoints, and if a pod is an LB member's backend, then when the LB monitor detects something wrong, the pod will be marked unhealthy?
@jichenjc correct
This sounds like something that makes assumptions about how the CNI would react - i.e. would it route traffic arriving at the node with a pod IP into the actual pod netns? This doesn't feel right when the CNI does encapsulation. What CNI does this proposal have in mind?
We use flannel in our setup:
```
# iptables-save | grep 100.101.0.0/24
-A POSTROUTING ! -s 100.101.0.0/16 -d 100.101.0.0/24 -m comment --comment "flanneld masq" -j RETURN
```
I'm not sure how it works with other CNIs.
This doesn't feel right when the CNI does encapsulation
Good point, thanks for the reminder. If something like Calico does not work with this, then we need to rethink how we can achieve it.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
@dulek Is this related to your security group PR in any way? Are we happy letting this close? Do we have another solution?
/remove-lifecycle rotten
@mdbooth I still need this
@dulek Is this related to your security group PR in any way? Are we happy letting this close? Do we have another solution?
I don't see it as related to my work. If I understand the problem here correctly, it's about the LB members being "ONLINE" by default. I discussed that issue with the Octavia folks at the OpenInfra Summit. We can't really expect this to be changed in Octavia, as Octavia mimics the behavior of hardware LBs, which work this way.
The solution they offered is to make the health monitor periods smaller. I bet that doesn't really work here?
Have we considered adding new LB members as disabled and only enabling them after some timeout?
Have we considered adding new LB members as disabled and only enabling them after some timeout?
good idea. I need to test it in my env.
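A minimal sketch of what "add as disabled, enable later" could look like with gophercloud; the pool ID, pod IP and port are placeholders, and when exactly the enable step fires is the open question:

```go
package sketch

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/pools"
)

// addMemberDisabled creates the pool member with admin_state_up=false, so the
// LB does not route to it until we explicitly flip it on.
func addMemberDisabled(lb *gophercloud.ServiceClient, poolID, podIP string, port int) (*pools.Member, error) {
	down := false
	return pools.CreateMember(lb, poolID, pools.CreateMemberOpts{
		Address:      podIP,
		ProtocolPort: port,
		AdminStateUp: &down,
	}).Extract()
}

// enableMember flips admin_state_up to true once the pod (or a timeout) says
// the backend is actually ready to receive traffic.
func enableMember(lb *gophercloud.ServiceClient, poolID, memberID string) error {
	up := true
	_, err := pools.UpdateMember(lb, poolID, memberID, pools.UpdateMemberOpts{
		AdminStateUp: &up,
	}).Extract()
	return err
}
```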
Have we considered adding new LB members as disabled and only enabling them after some timeout?
good idea. I need to test it in my env.
I won't be surprised if disabled ones aren't even evaluated by the health monitors, but please check that.
Also can you confirm that my understanding of the problem is correct? I'd love to find a viable solution to this.
Also can you confirm that my understanding of the problem is correct?
right
So this issue appears to be more critical than before. I ran a number of tests, and neither creating a member in backup state nor in disabled state helped. Even worse: every member state update causes an outage until the healthcheck verdict is updated.
Can you elaborate on how the healthcheck system is configured for you then?
Every member add/update action retriggers the members' healthchecks. If only 1 of 20 members is active, traffic will reach all 20 members until the health monitor has verified all pool members. The default health monitor is 3 tries with a 20-second delay, i.e. about 1 minute, so for up to a minute traffic will be forwarded to inactive members.
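For context, those are the knobs that "making the health monitor periods smaller" would shrink. A sketch of such a monitor with gophercloud, using the values described above (pool ID is a placeholder):

```go
package sketch

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/monitors"
)

// createTCPMonitor creates a TCP health monitor on the pool. With the values
// described above (3 tries, 20s delay) the monitor needs roughly
// delay * retries ~= 60s to settle the members' operating status, which is
// the window where traffic can still reach inactive members.
func createTCPMonitor(lb *gophercloud.ServiceClient, poolID string) (*monitors.Monitor, error) {
	return monitors.Create(lb, monitors.CreateOpts{
		PoolID:         poolID,
		Type:           "TCP",
		Delay:          20, // seconds between health checks
		Timeout:        10, // per-check timeout (seconds)
		MaxRetries:     3,  // successful checks before a member goes ONLINE
		MaxRetriesDown: 3,  // failed checks before a member is marked down
	}).Extract()
}
```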
Wait, so when I add a member, Octavia/Amphora will reset states and consider all members to be healthy until all healthchecks are resolved? That sounds like an Octavia bug to me.
@dulek right. can you check this behavior in your env?
Another drawback of externalTrafficPolicy: Local: removing a pod from an existing service will cause downtime, since the LB pool member healthchecks have a delay.
Example: you have 4 nodes with 4 pods (node anti-affinity) and a single service with type: LoadBalancer and externalTrafficPolicy: Local:
node1 -> pod1
node2 -> pod2
node3 -> pod3
node4 -> pod4
pool members will look like:
member1 -> node1 -> pod1
member2 -> node2 -> pod2
member3 -> node3 -> pod3
member4 -> node4 -> pod4
Scaling the deployment down to 3 replicas will remove a pod from a node (e.g. node4), but due to the LB healthcheck latency, there is a high chance that a new connection will still be forwarded to node4, which may cause a connection timeout:
member1 -> node1 -> pod1
member2 -> node2 -> pod2
member3 -> node3 -> pod3
member4 -> node4 -> X
The same problem occurs during regular deployment updates, when a new pod is created and an old pod is removed.
So far my assumption is that the readiness gate controller must be event-driven and proactively disable (or mark offline) the corresponding members when a pod is being destroyed; see the sketch below.
UPD: I found that the ProxyTerminatingEndpoints feature gate was promoted to beta in 1.26 (https://github.com/kubernetes/kubernetes/issues/85643). The behavior with/without this feature is a bit different, but new connections still get broken in both cases.
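A rough sketch of the event-driven part, assuming the controller keeps a pod-IP to member-ID mapping per pool (all names here are hypothetical); whether disabling the member avoids the state-reset problem described above is exactly what still needs testing:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/pools"
)

// onPodTerminating would be wired to the pod informer's update/delete events
// and fire as soon as the pod gets a deletion timestamp, instead of waiting
// for the LB health monitor to notice that the backend is gone.
func onPodTerminating(lb *gophercloud.ServiceClient, poolID string, memberIDByPodIP map[string]string, pod *corev1.Pod) error {
	memberID, ok := memberIDByPodIP[pod.Status.PodIP]
	if !ok {
		return nil // this pod does not back a member of the pool
	}
	down := false
	// Disabling the member (admin_state_up=false) takes it out of rotation
	// so new connections are not forwarded to the dying pod.
	_, err := pools.UpdateMember(lb, poolID, memberID, pools.UpdateMemberOpts{
		AdminStateUp: &down,
	}).Extract()
	return err
}
```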
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.