hcloud-cloud-controller-manager
Readiness gate support for 100% ingress HA
Please add support for Readiness Gates, so that ingress-nginx (and other ingress controllers) receive injected information about whether Hetzner Cloud has already attached the target properly.
Right now, without that information, when I redeploy NGINX onto other nodes, Kubernetes creates the new Pods on different nodes faster than Hetzner reconfigures the load balancer, and ingress simply breaks for about half a minute until Hetzner catches up.
When you create a Service with type: LoadBalancer, this happens:
- Kubernetes assigns a NodePort for every port in the service
- HCCM creates a new Load Balancer
- For every Port in the Kubernetes Service, HCCM creates a new "Service" in the Load Balancer, listening on the port and forwarding the traffic to the Node Port
- For every Node in the Kubernetes Cluster, HCCM creates a new "Target" in the Load Balancer, pointing at the Node/Server
Once the traffic is sent to the Node Port on any Kubernetes node, Kubernetes/CNI is responsible for the traffic. Usually kube-proxy or your CNI then checks where Pods for the Service are running and forwards the traffic to one of the Pods; it does not matter on which Node the Pod is running. If you have Pods marked as Ready in your Service which are actually not ready to receive traffic, that would explain your problem.
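As a sketch, the steps above correspond to a Service like this (the name, selector, and ports are illustrative, not taken from the issue):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-ingress
spec:
  type: LoadBalancer
  selector:
    app: my-ingress
  ports:
    - name: http
      port: 80        # HCCM creates an LB "Service" listening on port 80
      targetPort: 8080
    - name: https
      port: 443       # ... and another LB "Service" listening on port 443
      targetPort: 8443
# Kubernetes assigns a NodePort for each entry; each LB "Service" forwards
# to that NodePort, and every Node in the cluster is added as a "Target".
```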
@apricote ok, I'll explain the problem in more detail, because you don't understand what I mean :D
So, here is the use case:
- 4 nodes in the cluster
- 1 ingress-nginx Pod running on one of the nodes
Status in the Hetzner Load Balancer is:
- 8 targets (ports 80+443), because there are 4 nodes as backends with 2 ports each
- 2 targets healthy, because ingress-nginx runs on 1 node with 2 ports
Now, the problematic scenario:
- Run kubectl -n ingress-nginx delete pod ingress-nginx-rsid-1234
- Kubernetes terminates the nginx Pod and immediately starts it on another node, where the Pod state is Running and Ready.
- The Hetzner LB still reports the old target as healthy (because health checks run at a longer interval). The website is down from this point.
- Opening the website throws an error: the backend cannot be reached because the LB still points to the old target.
- The Hetzner LB runs a few health checks over the targets and finally decides that the old target (on the old node) is no longer healthy and that the new target (on the new node) is now healthy.
- The website is up again after 30-60 seconds.
This is a real problem; the only workaround is to run ingress-nginx as a DaemonSet so Hetzner always points to all nodes. Although even then, recreating those Pods has the same problem.
A Readiness Gate can inject information about the external LB's state, so Kubernetes is aware that the new Pod (on the new node) is not ready yet (even though the application has started and is ready) until the external load balancer reports it as healthy.
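For illustration, a Readiness Gate is just an extra Pod condition that the kubelet waits for and that an external controller must set. The condition type below is hypothetical for HCCM (AWS uses target-health.elbv2.k8s.aws/<name>); this is only a sketch of what the requested feature could look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ingress-nginx-controller
spec:
  readinessGates:
    # Hypothetical condition type: the Pod would stay NotReady until a
    # controller sets this condition to "True", i.e. until the Hetzner
    # Load Balancer reports the corresponding target as healthy.
    - conditionType: load-balancer.hetzner.cloud/target-healthy
  containers:
    - name: controller
      image: registry.k8s.io/ingress-nginx/controller:v1.9.4
```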
Check how it works in AWS: https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.1/deploy/pod_readiness_gate/ especially the use case: "Rollout of new pods takes less time than it takes the AWS Load Balancer controller to register the new pods and for their health state turn »Healthy« in the target group"
2 targets healthy, because ingress-nginx runs on 1 node with 2 ports
How do you deploy ingress-nginx? Usually Kubernetes routes traffic arriving on these ports to whatever Node currently has a matching Pod, so I would expect all targets to be healthy.
Looking at the AWS docs, I think they need this because they route directly from the Load Balancer to the Internal IP of the Pod (through ENI), which is not something that Hetzner Cloud Load Balancers do, they always target your Nodes.
@apricote I'm deploying NGINX as a regular LoadBalancer object, hence it exposes ports 80, 443 (and 22 for GitLab in my case) as Node Ports.
Also, to be honest, I would prefer my current behaviour over "Kubernetes accepting traffic and routing it everywhere", because then I control where my traffic goes from the LB to the nodes (for example, I might want to limit LB-to-node traffic to a single region: I don't want traffic going from Falkenstein to Helsinki, because the latency there is a terrible 22ms!)
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/ingress-nginx-controller-76df64f546-h8bsj 1/1 Running 0 6d3h 10.244.4.218 kube-node-nbg1-cx31-1 <none> <none>
pod/ingress-nginx-controller-76df64f546-wcwnf 1/1 Running 0 6d3h 10.244.0.58 kube-master-1 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/cert-manager-webhook-hetzner ClusterIP 10.97.14.83 <none> 443/TCP 59d
service/ingress-nginx-controller LoadBalancer 10.98.229.172 kube-lb.example.com 80:31246/TCP,443:30411/TCP,22:31811/TCP 59d
service/ingress-nginx-controller-admission ClusterIP 10.103.212.168 <none> 443/TCP 59d
Helm Config for my ingress-nginx:
controller:
  service:
    externalTrafficPolicy: Local
    annotations:
      external-dns.alpha.kubernetes.io/hostname: "*.k8s.trng.me,*.example.com"
      load-balancer.hetzner.cloud/name: "kube"
      load-balancer.hetzner.cloud/location: fsn1
      load-balancer.hetzner.cloud/use-private-ip: "true"
      load-balancer.hetzner.cloud/uses-proxyprotocol: "true"
      # An external hostname NEEDS TO BE set here, otherwise
      # hcloud-cloud-controller-manager will put the real external IP of the
      # LB here, kube-proxy will route that traffic internally instead of to
      # the real LB, and traffic will break because non-proxied traffic goes
      # into nginx, which expects the PROXY protocol
      load-balancer.hetzner.cloud/hostname: kube-lb.example.com
  config:
    use-proxy-protocol: true
    use-forwarded-headers: "true"
    compute-full-forwarded-for: "true"
  extraArgs:
    default-ssl-certificate: "infra/tls-default"
  watchIngressWithoutClass: false
  ingressClassByName: false
  ingressClassResource:
    name: ohcs-nginx # default: nginx
    #enabled: true
    default: true
    controllerValue: "k8s.io/ohcs-nginx" # default: k8s.io/ingress-nginx
  replicaCount: 2
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
  minReadySeconds: 30
  resources:
    requests:
      cpu: 15m
      memory: 128Mi
    limits:
      #cpu: 1
      memory: 192Mi
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 10
          preference:
            matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
        - weight: 20
          preference:
            matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                  - arm64
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/component: controller
                app.kubernetes.io/instance: ingress-nginx
                app.kubernetes.io/name: ingress-nginx
            topologyKey: kubernetes.io/hostname
          weight: 50
tcp:
  22: "gitlab/gitlab-gitlab-shell:22"
Okay, externalTrafficPolicy: Local stops Kubernetes from forwarding requests to another server with a matching Pod; a node only answers requests if it has a matching Pod locally. That explains why only the targets on the node running ingress-nginx are healthy.
for example, I might want to limit LB-to-node traffic to a single region: I don't want traffic going from Falkenstein to Helsinki, because the latency there is a terrible 22ms!
You can do this with the load-balancer.hetzner.cloud/node-selector annotation (added in 1.18.0). There you can specify any Node label selector, for example topology.kubernetes.io/region=fsn1. Afterwards, only the nodes matching the label selector will be added to the Load Balancer.
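A minimal sketch of how that annotation could be applied (the Service name and selector are illustrative; only the annotation key and the region label come from the comment above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  annotations:
    # Only Nodes in region fsn1 are added as Load Balancer targets
    load-balancer.hetzner.cloud/node-selector: "topology.kubernetes.io/region=fsn1"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
```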
@apricote I know I can use a label, but... I don't want to, because I still want to be able to use Helsinki as a last resort in case the other regions fail ;) I think this makes sense.
Directing traffic from the LB straight to the right nodes just makes more sense than random balancing that Kubernetes then balances again. Latency is the answer to "but why"...
This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.