
Control Plane rolling update stall with EIP

Open jhead-slg opened this issue 5 years ago • 15 comments

cluster-api v0.3.7, capp v0.3.2, packet-ccm v1.1.0

I am hitting an issue with cluster-api's ability to roll the control plane nodes. This appears to be because of how we bind the EIP to the node via the lo:0 interface and how cluster-api tears down the node's etcd instance before fully draining the workloads running on the node.

To reproduce, bring up a one-node control plane with one worker node, then edit the KubeadmControlPlane's kubeadmConfigSpec, changing postKubeadmCommands to include an echo or other innocuous addition.

The new control plane node begins deployment and eventually comes into service. At some point, cluster-api kills etcd on the older control plane node, and the Packet CCM EIP health check moves the EIP to the new node. Once etcd goes away, the kube-apiserver panics and the node stalls: it cannot reach the EIP, which is still bound locally with no running kube-apiserver behind it.

After several minutes, on the new control plane you can see various pods stuck in the Terminating and/or Pending state. cluster-api will not progress past this point.

# k get -A pods -o wide | grep -v Running
NAMESPACE        NAME                                               READY   STATUS        RESTARTS   AGE   IP              NODE                             NOMINATED NODE   READINESS GATES
core             cert-manager-webhook-69c8965665-49cfh              1/1     Terminating   0          11h   240.0.18.144    k8s-game-cp-1d5ce5-6wnjj         <none>           <none>
kube-system      cilium-operator-7597b4574b-bg94f                   1/1     Terminating   0          11h   10.66.5.5       k8s-game-cp-1d5ce5-6wnjj         <none>           <none>
kube-system      cilium-operator-7597b4574b-nlbjw                   0/1     Pending       0          20m   <none>          <none>                           <none>           <none>
kube-system      cilium-sjtk8                                       0/1     Pending       0          28m   <none>          <none>                           <none>           <none>
kube-system      coredns-66bff467f8-jznm9                           1/1     Terminating   0          11h   240.0.18.145    k8s-game-cp-1d5ce5-6wnjj         <none>           <none>
kube-system      coredns-66bff467f8-s77cv                           0/1     Pending       0          20m   <none>          <none>                           <none>           <none>
topolvm-system   controller-7d85c6bbbc-8ps5q                        0/5     Pending       0          20m   <none>          <none>                           <none>           <none>
topolvm-system   controller-7d85c6bbbc-ppvvz                        5/5     Terminating   0          11h   240.0.18.12     k8s-game-cp-1d5ce5-6wnjj         <none>           <none>

To get things moving again you have to log into the old control plane node and run ip addr del <EIP>/32 dev lo. Once this is done, the local kubelet can talk to the API again, cluster-api evicts the pods, and the old node is deleted.
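For reference, a hedged sketch of that manual unblock as an idempotent helper (the guard and the function name are my own additions, not part of the original fix; run as root on the stuck node):

```shell
#!/usr/bin/env bash
# Idempotent version of the manual fix: only delete the /32 if it is
# actually bound to lo, so re-running is harmless.
remove_eip() {
    local eip=$1
    if ip addr show dev lo 2>/dev/null | grep -qF "${eip}/32"; then
        ip addr del "${eip}/32" dev lo && echo "removed ${eip} from lo"
    else
        echo "${eip} not bound to lo; nothing to do"
    fi
}

# Example with a placeholder address; substitute the cluster's real EIP.
remove_eip "192.0.2.10"
```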

I believe these issues may be related:

https://github.com/kubernetes-sigs/cluster-api/issues/2937
https://github.com/kubernetes-sigs/cluster-api/issues/2652

As a workaround, I created the following script, along with a systemd service, which gets installed on all control plane nodes. This setup allows the rolling update to complete without manual intervention.

Script:

#!/usr/bin/env bash

set -o errexit
set -o nounset
set -o pipefail

EIP=$1

while true; do
    rc=0
    # Probe the apiserver through the EIP; curl exits 7 on connection refused
    curl -fksS --retry 9 --retry-connrefused --retry-max-time 180 "https://${EIP}:6443/healthz" || rc=$?
    if [[ $rc -eq 7 ]]; then
        echo "removing EIP ${EIP}"
        ifdown lo:0
        ip addr del "${EIP}/32" dev lo || true
        break
    fi
    echo ""
    sleep $((RANDOM % 15))
done
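For clarity, the branch on rc keys off curl's exit codes: only exit code 7 ("failed to connect") indicates the apiserver behind the EIP is gone; other failures such as a timeout leave the EIP in place and the loop keeps polling. A minimal sketch of that decision (the function name is mine, not part of the script above):

```shell
#!/usr/bin/env bash
# Sketch of the exit-code decision used by the health-check loop.
# 7 = connection refused -> tear down the local EIP binding.
# Anything else (e.g. 28 = timeout, 0 = healthy) -> keep polling.
should_remove_eip() {
    local rc=$1
    if [[ $rc -eq 7 ]]; then
        echo "remove"
    else
        echo "keep"
    fi
}

should_remove_eip 7    # connection refused -> prints "remove"
should_remove_eip 28   # timed out -> prints "keep"
should_remove_eip 0    # healthy -> prints "keep"
```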

postKubeadmCommands addition:

        cat <<EOT > /etc/systemd/system/packet-eip-health.service
        [Unit]
        Description=Packet EIP health check
        Wants=kubelet.service
        After=kubelet.service

        [Service]
        Type=simple
        Restart=on-failure
        ExecStart=/usr/local/bin/packet-eip-health.sh {{ .controlPlaneEndpoint }}

        [Install]
        WantedBy=multi-user.target
        EOT

        systemctl daemon-reload
        systemctl enable packet-eip-health
        systemctl start packet-eip-health

jhead-slg avatar Jul 30 '20 00:07 jhead-slg

Let me see if I understand this. When I say node, I mean "control plane node".

  1. node A is in good state
  2. node B is brought up
  3. node A needs to be brought down
  4. node A apiserver goes down
  5. CCM sees node A apiserver is down, switches EIP to node B
  6. CAPI kills etcd on node A
  7. node A still has some processes that need to talk to etcd; they can no longer talk locally, so they try the load balancer EIP
  8. node A still has EIP configured locally, so it tries to reach etcd locally, fails

Is that correct?

deitch avatar Jul 30 '20 08:07 deitch

Yes, that is mostly correct. I believe step 6 happens right after step 3, which causes the API to die as well.

jhead-slg avatar Jul 30 '20 16:07 jhead-slg

So what really needs to happen is, once node A goes down (step 4), it needs the local IP routing removed. Correct?

deitch avatar Jul 30 '20 17:07 deitch

Correct.

jhead-slg avatar Jul 30 '20 17:07 jhead-slg

Thanks for the clarity. It would be nice not to have to deal with the IP locally at all. E.g. if the EIP were 100.10.10.10, and the node IPs were 100.10.10.20 and 100.10.10.30, then it would work perfectly. The problem is you need a real load balancer doing inbound NAT (changing the dst IP on the packet that hits the host) in front of it to get there, rather than lower-level network primitives (routers and switches).

BGP helps, but doesn't completely solve it. Same with EIP. FWIW, the Kubernetes kube-proxy also helps, as it sets up iptables rules, independent of the local routes. I wouldn't mind trying to leverage that, but kube-proxy is, essentially, global. All hosts have it, and the rules are the same across all of them.

CCM itself is a Deployment with replicas=1, so it cannot control the IP addr/routes/iptables on a different host, unless we deploy another DaemonSet.
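To illustrate the distinction being drawn here: an inbound-NAT load balancer (or kube-proxy-style rules) rewrites the destination address before routing, whereas the EIP approach has each node answer for the address on lo. A hedged sketch using the example addresses above (placeholders, not a working configuration; requires root):

```shell
# Hypothetical DNAT rule of the kind a real load balancer or kube-proxy
# would install: packets arriving for the EIP (100.10.10.10) are rewritten
# to a node IP (100.10.10.20), so no node needs the EIP bound locally.
iptables -t nat -A PREROUTING -d 100.10.10.10/32 -p tcp --dport 6443 \
    -j DNAT --to-destination 100.10.10.20:6443

# Contrast with the EIP approach, where the active control plane node
# answers for the address itself -- the binding this issue is about:
ip addr add 100.10.10.10/32 dev lo
```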

deitch avatar Jul 31 '20 08:07 deitch

Also, your fix works well when installing via CAPP (hence the issue on this repo), but the EIP is controlled via CCM, which needs to account for non-CAPP situations.

deitch avatar Jul 31 '20 08:07 deitch

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Oct 29 '20 09:10 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Nov 28 '20 09:11 fejta-bot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

fejta-bot avatar Dec 28 '20 10:12 fejta-bot

@fejta-bot: Closing this issue.


k8s-ci-robot avatar Dec 28 '20 10:12 k8s-ci-robot

/reopen

cprivitere avatar Mar 19 '24 13:03 cprivitere

/remove-lifecycle rotten

cprivitere avatar Mar 19 '24 13:03 cprivitere

@cprivitere: Reopened this issue.


k8s-ci-robot avatar Mar 19 '24 13:03 k8s-ci-robot

This should be tested with the latest CPEM to see if the DaemonSet changes resolve it.

cprivitere avatar Mar 19 '24 13:03 cprivitere

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 17 '24 14:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 17 '24 14:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Aug 16 '24 15:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


k8s-ci-robot avatar Aug 16 '24 15:08 k8s-ci-robot