cluster-api-provider-gcp

manually deleted worker VMs respawn but fail to properly bootstrap

Open jdef opened this issue 6 years ago • 7 comments

/kind bug

What steps did you take and what happened:

  1. spawn a remote Kubernetes cluster on GCP via the KIND bootstrap flow for this provider
  2. apply the MachineDeployment to create 2 workers
  3. manually delete 1 worker VM via the GCP console
  4. observe that the VM is respawned and the machine status is "provisioned", but a node object is never registered because bootstrapping fails
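The steps above can be sketched as follows; the machine name and zone are illustrative (the name is taken from the logs below, the zone is assumed):

```shell
# Hypothetical repro sketch for this report; machine name comes from the
# logs below, the zone (us-central1-a) is an assumption.
kubectl apply -f machinedeployment.yaml   # creates 2 worker Machines
kubectl get machines                      # note the worker VM names

# Delete one worker VM out-of-band, bypassing the apiserver:
gcloud compute instances delete test1-md-0-c9klg --zone us-central1-a

# The provider respawns the VM, but the replacement never joins:
kubectl get machines                      # status "provisioned", no nodeRef
kubectl get nodes                         # replacement node never appears
```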

What did you expect to happen:

bootstrapping to succeed and a node object to be registered for the new (replacement) machine

Anything else you would like to add:

  • logs from a machine that completed the initial bootstrap
Sep 20 19:25:33 test1-md-0-c9klg cloud-init[3321]: [preflight] Reading configuration from the cluster...
Sep 20 19:25:33 test1-md-0-c9klg cloud-init[3321]: [preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
Sep 20 19:25:34 test1-md-0-c9klg systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Sep 20 19:25:34 test1-md-0-c9klg cloud-init[3321]: [kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.15" ConfigMap in the kube-system namespace
Sep 20 19:25:34 test1-md-0-c9klg cloud-init[3321]: [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
Sep 20 19:25:34 test1-md-0-c9klg cloud-init[3321]: [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
Sep 20 19:25:34 test1-md-0-c9klg cloud-init[3321]: [kubelet-start] Activating the kubelet service
Sep 20 19:25:34 test1-md-0-c9klg systemd[1]: Reloading.
Sep 20 19:25:34 test1-md-0-c9klg systemd[1]: Started kubelet: The Kubernetes Node Agent.
Sep 20 19:25:34 test1-md-0-c9klg cloud-init[3321]: [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
  • logs from a replacement machine that failed to bootstrap
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: [preflight] Reading configuration from the cluster...
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: [preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: Cloud-init v. 18.3-52-gc5f78957-1~bddeb~18.04.1 running 'modules:final' at Fri, 20 Sep 2019 19:39:14 +0000. Up 58.44 seconds.
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: 2019-09-20 19:39:18,665 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/runcmd [1]
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: 2019-09-20 19:39:18,721 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: 2019-09-20 19:39:18,722 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
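The `Unauthorized` preflight failure above is consistent with the replacement VM reusing the original cloud-init bootstrap data after its kubeadm bootstrap token expired; this is a hedged guess, not confirmed in the thread. One way to check, run against the workload cluster:

```shell
# Hedged diagnostic: list kubeadm bootstrap tokens and their TTLs
# (run on a control-plane node). An expired or absent token would
# explain "failed to get config map: Unauthorized".
kubeadm token list

# Equivalently, bootstrap tokens are stored as secrets in kube-system:
kubectl -n kube-system get secrets \
  --field-selector type=bootstrap.kubernetes.io/token
```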

Environment:

  • Cluster-api version: v1alpha2
  • Minikube/KIND version: 0.5.x
  • Kubernetes version (use `kubectl version`): 1.15.3
  • OS (e.g. from /etc/os-release): mixed (centos (local), ubuntu (gcp))

jdef avatar Sep 20 '19 20:09 jdef

This is a design decision from Cluster API: Machines are treated as immutable. The bug here might be that we actually try to recreate the instance instead of fast failing.

vincepri avatar Sep 20 '19 21:09 vincepri

@vincepri what does "fast failing" mean in this context? label the machine as "gone" so that some periodic machine-gc-controller can remove it and make room for a new machine to spawn?

jdef avatar Sep 20 '19 21:09 jdef

@jdef Instead of recreating the instance, the controller should set the ErrorMessage and ErrorReason in the status for the instance. For example, here is how the AWS provider handles it: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/257c5215d85a73ccf9acce343f93f438ee510bd9/controllers/awsmachine_controller.go#L265-L269
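From the user's side, fast-failing would mean the problem surfaces in the Machine's status instead of a silent respawn. A sketch of how that would be observed, assuming the v1alpha2 Machine status fields `errorReason`/`errorMessage` and an illustrative machine name:

```shell
# With fast-fail behaviour, the Machine object would report the failure
# rather than respawning the VM. errorReason/errorMessage are the
# v1alpha2 Machine status fields; the machine name is illustrative.
kubectl get machine test1-md-0-c9klg \
  -o jsonpath='{.status.errorReason}{"\n"}{.status.errorMessage}{"\n"}'
```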

detiber avatar Sep 20 '19 21:09 detiber

NOTE: deleting the zombie/re-spawned machine via the apiserver did the trick - the machine was torn down, and a new machine came up that properly bootstrapped.
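The workaround above amounts to deleting the Machine object rather than the VM; the machine name below is taken from the failing logs and is illustrative:

```shell
# Delete the zombie Machine via the apiserver (not the VM directly);
# the owning MachineSet then creates a fresh Machine whose newly
# generated bootstrap data lets it join successfully.
kubectl delete machine test1-md-0-kkjxd
kubectl get machines -w   # watch the replacement come up and bootstrap
```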

jdef avatar Sep 23 '19 13:09 jdef

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 22 '19 13:12 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Jan 21 '20 14:01 fejta-bot

/lifecycle frozen /priority awaiting-more-evidence

vincepri avatar Jan 21 '20 17:01 vincepri