cluster-api-provider-gcp

manually deleted worker VMs respawn but fail to properly bootstrap

Open jdef opened this issue 6 years ago • 7 comments

/kind bug

What steps did you take and what happened:

  1. spawn a remote Kubernetes cluster on GCP via the KIND bootstrap flow for this provider
  2. apply the MachineDeployment to create 2 workers
  3. manually delete 1 worker VM via the GCP console
  4. observe that the VM is respawned and the machine status is "provisioned", but a node object is never registered because bootstrapping fails
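The steps above can be sketched as follows; the machine name and zone are illustrative (the name is taken from the logs below, the zone is assumed):

```shell
# Hypothetical repro sketch for this report; machine name comes from the
# logs below, the zone (us-central1-a) is an assumption.
kubectl apply -f machinedeployment.yaml   # creates 2 worker Machines
kubectl get machines                      # note the worker VM names

# Delete one worker VM out-of-band, bypassing the apiserver:
gcloud compute instances delete test1-md-0-c9klg --zone us-central1-a

# The provider respawns the VM, but the replacement never joins:
kubectl get machines                      # status "provisioned", no nodeRef
kubectl get nodes                         # replacement node never appears
```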

What did you expect to happen:

bootstrapping to succeed and a node object to be registered for the new (replacement) machine

Anything else you would like to add:

  • logs from a machine that completed the initial bootstrap
Sep 20 19:25:33 test1-md-0-c9klg cloud-init[3321]: [preflight] Reading configuration from the cluster...
Sep 20 19:25:33 test1-md-0-c9klg cloud-init[3321]: [preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
Sep 20 19:25:34 test1-md-0-c9klg systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Sep 20 19:25:34 test1-md-0-c9klg cloud-init[3321]: [kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.15" ConfigMap in the kube-system namespace
Sep 20 19:25:34 test1-md-0-c9klg cloud-init[3321]: [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
Sep 20 19:25:34 test1-md-0-c9klg cloud-init[3321]: [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
Sep 20 19:25:34 test1-md-0-c9klg cloud-init[3321]: [kubelet-start] Activating the kubelet service
Sep 20 19:25:34 test1-md-0-c9klg systemd[1]: Reloading.
Sep 20 19:25:34 test1-md-0-c9klg systemd[1]: Started kubelet: The Kubernetes Node Agent.
Sep 20 19:25:34 test1-md-0-c9klg cloud-init[3321]: [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
  • logs from a replacement machine that failed to bootstrap
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: [preflight] Reading configuration from the cluster...
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: [preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: Cloud-init v. 18.3-52-gc5f78957-1~bddeb~18.04.1 running 'modules:final' at Fri, 20 Sep 2019 19:39:14 +0000. Up 58.44 seconds.
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: 2019-09-20 19:39:18,665 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/runcmd [1]
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: 2019-09-20 19:39:18,721 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Sep 20 19:39:18 test1-md-0-kkjxd cloud-init[3298]: 2019-09-20 19:39:18,722 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
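The `Unauthorized` preflight failure above is consistent with the replacement VM reusing the original cloud-init bootstrap data after its kubeadm bootstrap token expired; this is a hedged guess, not confirmed in the thread. One way to check, run against the workload cluster:

```shell
# Hedged diagnostic: list kubeadm bootstrap tokens and their TTLs
# (run on a control-plane node). An expired or absent token would
# explain "failed to get config map: Unauthorized".
kubeadm token list

# Equivalently, bootstrap tokens are stored as secrets in kube-system:
kubectl -n kube-system get secrets \
  --field-selector type=bootstrap.kubernetes.io/token
```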

Environment:

  • Cluster-api version: v1alpha2
  • Minikube/KIND version: 0.5.x
  • Kubernetes version (use `kubectl version`): 1.15.3
  • OS (e.g. from /etc/os-release): mixed (centos (local), ubuntu (gcp))

jdef avatar Sep 20 '19 20:09 jdef

This is a design decision from Cluster API: Machines are treated as immutable. The bug here might be that we actually try to recreate the instance instead of fast failing.

vincepri avatar Sep 20 '19 21:09 vincepri

@vincepri what does "fast failing" mean in this context? label the machine as "gone" so that some periodic machine-gc-controller can remove it and make room for a new machine to spawn?

jdef avatar Sep 20 '19 21:09 jdef

@jdef Instead of recreating the instance, the controller should set the ErrorMessage and ErrorReason in the status for the instance. For example, here is how the AWS provider handles it: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/257c5215d85a73ccf9acce343f93f438ee510bd9/controllers/awsmachine_controller.go#L265-L269
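From the user's side, fast-failing would mean the problem surfaces in the Machine's status instead of a silent respawn. A sketch of how that would be observed, assuming the v1alpha2 Machine status fields `errorReason`/`errorMessage` and an illustrative machine name:

```shell
# With fast-fail behaviour, the Machine object would report the failure
# rather than respawning the VM. errorReason/errorMessage are the
# v1alpha2 Machine status fields; the machine name is illustrative.
kubectl get machine test1-md-0-c9klg \
  -o jsonpath='{.status.errorReason}{"\n"}{.status.errorMessage}{"\n"}'
```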

detiber avatar Sep 20 '19 21:09 detiber

NOTE: deleting the zombie/re-spawned machine via the apiserver did the trick - the machine was torn down, and a new machine came up that properly bootstrapped.
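The workaround above amounts to deleting the Machine object rather than the VM; the machine name below is taken from the failing logs and is illustrative:

```shell
# Delete the zombie Machine via the apiserver (not the VM directly);
# the owning MachineSet then creates a fresh Machine whose newly
# generated bootstrap data lets it join successfully.
kubectl delete machine test1-md-0-kkjxd
kubectl get machines -w   # watch the replacement come up and bootstrap
```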

jdef avatar Sep 23 '19 13:09 jdef

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 22 '19 13:12 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Jan 21 '20 14:01 fejta-bot

/lifecycle frozen /priority awaiting-more-evidence

vincepri avatar Jan 21 '20 17:01 vincepri