cluster-api-provider-vsphere

Delete bootstrap data after a VM is online

Open voor opened this issue 6 years ago • 18 comments

/kind feature

Describe the solution you'd like: After deploying a VM through Cluster API, I would like to empty out guestinfo.userdata, as it contains sensitive information such as private keys and the vSphere password.

Right now:

export VM=/Datacenter/vm/k8s/management-cluster-controlplane-0
govc vm.info -json ${VM} | jq -r '.VirtualMachines[].Config.ExtraConfig[] | select(.Key == "guestinfo.userdata").Value' | base64 --decode
# Secrets and stuff

Temporary fix:

govc vm.change -vm ${VM} -e guestinfo.userdata=""

Anything else you would like to add:

Environment:

  • Cluster-api-provider-vsphere version: v0.5.1
  • Kubernetes version: (use kubectl version): 1.15.3
  • OS (e.g. from /etc/os-release): Ubuntu

voor, Sep 21 '19 00:09

/assign @akutz

andrewsykim, Sep 23 '19 20:09

Related to kubernetes-sigs/cluster-api#1739

akutz, Dec 08 '19 02:12

A quick solution is to add the following command to cloud-init before the kubeadm command:

	// Clear guestinfo.userdata after cloud-init has read it, to prevent leaking secrets.
	// There are 2 spaces at the end of this vmtoolsd command: one separates the key from
	// the value, and the other is the value itself. One space or no space doesn't work.
	// vmtoolsd does not support removing a key under guestinfo, so we set the value to a single space character.
	const cmdClearUserData = `vmtoolsd --cmd 'info-set guestinfo.userdata  '`
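The two-space detail in the comments above follows from the layout of an `info-set` request: the key, a separating space, then the value. This minimal Go sketch of that layout shows that a single-space value produces exactly two trailing spaces. `infoSetRequest` is a hypothetical helper, not part of CAPV.

```go
package main

import "fmt"

// infoSetRequest builds the RPC string passed to `vmtoolsd --cmd` or
// vmware-rpctool: "info-set <key> <value>". Hypothetical helper.
func infoSetRequest(key, value string) string {
	return "info-set " + key + " " + value
}

func main() {
	// vmtoolsd cannot remove a guestinfo key, so a single-space value is
	// written instead; the request therefore ends in two spaces.
	req := infoSetRequest("guestinfo.userdata", " ")
	fmt.Printf("%q\n", req) // "info-set guestinfo.userdata  "
}
```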

jessehu, Dec 08 '19 04:12

Hi @jessehu,

That's a really good suggestion, but the command would have to be placed in the generated KubeadmConfig resource ahead of time. Plus, we're trying to move away from vmtoolsd in favor of vmware-rpctool, as the former seems to have performance issues, especially on Photon (see https://github.com/vmware/cloud-init-vmware-guestinfo/pull/23 for more information).

My current thought is to remove the data via a reconfigure call after the machine is online.

Thoughts?

akutz, Dec 08 '19 04:12

Thanks @akutz for the heads-up. Makes sense. We will need to find the earliest point at which the CAPV controller can safely delete the userdata, because the sooner, the safer.

jessehu, Dec 08 '19 04:12

Absolutely. My other thought is to update the datasource so that I can pass in a flag via the metadata, which CAPV builds at runtime. If the flag is set, the datasource will automatically insert the vmware-rpctool command that clears guestinfo.userdata at the start of the user data's runcmd list.

For example, imagine the following metadata:

cleanup-userdata: true
instance-id: iid-capi
local-hostname: capi.vm
network:
  version: 2
  ethernets:
    nics:
      match:
        name: ens*
      dhcp4: yes
      dhcp6: yes

Setting the flag guestinfo.metadata.cleanup-userdata to true should cause the following command to be inserted as the first element of the user data's runcmd list:

runcmd:
  - "vmware-rpctool \"info-set guestinfo.userdata  \""
  - "hostname \"{{ ds.meta_data.hostname }}\""
  - "echo \"::1         ipv6-localhost ipv6-loopback\" >/etc/hosts"
  - "echo \"127.0.0.1   localhost {{ ds.meta_data.hostname }}\" >>/etc/hosts"
  - "echo \"{{ ds.meta_data.hostname }}\" >/etc/hostname"
  - 'kubeadm init --config /tmp/kubeadm.yaml'
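The insertion rule described above can be sketched in Go as a pure function over the runcmd list. This sketch assumes runcmd is handled as a list of strings. `withUserdataCleanup` and `clearUserdataCmd` are hypothetical names, and the actual datasource change may work differently.

```go
package main

import "fmt"

// clearUserdataCmd is the first runcmd entry from the example above. The
// escaped quotes there are YAML escaping; the Go string holds the plain command.
const clearUserdataCmd = `vmware-rpctool "info-set guestinfo.userdata  "`

// withUserdataCleanup returns runcmd with the cleanup command as its first
// element when cleanup is true. The input slice is never modified.
func withUserdataCleanup(runcmd []string, cleanup bool) []string {
	if !cleanup {
		return runcmd
	}
	out := make([]string, 0, len(runcmd)+1)
	out = append(out, clearUserdataCmd)
	return append(out, runcmd...)
}

func main() {
	cmds := []string{"kubeadm init --config /tmp/kubeadm.yaml"}
	for _, c := range withUserdataCleanup(cmds, true) {
		fmt.Println(c)
	}
}
```

Putting the command first clears guestinfo before any later step, such as kubeadm, has a chance to fail and stop the list.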

I've verified that the above command correctly clears the user data. Below, I use sh -c to demonstrate this so that I can reproduce the escaped " characters exactly as they appear above:

$ sh -c "vmware-rpctool \"info-set guestinfo.userdata hi\""
$ sh -c "vmware-rpctool \"info-get guestinfo.userdata\"" 2>&1 | grep -v '^[[:space:]]\{1,\}$' || echo empty
hi
$ sh -c "vmware-rpctool \"info-set guestinfo.userdata  \""
$ sh -c "vmware-rpctool \"info-get guestinfo.userdata\"" 2>&1 | grep -v '^[[:space:]]\{1,\}$' || echo empty
empty
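The `grep -v '^[[:space:]]\{1,\}$' || echo empty` filter treats a whitespace-only value as empty. The Go sketch below performs roughly the same check. It could serve as a hypothetical helper for a test that confirms the key was cleared. Unlike the grep filter, it also treats a value made only of empty lines as blank.

```go
package main

import (
	"fmt"
	"strings"
)

// isCleared reports whether an info-get result contains only whitespace.
// It approximates the grep filter above; unlike grep, it also treats a
// value consisting of empty lines as blank.
func isCleared(infoGetOutput string) bool {
	return strings.TrimSpace(infoGetOutput) == ""
}

func main() {
	fmt.Println(isCleared("hi\n")) // false
	fmt.Println(isCleared("  \n")) // true
}
```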

What do you think?

akutz, Dec 08 '19 04:12

Hi @jessehu,

It's worth noting that all of the options we've discussed so far still leave on disk the user data file(s) that cloud-init writes on first boot, /var/lib/cloud/instance/user-data.txt*.

akutz, Dec 08 '19 04:12

Hi @jessehu,

Perhaps we should augment the datasource metadata with a more in-depth user data cleanup option? For example:

cleanup-userdata:
  guestinfo:  true|false
  filesystem: true|false

The problem is that I am not aware of the consequences that might occur, possible side-effects, of removing the user data file(s) cloud-init creates on the local filesystem under /var/lib/cloud. @detiber, I know removing /var/lib/cloud is part of getting cloud-init to run again as if a fresh system, but do you know what happens if we just remove the user data files and leave the rest?

akutz, Dec 08 '19 04:12

Hi @jessehu,

Please take a look at https://github.com/vmware/cloud-init-vmware-guestinfo/pull/25. It should account for this. As soon as we can merge it and build new images, we will be able to leverage the feature to remove the userdata as early as possible.

akutz, Dec 08 '19 06:12

Thanks @akutz. I left some minor comments on https://github.com/vmware/cloud-init-vmware-guestinfo/pull/25; please take a look.

jessehu, Dec 08 '19 07:12

Good news, I tested https://github.com/vmware/cloud-init-vmware-guestinfo/pull/25 with a newly built CentOS image, and it works!

2019-12-09 00:48:28,351 - DataSourceVMwareGuestInfo.py[INFO]: clearing guestinfo.userdata
2019-12-09 00:48:28,351 - DataSourceVMwareGuestInfo.py[DEBUG]: Setting guestinfo key=userdata to value=---
2019-12-09 00:48:28,351 - util.py[DEBUG]: Running command ['/usr/bin/vmware-rpctool', 'info-set guestinfo.userdata ---'] with allowed return codes [0] (shell=False, capture=True)
2019-12-09 00:48:28,364 - DataSourceVMwareGuestInfo.py[INFO]: clearing guestinfo.userdata.encoding
2019-12-09 00:48:28,364 - DataSourceVMwareGuestInfo.py[DEBUG]: Setting guestinfo key=userdata.encoding to value= 
2019-12-09 00:48:28,364 - util.py[DEBUG]: Running command ['/usr/bin/vmware-rpctool', 'info-set guestinfo.userdata.encoding  '] with allowed return codes [0] (shell=False, capture=True)
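These log lines show the datasource overwriting guestinfo.userdata with `---` and guestinfo.userdata.encoding with a single space. `---` is a YAML document marker, presumably chosen so the value stays non-empty yet parses as an empty document. The datasource invokes vmware-rpctool directly rather than through a shell. The Go sketch below builds the same argument vectors with os/exec without running them. `clearCommands` is hypothetical, and the binary path is copied from the log.

```go
package main

import (
	"fmt"
	"os/exec"
)

// rpctool is the binary path as it appears in the log above.
const rpctool = "/usr/bin/vmware-rpctool"

// clearCommands returns the two commands seen in the log. Each RPC request is
// a single argv element, so no shell quoting is needed (shell=False in the log).
func clearCommands() []*exec.Cmd {
	return []*exec.Cmd{
		exec.Command(rpctool, "info-set guestinfo.userdata ---"),
		exec.Command(rpctool, "info-set guestinfo.userdata.encoding  "),
	}
}

func main() {
	for _, c := range clearCommands() {
		// On a VMware guest, c.Run() would apply the change; here we only print it.
		fmt.Printf("%q\n", c.Args)
	}
}
```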

And here is the guestinfo.userdata field read directly:

$ vmware-rpctool "info-get guestinfo.userdata"
---

And Kubernetes was initialized successfully (I booted the machine image without CAPV, so the add-ons were not configured, hence the pending CoreDNS pods):

$ sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get all -A
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-5644d7b6d9-4l6bz          0/1     Pending   0          4m16s
kube-system   pod/coredns-5644d7b6d9-d45g5          0/1     Pending   0          4m16s
kube-system   pod/etcd-capi.vm                      1/1     Running   0          3m37s
kube-system   pod/kube-apiserver-capi.vm            1/1     Running   0          3m16s
kube-system   pod/kube-controller-manager-capi.vm   1/1     Running   0          3m37s
kube-system   pod/kube-proxy-7thsf                  1/1     Running   0          4m16s
kube-system   pod/kube-scheduler-capi.vm            1/1     Running   0          3m24s

NAMESPACE     NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   100.64.0.1    <none>        443/TCP                  4m32s
kube-system   service/kube-dns     ClusterIP   100.64.0.10   <none>        53/UDP,53/TCP,9153/TCP   4m31s

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
kube-system   daemonset.apps/kube-proxy   1         1         1       1            1           beta.kubernetes.io/os=linux   4m31s

NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           4m31s

NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5644d7b6d9   2         2         0       4m16s

akutz, Dec 09 '19 00:12

What was the outcome on deleting the files from the filesystem?

voor, Dec 09 '19 14:12

> The problem is that I am not aware of the consequences that might occur, possible side-effects, of removing the user data file(s) cloud-init creates on the local filesystem under /var/lib/cloud. @detiber, I know removing /var/lib/cloud is part of getting cloud-init to run again as if a fresh system, but do you know what happens if we just remove the user data files and leave the rest?

I wouldn't expect any issues as long as this is done at the end of the final stage. There shouldn't be anything else making use of the userdata on the local system at that point.

detiber, Dec 09 '19 15:12

/assign @yastij

akutz, Jan 10 '20 18:01

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot, Apr 09 '20 19:04

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot, May 09 '20 19:05

/remove-lifecycle rotten

voor, May 09 '20 20:05

/lifecycle frozen

randomvariable, May 11 '20 13:05