cluster-api-provider-vsphere
Delete bootstrap data after a VM is online
/kind feature
Describe the solution you'd like
After deploying a VM through Cluster API, I would like to empty out guestinfo.userdata, as it contains sensitive information such as private keys, the vSphere password, etc.
Right now:
export VM=/Datacenter/vm/k8s/management-cluster-controlplane-0
govc vm.info -json ${VM} | jq -r '.VirtualMachines[].Config.ExtraConfig[] | select(.Key == "guestinfo.userdata").Value' | base64 --decode
# Secrets and stuff
Temporary fix:
govc vm.change -vm ${VM} -e guestinfo.userdata=""
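With that in place, the same query from above should come back empty; a quick sanity check (assuming the same VM path, and that vSphere keeps the key with an empty value rather than dropping it entirely):
govc vm.info -json ${VM} | jq -r '.VirtualMachines[].Config.ExtraConfig[] | select(.Key == "guestinfo.userdata").Value'
# prints an empty string (or nothing at all if the key was removed)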
Anything else you would like to add:
Environment:
- Cluster-api-provider-vsphere version: v0.5.1
- Kubernetes version (use kubectl version): 1.15.3
- OS (e.g. from /etc/os-release): Ubuntu
/assign @akutz
Related to kubernetes-sigs/cluster-api#1739
A quick solution is adding the following command into cloud-init before the kubeadm command:
// Clear guestinfo.userdata after it's read by cloud-init to prevent security leaks.
// There are 2 spaces at the end of this vmtoolsd command; 1 space or no space doesn't work.
// vmtoolsd does not support removing a key under guestinfo, so we set the value to a single space character.
cmdClearUserData = `vmtoolsd --cmd 'info-set guestinfo.userdata  '`
Hi @jessehu,
That's a really good suggestion, but it would need to be placed in the generated KubeadmConfig resource ahead of time. Plus, we're trying to move away from vmtoolsd in favor of vmware-rpctool, as the former seems to have performance issues, especially on Photon (see https://github.com/vmware/cloud-init-vmware-guestinfo/pull/23 for more information).
My current thought is to remove the data via a reconfigure call after the machine is online.
Thoughts?
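To sketch what that looks like from the operator's side (in CAPV itself this would be a reconfigure issued by the controller; the commands below are only an approximation using govc, reusing the VM path from earlier):
# Block until the guest reports an IP address, i.e. the machine is online...
govc vm.ip ${VM} >/dev/null
# ...then clear both the userdata and its encoding hint via a reconfigure.
govc vm.change -vm ${VM} -e guestinfo.userdata="" -e guestinfo.userdata.encoding=""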
Thanks @akutz for the heads up. Makes sense. We will need to find out the earliest point at which the CAPV controller can safely delete the userdata, because the sooner it is removed, the safer.
Absolutely. My other thought is to actually update the datasource so I can pass in a flag via the metadata, which is built at runtime in CAPV. If the flag is set, the datasource will automatically insert the vmware-rpctool command into the user data's runcmd list to delete guestinfo.userdata.
For example, imagine the following metadata:
cleanup-userdata: true
instance-id: iid-capi
local-hostname: capi.vm
network:
  version: 2
  ethernets:
    nics:
      match:
        name: ens*
      dhcp4: yes
      dhcp6: yes
The flag guestinfo.metadata.cleanup-userdata set to true should cause the following command to be inserted as the first element in the user data's runcmd list:
runcmd:
- "vmware-rpctool \"info-set guestinfo.userdata \""
- "hostname \"{{ ds.meta_data.hostname }}\""
- "echo \"::1 ipv6-localhost ipv6-loopback\" >/etc/hosts"
- "echo \"127.0.0.1 localhost {{ ds.meta_data.hostname }}\" >>/etc/hosts"
- "echo \"{{ ds.meta_data.hostname }}\" >/etc/hostname"
- 'kubeadm init --config /tmp/kubeadm.yaml'
I've verified that the above command correctly clears the user data. I'm using sh below to illustrate that this works, so I can reproduce the escaped " characters exactly as they appear above:
$ sh -c "vmware-rpctool \"info-set guestinfo.userdata hi\""
$ sh -c "vmware-rpctool \"info-get guestinfo.userdata\"" 2>&1 | grep -v '^[[:space:]]\{1,\}$' || echo empty
hi
$ sh -c "vmware-rpctool \"info-set guestinfo.userdata \""
$ sh -c "vmware-rpctool \"info-get guestinfo.userdata\"" 2>&1 | grep -v '^[[:space:]]\{1,\}$' || echo empty
empty
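As an aside, metadata like the example above reaches the datasource through the guestinfo.metadata key; a rough sketch of pushing it in by hand with govc (the gzip+base64 encoding and the metadata.yaml filename are assumptions here, not something CAPV requires):
govc vm.change -vm ${VM} \
  -e guestinfo.metadata="$(gzip -c9 metadata.yaml | base64 -w0)" \
  -e guestinfo.metadata.encoding="gzip+base64"
# base64 -w0 is GNU base64; on macOS use base64 without the -w flag.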
What do you think?
Hi @jessehu,
It's worth noting that all of the options we've discussed thus far will still leave behind the user data file(s) that cloud-init writes to disk on first boot, /var/lib/cloud/instance/user-data.txt*.
Hi @jessehu,
Perhaps we should augment the datasource metadata with a more in-depth user data cleanup option? For example:
cleanup-userdata:
  guestinfo: true|false
  filesystem: true|false
The problem is that I am not aware of the possible side-effects of removing the user data file(s) that cloud-init creates on the local filesystem under /var/lib/cloud. @detiber, I know removing /var/lib/cloud is part of getting cloud-init to run again as if on a fresh system, but do you know what happens if we just remove the user data files and leave the rest?
Hi @jessehu,
Please take a look at https://github.com/vmware/cloud-init-vmware-guestinfo/pull/25. It should account for this. As soon as we can merge it and build new images, we will be able to leverage the feature to remove the userdata as early as possible.
Thanks @akutz. I added some minor comments on https://github.com/vmware/cloud-init-vmware-guestinfo/pull/25; please take a look.
Good news, I tested https://github.com/vmware/cloud-init-vmware-guestinfo/pull/25 with a newly built CentOS image, and it works!
2019-12-09 00:48:28,351 - DataSourceVMwareGuestInfo.py[INFO]: clearing guestinfo.userdata
2019-12-09 00:48:28,351 - DataSourceVMwareGuestInfo.py[DEBUG]: Setting guestinfo key=userdata to value=---
2019-12-09 00:48:28,351 - util.py[DEBUG]: Running command ['/usr/bin/vmware-rpctool', 'info-set guestinfo.userdata ---'] with allowed return codes [0] (shell=False, capture=True)
2019-12-09 00:48:28,364 - DataSourceVMwareGuestInfo.py[INFO]: clearing guestinfo.userdata.encoding
2019-12-09 00:48:28,364 - DataSourceVMwareGuestInfo.py[DEBUG]: Setting guestinfo key=userdata.encoding to value=
2019-12-09 00:48:28,364 - util.py[DEBUG]: Running command ['/usr/bin/vmware-rpctool', 'info-set guestinfo.userdata.encoding '] with allowed return codes [0] (shell=False, capture=True)
And here I go reading the guestinfo.userdata field directly:
$ vmware-rpctool "info-get guestinfo.userdata"
---
And Kubernetes was initialized successfully (I booted the machine image without CAPV, so the add-ons were not configured, hence the pending CoreDNS pods):
$ sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get all -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-5644d7b6d9-4l6bz 0/1 Pending 0 4m16s
kube-system pod/coredns-5644d7b6d9-d45g5 0/1 Pending 0 4m16s
kube-system pod/etcd-capi.vm 1/1 Running 0 3m37s
kube-system pod/kube-apiserver-capi.vm 1/1 Running 0 3m16s
kube-system pod/kube-controller-manager-capi.vm 1/1 Running 0 3m37s
kube-system pod/kube-proxy-7thsf 1/1 Running 0 4m16s
kube-system pod/kube-scheduler-capi.vm 1/1 Running 0 3m24s
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 100.64.0.1 <none> 443/TCP 4m32s
kube-system service/kube-dns ClusterIP 100.64.0.10 <none> 53/UDP,53/TCP,9153/TCP 4m31s
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/kube-proxy 1 1 1 1 1 beta.kubernetes.io/os=linux 4m31s
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 0/2 2 0 4m31s
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5644d7b6d9 2 2 0 4m16s
What was the end result of deleting the files off the file system?
The problem is that I am not aware of the possible side-effects of removing the user data file(s) that cloud-init creates on the local filesystem under /var/lib/cloud. @detiber, I know removing /var/lib/cloud is part of getting cloud-init to run again as if on a fresh system, but do you know what happens if we just remove the user data files and leave the rest?
I wouldn't expect any issues as long as this is done at the end of the final stage. There shouldn't be anything else making use of the userdata on the local system at that point.
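For reference, the on-guest cleanup being discussed would amount to something like the following, run at the end of cloud-init's final stage (a sketch; the paths assume cloud-init's default layout under /var/lib/cloud):
# Remove the rendered user data copies, leaving the rest of the instance state intact.
sudo rm -f /var/lib/cloud/instance/user-data.txt*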
/assign @yastij
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle frozen