cluster-api-provider-aws
:bug: Write sensitive cloud-init user-data into /etc/cloud/cloud.cfg.d
What type of PR is this? /kind bug
What this PR does / why we need it:
The boothook fetches sensitive user-data from an AWS service (Secrets Manager or SSM Parameter Store). This PR changes the mechanism by which that user-data is passed to cloud-init once it has been fetched.
Previously, the boothook wrote the sensitive user-data to /etc/secret-userdata.txt, and cloud-init read it via an #include directive. Now, the boothook writes it to /etc/cloud/cloud.cfg.d/99_kubeadm_bootstrap.cfg. The directory is a well-documented configuration source used by cloud-init, and exists wherever cloud-init is installed. The file is given the prefix 99_ to give it high priority over other configuration in that directory.
Previously, cloud-init read sensitive user-data from /etc/secret-userdata.txt via an #include directive. Now, it reads the sensitive user-data simply because it is located in the /etc/cloud/cloud.cfg.d directory. Therefore, the #include directive is no longer used, and is removed.
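To make the new mechanism concrete, here is a minimal sketch of the write step in Python. It is illustrative only and is not the boothook code from this PR: the function names are hypothetical, and the owner-only file mode is an assumption that the description above does not state. The `load_order` helper reflects my understanding that cloud-init reads the `*.cfg` files in this directory in lexical order, with later files taking precedence, which is why the `99_` prefix gives the file high priority.

```python
import os

# Path and file name are taken from the PR description.
CFG_DIR = "/etc/cloud/cloud.cfg.d"
CFG_NAME = "99_kubeadm_bootstrap.cfg"


def write_bootstrap_config(user_data, cfg_dir=CFG_DIR):
    """Write fetched user-data (bytes) where cloud-init reads configuration.

    Hypothetical helper: the real boothook may do this differently.
    """
    path = os.path.join(cfg_dir, CFG_NAME)
    # Assumption: restrict the file to its owner, since it holds secrets.
    # Creating it with mode 0600 avoids a window where it is world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(user_data)
    # os.open applies the mode only on creation, so enforce it explicitly
    # in case the file already existed with looser permissions.
    os.chmod(path, 0o600)
    return path


def load_order(cfg_dir=CFG_DIR):
    """Return the *.cfg file names in lexical order (assumed read order)."""
    return sorted(n for n in os.listdir(cfg_dir) if n.endswith(".cfg"))
```

With this layout no `#include` directive is needed: the file is picked up because of where it is located, and it appears last in `load_order()` as long as no other file in the directory sorts after `99_kubeadm_bootstrap.cfg`.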
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4745
Special notes for your reviewer:
If we merge this PR, we can revert the workaround introduced in https://github.com/kubernetes-sigs/image-builder/pull/406.
Checklist:
- [x] squashed commits
- [x] includes documentation
- [x] includes emojis
- [ ] adds unit tests
- [ ] adds or updates e2e tests
Release note:
Changes the mechanism to pass sensitive user-data to cloud-init, making CAPA compatible with cloud-init v23.3 and newer.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from dlipovetsky. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
This change must be validated e2e. I've already tested it using my own AWS account, so I'm confident it will pass e2e.
/test pull-cluster-api-provider-aws-e2e
/cc @randomvariable You know this area. Tagging you, in case you have questions/concerns about this change.
I'd like to backport this to supported release branches, too.
/milestone v2.4.0
/lgtm
/retest
This makes sense, but I think further down the line we might want to rethink restarting the cloud-init process.
/lgtm
/retest
/test pull-cluster-api-provider-aws-e2e
@dlipovetsky looks like the E2E tests need to be fixed
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-apidiff-main
Last e2e failure was due to reaching EventBridge resource quota. From the manager log:
E0204 12:33:36.396548 1 awscluster_controller.go:309] "non-fatal: failed to set up EventBridge" err="unable to create rule: LimitExceededException: The requested resource exceeds the maximum number allowed." controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="functional-test-multi-az-nacdlz/functional-test-multi-az-k1r9x8" namespace="functional-test-multi-az-nacdlz" name="functional-test-multi-az-k1r9x8" reconcileID="83e72c26-318f-4b1e-8575-eaddab4426f4" cluster="functional-test-multi-az-nacdlz/functional-test-multi-az-k1r9x8"
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e
Checking if the fix to https://github.com/kubernetes/k8s.io/issues/6381 has an effect.
/retest
It doesn't look like this failure was caused by the EventBridge issue.
E0207 17:24:40.182327 1 controller.go:329] "Reconciler error" err="failed to retrieve bootstrap data secret machine-pool-fiyovf-mp-0 for AWSMachinePool machine-pool-oevgy9/machine-pool-fiyovf-mp-0: Secret \"machine-pool-fiyovf-mp-0\" not found" controller="awsmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachinePool" AWSMachinePool="machine-pool-oevgy9/machine-pool-fiyovf-mp-0" namespace="machine-pool-oevgy9" name="machine-pool-fiyovf-mp-0" reconcileID="3ef4b571-b25a-447c-b039-3e088809eb3a"
/retest
/retest
/test pull-cluster-api-provider-aws-e2e
I'm going to rebase on main, in case there have been some changes that affect e2e.
New changes are detected. LGTM label has been removed.
> This makes sense, but I think further down the line we might want to rethink restarting the cloud-init process.
This might be possible if we implement our own Part Handler that calls the Secrets or SSM service.
/test pull-cluster-api-provider-aws-e2e
/retest
I still have no idea why the same 7 tests consistently fail.
/test pull-cluster-api-provider-aws-build-docker
@dlipovetsky maybe you could try rebasing the PR and then run the E2E tests?