cluster-api-provider-aws :bug: Write sensitive cloud-init user-data into /etc/cloud/cloud.cfg.d

trafficstars

What type of PR is this? /kind bug

What this PR does / why we need it:

The boothook fetches sensitive user-data from an AWS service (Secrets Manager, or SSM Parameter Store). This PR changes the mechanism by the way this user-data is passed to cloud-init once it's fetched.

Previously, the boothook wrote the sensitive user-data to /etc/secret-userdata.txt, and cloud-init read it via an #include directive. Now, the boothook writes it to /etc/cloud/cloud.cfg.d/99_kubeadm_bootstrap.cfg. The directory is a well-documented configuration source used by cloud-init, and exists wherever cloud-init is installed. The file is given the prefix 99_ to give it high priority over other configuration in that directory.

Previously, cloud-init read sensitive user-data from /etc/secret-userdata.txt via an #include directive. Now, it reads the sensitive user-data simply because it is located in the /etc/cloud/cloud.cfg.d directory. Therefore, the #include directive is no longer used, and is removed.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #4745

Special notes for your reviewer:

If we merge this PR, we can revert the workaround introduced in https://github.com/kubernetes-sigs/image-builder/pull/406.

Checklist:

[x] squashed commits
[x] includes documentation
[x] includes emojis
[ ] adds unit tests
[ ] adds or updates e2e tests

Release note:

Changes the mechanism to pass sensitive user-data to cloud-init, making CAPA compatible with cloud-init v23.3 and newer.

Jan 18 '24 22:01 dlipovetsky

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from dlipovetsky. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Jan 18 '24 22:01 k8s-ci-robot

This change must be validated e2e. I've already tested it using my own AWS account, so I'm confident it will pass e2e.

/test pull-cluster-api-provider-aws-e2e

Jan 18 '24 23:01 dlipovetsky

/cc @randomvariable You know this area. Tagging you, in case you have questions/concerns about this change.

Jan 19 '24 02:01 dlipovetsky

I'd like to backport this to supported release branches, too.

Jan 19 '24 18:01 dlipovetsky

/milestone v2.4.0

Jan 23 '24 14:01 richardcase

/lgtm

Jan 24 '24 07:01 AndiDog

/retest

Jan 24 '24 14:01 faiq

This makes sense, but I think further down the line we might want to rethink restarting the cloud-init process.

Jan 24 '24 14:01 faiq

/lgtm

Jan 24 '24 14:01 faiq

/retest

Jan 25 '24 19:01 dlipovetsky

/test pull-cluster-api-provider-aws-e2e

Jan 31 '24 10:01 Ankitasw

@dlipovetsky looks like E2E tests needs to be fixed

Feb 01 '24 06:02 Ankitasw

/test pull-cluster-api-provider-aws-e2e

Feb 04 '24 12:02 richardcase

/test pull-cluster-api-provider-aws-apidiff-main

Feb 04 '24 12:02 richardcase

Last e2e failure was due to reaching EventBridge resource quota. From the manager log:

E0204 12:33:36.396548       1 awscluster_controller.go:309] "non-fatal: failed to set up EventBridge" err="unable to create rule: LimitExceededException: The requested resource exceeds the maximum number allowed." controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="functional-test-multi-az-nacdlz/functional-test-multi-az-k1r9x8" namespace="functional-test-multi-az-nacdlz" name="functional-test-multi-az-k1r9x8" reconcileID="83e72c26-318f-4b1e-8575-eaddab4426f4" cluster="functional-test-multi-az-nacdlz/functional-test-multi-az-k1r9x8"

Feb 05 '24 20:02 dlipovetsky

/test pull-cluster-api-provider-aws-e2e

Feb 06 '24 13:02 richardcase

/test pull-cluster-api-provider-aws-e2e

Feb 06 '24 20:02 nrb

Checking if the fix to https://github.com/kubernetes/k8s.io/issues/6381 has an effect.

/retest

Feb 07 '24 16:02 dlipovetsky

Doesn't look like this failure was from the event bridge issue.

E0207 17:24:40.182327 1 controller.go:329] "Reconciler error" err="failed to retrieve bootstrap data secret machine-pool-fiyovf-mp-0 for AWSMachinePool machine-pool-oevgy9/machine-pool-fiyovf-mp-0: Secret \"machine-pool-fiyovf-mp-0\" not found" controller="awsmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachinePool" AWSMachinePool="machine-pool-oevgy9/machine-pool-fiyovf-mp-0" namespace="machine-pool-oevgy9" name="machine-pool-fiyovf-mp-0" reconcileID="3ef4b571-b25a-447c-b039-3e088809eb3a"

Feb 07 '24 18:02 nrb

/retest

Feb 10 '24 00:02 dlipovetsky

/retest

Feb 12 '24 20:02 dlipovetsky

/test pull-cluster-api-provider-aws-e2e

Feb 14 '24 15:02 richardcase

I'm going to rebase on main, in case there have been some changes that affect e2e.

Feb 15 '24 17:02 dlipovetsky

New changes are detected. LGTM label has been removed.

Feb 15 '24 17:02 k8s-ci-robot

This makes sense, but I think further down the line we might want to rethink restarting the cloud-init process.

This might be possible if we implement our own Part Handler that calls the Secrets or SSM service.

Feb 15 '24 17:02 dlipovetsky

/test pull-cluster-api-provider-aws-e2e

Feb 15 '24 17:02 dlipovetsky

/retest

Feb 20 '24 16:02 nrb

I still have no idea why the same 7 tests consistently fail.

Feb 21 '24 00:02 dlipovetsky

/test pull-cluster-api-provider-aws-build-docker

Mar 04 '24 17:03 nrb

@dlipovetsky maybe you could try rebasing the PR and then run the E2E tests?

Mar 07 '24 07:03 Ankitasw

cluster-api-provider-aws cluster-api-provider-aws copied to clipboard

:bug: Write sensitive cloud-init user-data into /etc/cloud/cloud.cfg.d

cluster-api-provider-aws
cluster-api-provider-aws copied to clipboard