cluster-api-provider-aws icon indicating copy to clipboard operation
cluster-api-provider-aws copied to clipboard

:bug: Write sensitive cloud-init user-data into /etc/cloud/cloud.cfg.d

Open dlipovetsky opened this issue 1 year ago • 42 comments
trafficstars

What type of PR is this? /kind bug

What this PR does / why we need it:

The boothook fetches sensitive user-data from an AWS service (Secrets Manager, or SSM Parameter Store). This PR changes the mechanism by the way this user-data is passed to cloud-init once it's fetched.

Previously, the boothook wrote the sensitive user-data to /etc/secret-userdata.txt, and cloud-init read it via an #include directive. Now, the boothook writes it to /etc/cloud/cloud.cfg.d/99_kubeadm_bootstrap.cfg. The directory is a well-documented configuration source used by cloud-init, and exists wherever cloud-init is installed. The file is given the prefix 99_ to give it high priority over other configuration in that directory.

Previously, cloud-init read sensitive user-data from /etc/secret-userdata.txt via an #include directive. Now, it reads the sensitive user-data simply because it is located in the /etc/cloud/cloud.cfg.d directory. Therefore, the #include directive is no longer used, and is removed.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #4745

Special notes for your reviewer:

If we merge this PR, we can revert the workaround introduced in https://github.com/kubernetes-sigs/image-builder/pull/406.

Checklist:

  • [x] squashed commits
  • [x] includes documentation
  • [x] includes emojis
  • [ ] adds unit tests
  • [ ] adds or updates e2e tests

Release note:

Changes the mechanism to pass sensitive user-data to cloud-init, making CAPA compatible with cloud-init v23.3 and newer.

dlipovetsky avatar Jan 18 '24 22:01 dlipovetsky

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from dlipovetsky. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jan 18 '24 22:01 k8s-ci-robot

This change must be validated e2e. I've already tested it using my own AWS account, so I'm confident it will pass e2e.

/test pull-cluster-api-provider-aws-e2e

dlipovetsky avatar Jan 18 '24 23:01 dlipovetsky

/cc @randomvariable You know this area. Tagging you, in case you have questions/concerns about this change.

dlipovetsky avatar Jan 19 '24 02:01 dlipovetsky

I'd like to backport this to supported release branches, too.

dlipovetsky avatar Jan 19 '24 18:01 dlipovetsky

/milestone v2.4.0

richardcase avatar Jan 23 '24 14:01 richardcase

/lgtm

AndiDog avatar Jan 24 '24 07:01 AndiDog

/retest

faiq avatar Jan 24 '24 14:01 faiq

This makes sense, but I think further down the line we might want to rethink restarting the cloud-init process.

faiq avatar Jan 24 '24 14:01 faiq

/lgtm

faiq avatar Jan 24 '24 14:01 faiq

/retest

dlipovetsky avatar Jan 25 '24 19:01 dlipovetsky

/test pull-cluster-api-provider-aws-e2e

Ankitasw avatar Jan 31 '24 10:01 Ankitasw

@dlipovetsky looks like E2E tests needs to be fixed

Ankitasw avatar Feb 01 '24 06:02 Ankitasw

/test pull-cluster-api-provider-aws-e2e

richardcase avatar Feb 04 '24 12:02 richardcase

/test pull-cluster-api-provider-aws-apidiff-main

richardcase avatar Feb 04 '24 12:02 richardcase

Last e2e failure was due to reaching EventBridge resource quota. From the manager log:

E0204 12:33:36.396548       1 awscluster_controller.go:309] "non-fatal: failed to set up EventBridge" err="unable to create rule: LimitExceededException: The requested resource exceeds the maximum number allowed." controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="functional-test-multi-az-nacdlz/functional-test-multi-az-k1r9x8" namespace="functional-test-multi-az-nacdlz" name="functional-test-multi-az-k1r9x8" reconcileID="83e72c26-318f-4b1e-8575-eaddab4426f4" cluster="functional-test-multi-az-nacdlz/functional-test-multi-az-k1r9x8"

dlipovetsky avatar Feb 05 '24 20:02 dlipovetsky

/test pull-cluster-api-provider-aws-e2e

richardcase avatar Feb 06 '24 13:02 richardcase

/test pull-cluster-api-provider-aws-e2e

nrb avatar Feb 06 '24 20:02 nrb

Checking if the fix to https://github.com/kubernetes/k8s.io/issues/6381 has an effect.

/retest

dlipovetsky avatar Feb 07 '24 16:02 dlipovetsky

Doesn't look like this failure was from the event bridge issue.

E0207 17:24:40.182327 1 controller.go:329] "Reconciler error" err="failed to retrieve bootstrap data secret machine-pool-fiyovf-mp-0 for AWSMachinePool machine-pool-oevgy9/machine-pool-fiyovf-mp-0: Secret \"machine-pool-fiyovf-mp-0\" not found" controller="awsmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachinePool" AWSMachinePool="machine-pool-oevgy9/machine-pool-fiyovf-mp-0" namespace="machine-pool-oevgy9" name="machine-pool-fiyovf-mp-0" reconcileID="3ef4b571-b25a-447c-b039-3e088809eb3a"

nrb avatar Feb 07 '24 18:02 nrb

/retest

dlipovetsky avatar Feb 10 '24 00:02 dlipovetsky

/retest

dlipovetsky avatar Feb 12 '24 20:02 dlipovetsky

/test pull-cluster-api-provider-aws-e2e

richardcase avatar Feb 14 '24 15:02 richardcase

I'm going to rebase on main, in case there have been some changes that affect e2e.

dlipovetsky avatar Feb 15 '24 17:02 dlipovetsky

New changes are detected. LGTM label has been removed.

k8s-ci-robot avatar Feb 15 '24 17:02 k8s-ci-robot

This makes sense, but I think further down the line we might want to rethink restarting the cloud-init process.

This might be possible if we implement our own Part Handler that calls the Secrets or SSM service.

dlipovetsky avatar Feb 15 '24 17:02 dlipovetsky

/test pull-cluster-api-provider-aws-e2e

dlipovetsky avatar Feb 15 '24 17:02 dlipovetsky

/retest

nrb avatar Feb 20 '24 16:02 nrb

I still have no idea why the same 7 tests consistently fail.

dlipovetsky avatar Feb 21 '24 00:02 dlipovetsky

/test pull-cluster-api-provider-aws-build-docker

nrb avatar Mar 04 '24 17:03 nrb

@dlipovetsky maybe you could try rebasing the PR and then run the E2E tests?

Ankitasw avatar Mar 07 '24 07:03 Ankitasw