cluster-api-provider-aws Machine with cloud-init 23.3.0 or newer fails to join cluster

/kind bug

What steps did you take and what happened:

I used https://github.com/kubernetes-sigs/image-builder/ to create an Ubuntu 20.04 AMI with the latest available cloud-init package, 23.3.3. The machine fails to join the cluster.

What did you expect to happen:

The machine should join the cluster.

Anything else you would like to add:

In https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/1490, CAPA began writing sensitive user-data to AWS Secrets Manager (https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/1924 added support for an alternative, the SSM Parameter Store). CAPA replaced the user-data produced by CABPK with a mechanism to fetch the user-data from the service. This mechanism relied on an "include" that would, by design, fail the first time cloud-init ran. CAPA relied on cloud-init ignoring the failure.

As of https://github.com/canonical/cloud-init/pull/367, cloud-init stopped ignoring the failure by default, but introduced a feature flag that allowed cloud-init to ignore the failure, as it had in the past. The default settings caused the cloud-init boot to fail, and https://github.com/kubernetes-sigs/image-builder/pull/406 used the feature flag as a workaround.

More recently, as of https://github.com/canonical/cloud-init/pull/4228, the feature flag itself was removed. Without the feature flag, the existing workaround has no effect, and cloud-init boot fails.
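To make the three stages concrete, here is a toy model of the behavior described above. It is not cloud-init code: `process_include`, `IncludeFetchError`, `failing_fetch`, and the `error_on_failure` parameter are hypothetical names chosen for illustration, and the real implementation may be structured quite differently.

```python
from typing import Callable, Optional


class IncludeFetchError(Exception):
    """Hypothetical error: an included user-data part could not be fetched."""


def process_include(fetch: Callable[[], str], error_on_failure: bool) -> Optional[str]:
    """Fetch an included user-data part (illustrative model only).

    error_on_failure=False models the original behavior, and the behavior
    restored by the feature-flag workaround: the failure is swallowed.
    error_on_failure=True models the newer default, and the only behavior
    left once the feature flag was removed: the failure propagates.
    """
    try:
        return fetch()
    except IncludeFetchError:
        if error_on_failure:
            raise
        return None


def failing_fetch() -> str:
    # Stand-in for the include that, by design, fails the first time.
    raise IncludeFetchError("include not available on first run")
```

With `error_on_failure=False`, `process_include(failing_fetch, False)` returns `None` and the boot would continue. With `error_on_failure=True`, the same call raises, which corresponds to the failed boot reported here.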

@supershal and I looked into this issue, and filed https://github.com/kubernetes-sigs/image-builder/issues/1333. We finally understand the root cause.

The most recent CAPA-maintained AMIs were created with cloud-init 22.4.2, instead of the default cloud-init version.

Environment:

Cluster-api-provider-aws version: main
Kubernetes version: (use kubectl version): v1.27.8
OS (e.g. from /etc/os-release): Ubuntu 20.04

Jan 18 '24 22:01 dlipovetsky

/triage accepted
/priority important-soon

Jan 18 '24 22:01 dlipovetsky

/assign @dlipovetsky

Jan 18 '24 22:01 dlipovetsky

This affects cloud-init v23.3.0 and newer. See https://github.com/canonical/cloud-init/blob/23.3.x/ChangeLog#L98
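For anyone who wants to check an image against that boundary, a minimal sketch follows. It assumes the input is a bare version string such as `23.3.3` (optionally with a leading `v` or a distro suffix after the numeric part); `parse_version` and `is_affected` are hypothetical helper names, not part of cloud-init or CAPA.

```python
import re
from typing import Tuple

# First affected release, per the comment above.
FIRST_AFFECTED = (23, 3, 0)


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse the leading MAJOR.MINOR[.PATCH] of a version string."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return (int(major), int(minor), int(patch or 0))


def is_affected(version: str) -> bool:
    """Return True if the version is 23.3.0 or newer."""
    return parse_version(version) >= FIRST_AFFECTED
```

For the versions mentioned in this issue, `is_affected("23.3.3")` is true and `is_affected("22.4.2")` is false.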

Mar 02 '24 00:03 dlipovetsky

#4746 is a hack, but it's arguably an improvement over #1490, which (eventually) required us to modify cloud-init internals in order to work.

Frankly, if we don't like #4746, let's consider reverting the functionality in #1490 and #1924. By design, the bootstrap provider passes secrets in user-data, and the infrastructure provider is not in a position to interpose, without hacks. I think this is something to be discussed at the bootstrap provider level. This is, after all, a problem that affects all infra providers that rely on cloud-init user-data.

Mar 22 '24 01:03 dlipovetsky

We would not need to interpose cloud-init, if the user-data did not contain the sensitive data (bootstrap token). See https://github.com/kubernetes-sigs/cluster-api/issues/5294 and https://github.com/kubernetes-sigs/cluster-api/issues/9631

Mar 22 '24 17:03 dlipovetsky

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

Jun 20 '24 18:06 k8s-triage-robot

/triage accepted
/priority important-soon

Sep 17 '24 00:09 dlipovetsky

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

Dec 16 '24 01:12 k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Mar 16 '25 01:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Apr 15 '25 02:04 k8s-triage-robot
