cluster-api-provider-aws

Machine with cloud-init 23.3.0 or newer fails to join cluster

Open dlipovetsky opened this issue 1 year ago • 7 comments

/kind bug

What steps did you take and what happened:

I used https://github.com/kubernetes-sigs/image-builder/ to create an Ubuntu 20.04 AMI with the latest available cloud-init package, 23.3.3. The machine fails to join the cluster.

What did you expect to happen:

The machine should join the cluster.

Anything else you would like to add:

In https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/1490, CAPA began writing sensitive user-data to AWS Secrets Manager (https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/1924 added support for an alternative, the SSM Parameter Store). CAPA replaced the user-data produced by CABPK with a mechanism to fetch the user-data from the service. This mechanism relied on an "include" that would, by design, fail the first time cloud-init ran. CAPA relied on cloud-init ignoring the failure.

As of https://github.com/canonical/cloud-init/pull/367, cloud-init stopped ignoring the failure by default, but introduced a feature flag that restored the old behavior of ignoring it. With the default settings, the cloud-init boot failed, so https://github.com/kubernetes-sigs/image-builder/pull/406 used the feature flag as a workaround.

More recently, as of https://github.com/canonical/cloud-init/pull/4228, the feature flag itself was removed. Without the feature flag, the existing workaround has no effect, and cloud-init boot fails.
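To make the failure mode concrete, here is a toy model of the behaviors described above. It does not call cloud-init. `process_include`, `UserDataError`, and the `error_on_failure` parameter are hypothetical names that stand in for cloud-init's include handling and the feature flag; treat it as a sketch under those assumptions, not as cloud-init's implementation.

```python
class UserDataError(Exception):
    """Hypothetical error standing in for a failed cloud-init boot."""


def process_include(fetch, error_on_failure):
    """Toy model of handling an "include" in user-data.

    fetch: callable that returns the included content or raises OSError.
    error_on_failure: stands in for the feature flag. False means the
    failure is ignored, which is the behavior CAPA relied on.
    """
    try:
        return fetch()
    except OSError as exc:
        if error_on_failure:
            raise UserDataError(f"include failed: {exc}") from exc
        # Failure ignored: processing continues without the included part.
        return None


def failing_fetch():
    # Models the include that, by design, fails the first time cloud-init runs.
    raise OSError("not available yet")


def boot(error_on_failure):
    """Return "ok" if the modeled boot survives the failing include."""
    try:
        process_include(failing_fetch, error_on_failure)
    except UserDataError:
        return "failed"
    return "ok"
```

In this model, `boot(False)` corresponds to the original behavior and to the image-builder workaround, while `boot(True)` corresponds to the default after cloud-init #367 and to the only available behavior once the feature flag was removed.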

@supershal and I looked into this issue, and filed https://github.com/kubernetes-sigs/image-builder/issues/1333. We finally understand the root cause.

The most recent CAPA-maintained AMIs were created with cloud-init 22.4.2, instead of the default cloud-init version.

Environment:

  • Cluster-api-provider-aws version: main
  • Kubernetes version: (use kubectl version): v1.27.8
  • OS (e.g. from /etc/os-release): Ubuntu 20.04

dlipovetsky avatar Jan 18 '24 22:01 dlipovetsky

/triage accepted
/priority important-soon

dlipovetsky avatar Jan 18 '24 22:01 dlipovetsky

/assign @dlipovetsky

dlipovetsky avatar Jan 18 '24 22:01 dlipovetsky

This affects cloud-init v23.3.0 and newer. See https://github.com/canonical/cloud-init/blob/23.3.x/ChangeLog#L98
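A minimal sketch of checking whether a given cloud-init version falls in the affected range, assuming the version string starts with a dotted numeric version (a distro suffix, if any, is ignored). The helper names `parse_version` and `is_affected` are hypothetical and only for illustration.

```python
import re

# First affected release, per the comment above.
FIRST_AFFECTED = (23, 3, 0)


def parse_version(version):
    """Parse the leading "MAJOR.MINOR[.PATCH]" of a version string.

    Anything after the numeric prefix (for example a package suffix)
    is ignored. A missing patch component is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_affected(version):
    """Return True if the version is 23.3.0 or newer."""
    return parse_version(version) >= FIRST_AFFECTED
```

With this, `is_affected("23.3.3")` is true for the version used in the report, and `is_affected("22.4.2")` is false for the version used in the CAPA-maintained AMIs.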

dlipovetsky avatar Mar 02 '24 00:03 dlipovetsky

#4746 is a hack, but it's arguably an improvement over #1490, which (eventually) required us to modify cloud-init internals in order to work.

Frankly, if we don't like #4746, let's consider reverting the functionality in #1490 and #1924. By design, the bootstrap provider passes secrets in user-data, and the infrastructure provider is not in a position to interpose, without hacks. I think this is something to be discussed at the bootstrap provider level. This is, after all, a problem that affects all infra providers that rely on cloud-init user-data.

dlipovetsky avatar Mar 22 '24 01:03 dlipovetsky

We would not need to interpose on cloud-init if the user-data did not contain sensitive data (the bootstrap token). See https://github.com/kubernetes-sigs/cluster-api/issues/5294 and https://github.com/kubernetes-sigs/cluster-api/issues/9631

dlipovetsky avatar Mar 22 '24 17:03 dlipovetsky

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Jun 20 '24 18:06 k8s-triage-robot

/triage accepted
/priority important-soon

dlipovetsky avatar Sep 17 '24 00:09 dlipovetsky

/remove-triage accepted

k8s-triage-robot avatar Dec 16 '24 01:12 k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 16 '25 01:03 k8s-triage-robot

/lifecycle rotten

k8s-triage-robot avatar Apr 15 '25 02:04 k8s-triage-robot