
`clusterctl move` not compatible with `AWSMachinePools`

Open · AverageMarcus opened this issue 3 years ago

/kind bug

What steps did you take and what happened:

  1. Spin up a bootstrap cluster in Kind
  2. Create a new target cluster with at least one AWSMachinePool defined
  3. Wait for target cluster to be created and ready
  4. Perform a pivot of the cluster using `clusterctl move` so the target cluster is self-managing
  5. The following error will be reported:
Performing move...
Discovering Cluster API objects
Moving Cluster API objects Clusters=1
Moving Cluster API objects ClusterClasses=0
Creating objects in the target cluster
Error: [action failed after 10 attempts: error creating "infrastructure.cluster.x-k8s.io/v1beta1, Kind=AWSMachinePool" default/golem-def00a: admission webhook "validation.awsmachinepool.infrastructure.cluster.x-k8s.io" denied the request: AWSMachinePool.infrastructure.cluster.x-k8s.io "golem-def00a" is invalid: spec.awsLaunchTemplate.rootVolume.deviceName: Forbidden: root volume shouldn't have device name,
action failed after 10 attempts: error creating "infrastructure.cluster.x-k8s.io/v1beta1, Kind=AWSMachinePool" default/golem-def00b: admission webhook "validation.awsmachinepool.infrastructure.cluster.x-k8s.io" denied the request: AWSMachinePool.infrastructure.cluster.x-k8s.io "golem-def00b" is invalid: spec.awsLaunchTemplate.rootVolume.deviceName: Forbidden: root volume shouldn't have device name,
action failed after 10 attempts: error creating "infrastructure.cluster.x-k8s.io/v1beta1, Kind=AWSMachinePool" default/golem-def00c: admission webhook "validation.awsmachinepool.infrastructure.cluster.x-k8s.io" denied the request: AWSMachinePool.infrastructure.cluster.x-k8s.io "golem-def00c" is invalid: spec.awsLaunchTemplate.rootVolume.deviceName: Forbidden: root volume shouldn't have device name]
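
For reference, the flow in steps 1–4 looks roughly like this; a sketch where the cluster name, template flavor, and file names are illustrative, not taken from this report:

```sh
# Sketch of the reproduction flow; names and flavor are illustrative.
kind create cluster --name bootstrap
clusterctl init --infrastructure aws

# Create a workload cluster that includes at least one AWSMachinePool
# (CAPA publishes a machinepool flavor template).
clusterctl generate cluster golem --flavor machinepool > cluster.yaml
kubectl apply -f cluster.yaml

# After the workload cluster is ready and has been initialized as a
# management cluster, pivot to it. This is the step that fails with
# the error above.
clusterctl get kubeconfig golem > golem.kubeconfig
clusterctl move --to-kubeconfig=golem.kubeconfig
```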

What did you expect to happen: All resources moved to the target cluster successfully.

Anything else you would like to add: The rootVolume.deviceName is not set when the cluster resources are first created in the bootstrap cluster. Once the AWS Launch Template has been created, the details of the root volume are retrieved and the deviceName value is populated on the AWSMachinePool resource(s). When the resources are later moved to the new cluster, the property is still populated and is rejected by the admission webhook, preventing the move from completing.
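
Illustratively (the values below are made up, not taken from this report), the resource goes from what the user applies:

```yaml
# As created in the bootstrap cluster (illustrative values):
spec:
  awsLaunchTemplate:
    rootVolume:
      size: 100
```

to what the controllers leave behind once the Launch Template exists:

```yaml
# After reconciliation (deviceName populated from the Launch Template):
spec:
  awsLaunchTemplate:
    rootVolume:
      size: 100
      deviceName: /dev/sda1
```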

This value only seems to be used during the initial setup of the Launch Template and, as far as I can see, is never referenced by anything else afterwards. Manually removing the deviceName property from the AWSMachinePool resources allows the move to proceed, but the value is never re-populated, as it is only fetched when the Launch Template is first created.
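
One way to apply that manual workaround is a JSON patch that drops the field; a sketch, with the resource name taken from the error output above:

```sh
# Remove the populated deviceName before running clusterctl move.
kubectl patch awsmachinepool golem-def00a -n default --type=json \
  -p='[{"op":"remove","path":"/spec/awsLaunchTemplate/rootVolume/deviceName"}]'
```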

Also discussed on Slack: https://kubernetes.slack.com/archives/CD6U2V71N/p1658902259480619

Environment:

  • Cluster-api-provider-aws version: v1.4.1
  • Cluster-api version: v1.1.5
  • clusterctl version: v1.2.0
  • Kubernetes version: (use kubectl version): v1.21.1
  • OS (e.g. from /etc/os-release): Ubuntu

AverageMarcus · Jul 27 '22 15:07

Thanks for reporting this issue!

This is happening because the deviceName field under the rootVolume section is not allowed to be set during creation, but is populated by the controllers afterwards. During clusterctl move, the objects are re-created in the target cluster with that field already set, so creation fails webhook validation.

The proper fix has to wait for the v1beta2 release, as it requires webhook/field changes. As a workaround, if users manually delete the deviceName before the move, it won't get re-added by the controllers and the move succeeds.
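
Before running the move, it may help to confirm the field is cleared on every AWSMachinePool; a sketch using standard kubectl JSONPath output:

```sh
# List each AWSMachinePool and its current rootVolume deviceName (blank = unset).
kubectl get awsmachinepools -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.awsLaunchTemplate.rootVolume.deviceName}{"\n"}{end}'
```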

/triage accepted

sedefsavas · Jul 27 '22 16:07

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Oct 25 '22 16:10