cluster-api-provider-aws ✨ Add AWSMachines to back the ec2 instances in AWSMachinePools


What type of PR is this? /kind feature

What this PR does / why we need it: Implements the MachinePool Machines Cluster API proposal.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #4184

Special notes for your reviewer:

Checklist:

[X] squashed commits
[ ] includes documentation
[x] adds unit tests
[x] adds or updates e2e tests

Release note:

Added AWSMachines to back the EC2 instances in AWSMachinePools and AWSManagedMachinePools.

Sep 27 '23 21:09 cnmcavoy

Hi @cnmcavoy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sep 27 '23 21:09 k8s-ci-robot

~Still missing new tests to cover the functionality, have been testing locally with tilt~

Edit: tests have been added.

Sep 27 '23 21:09 cnmcavoy

/ok-to-test

Sep 28 '23 05:09 Skarlso

/assign

Oct 13 '23 04:10 Skarlso

@cnmcavoy is there a final review pending for this PR, or any other items?

Feb 01 '24 06:02 Ankitasw

@Ankitasw I think all the questions have been answered, it probably needs a new reviewer to give it a look over for a lgtm. It's ready to be merged if there are no more review items to address.

Feb 01 '24 16:02 cnmcavoy

I can have a look into this, @cnmcavoy – do you want to resolve conflicts first?

Apr 16 '24 19:04 AndiDog


Sure, I think I got them all sorted.

Apr 16 '24 21:04 cnmcavoy

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
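The triage schedule above can be modeled as a simple function of continuous inactivity. This is a minimal Go sketch, assuming that any activity (such as the /remove-lifecycle stale comment used later in this thread) resets the counter to zero; lifecycleFor and the Lifecycle constants are hypothetical names, not the triage bot's actual implementation.

```go
package main

import "fmt"

// Lifecycle is a simplified, hypothetical model of the triage states.
type Lifecycle string

const (
	Fresh  Lifecycle = "fresh"
	Stale  Lifecycle = "lifecycle/stale"
	Rotten Lifecycle = "lifecycle/rotten"
	Closed Lifecycle = "closed"
)

// Cumulative thresholds in days, taken from the bot message:
// 90d to stale, +30d to rotten, +30d more to closed.
const (
	staleAfter  = 90
	rottenAfter = staleAfter + 30
	closeAfter  = rottenAfter + 30
)

// lifecycleFor returns the state a PR reaches after inactiveDays of
// uninterrupted inactivity (assumption: any activity resets the count).
func lifecycleFor(inactiveDays int) Lifecycle {
	switch {
	case inactiveDays >= closeAfter:
		return Closed
	case inactiveDays >= rottenAfter:
		return Rotten
	case inactiveDays >= staleAfter:
		return Stale
	default:
		return Fresh
	}
}

func main() {
	for _, d := range []int{30, 90, 120, 150} {
		fmt.Printf("%3d days -> %s\n", d, lifecycleFor(d))
	}
}
```

Under this model the PR would have been closed after roughly 150 days without activity, which is why a single /remove-lifecycle stale comment is enough to restart the clock.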

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jul 15 '24 22:07 k8s-triage-robot

/remove-lifecycle stale

Jul 16 '24 08:07 AndiDog

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign richardcase for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

Jul 22 '24 19:07 k8s-ci-robot

As an update: I'll try to test this major feature. PR looks mostly fine, I think. I'm only unsure about the owner reference in the non-happy case.

Aug 13 '24 09:08 AndiDog

I got a working test environment, and the AWSMachine/Machine objects get created correctly 🍀.

First blocking issue I found: deletion by downscaling the ASG/AWSMachinePool leads to this permanent error rather than deleting the AWSMachine:

  failureMessage: EC2 instance state "terminated" is unexpected
  failureReason: UpdateError
  instanceState: terminated
  ready: false
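One way to frame this first blocking issue: after an ASG scale-in, a terminated instance is expected for an AWSMachine owned by a MachinePool, while it remains an error for a standalone AWSMachine. The Go sketch below illustrates that distinction under stated assumptions; the InstanceState values follow EC2 instance state names, but decide, Action and the ownedByMachinePool flag are hypothetical and do not reflect CAPA's actual reconciler.

```go
package main

import "fmt"

// InstanceState mirrors a subset of EC2 instance state names.
type InstanceState string

const (
	Pending      InstanceState = "pending"
	Running      InstanceState = "running"
	ShuttingDown InstanceState = "shutting-down"
	Terminated   InstanceState = "terminated"
	Stopped      InstanceState = "stopped"
)

// Action is the hypothetical outcome of reconciling one AWSMachine.
type Action string

const (
	MarkReady  Action = "mark-ready"
	Wait       Action = "wait"
	DeleteSelf Action = "delete-awsmachine"
	Fail       Action = "set-failure"
)

// decide picks an action for an observed instance state. For a
// MachinePool-owned AWSMachine, a terminating or terminated instance is
// treated as a normal scale-in and the AWSMachine is cleaned up; for a
// standalone AWSMachine it is still a failure, roughly like the
// "EC2 instance state "terminated" is unexpected" error above.
func decide(state InstanceState, ownedByMachinePool bool) Action {
	switch state {
	case Running:
		return MarkReady
	case Pending:
		return Wait
	case ShuttingDown, Terminated:
		if ownedByMachinePool {
			return DeleteSelf
		}
		return Fail
	default:
		// Simplification: keep waiting on any other state (e.g. stopped).
		return Wait
	}
}

func main() {
	fmt.Println(decide(Terminated, true))  // pool-owned AWSMachine
	fmt.Println(decide(Terminated, false)) // standalone AWSMachine
}
```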

The Machine/AWSMachine combo also can't be deleted due to:

capa_control… │ I0822 16:04:49.507270       1 awsmachine_controller.go:303] "Handling deleted AWSMachine"
capa_control… │ E0822 16:04:49.508449       1 awsmachine_controller.go:308] "unable to delete machine" err="failed to get raw userdata: error retrieving bootstrap data: linked Machine's bootstrap.dataSecretName is nil"

Other than that, the Node object went away and the cluster reacted according to the node deletion. @cnmcavoy what happened when you tested this? And which other cases do you think should be tested? I guess cluster-autoscaler scaling of ASGs may be a good external trigger that CAPA should have no problem with – I can test that.

Aug 22 '24 16:08 AndiDog

@AndiDog thanks for taking a look. I think that's slightly concerning, because there should be test cases covering that sort of condition.

I suspect this PR has been left too long and the code has rotted. Indeed doesn't use ASGs anymore for autoscaling, so I can't really justify the time investment it would take to cut a new PR and start this fresh.

Aug 27 '24 21:08 cnmcavoy
