cluster-api-provider-aws

✨ Add AWSMachines to back the ec2 instances in AWSMachinePools

Open cnmcavoy opened this issue 2 years ago • 10 comments

What type of PR is this? /kind feature

What this PR does / why we need it: Implements the MachinePool Machines clusterAPI proposal

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #4184

Special notes for your reviewer:

Checklist:

  • [X] squashed commits
  • [ ] includes documentation
  • [x] adds unit tests
  • [x] adds or updates e2e tests

Release note:

Added AWSMachines to back the ec2 instances in AWSMachinePools and AWSManagedMachinePools.

cnmcavoy avatar Sep 27 '23 21:09 cnmcavoy

Hi @cnmcavoy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 27 '23 21:09 k8s-ci-robot

~Still missing new tests to cover the functionality, have been testing locally with tilt~

Edit: tests have been added.

cnmcavoy avatar Sep 27 '23 21:09 cnmcavoy

/ok-to-test

Skarlso avatar Sep 28 '23 05:09 Skarlso

/assign

Skarlso avatar Oct 13 '23 04:10 Skarlso

@cnmcavoy is there a final review pending for this PR, or any other items?

Ankitasw avatar Feb 01 '24 06:02 Ankitasw

@Ankitasw I think all the questions have been answered, it probably needs a new reviewer to give it a look over for a lgtm. It's ready to be merged if there are no more review items to address.

cnmcavoy avatar Feb 01 '24 16:02 cnmcavoy

I can have a look into this, @cnmcavoy – do you want to resolve conflicts first?

AndiDog avatar Apr 16 '24 19:04 AndiDog

> I can have a look into this, @cnmcavoy – do you want to resolve conflicts first?

Sure, I think I got them all sorted.

cnmcavoy avatar Apr 16 '24 21:04 cnmcavoy

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 15 '24 22:07 k8s-triage-robot

/remove-lifecycle stale

AndiDog avatar Jul 16 '24 08:07 AndiDog

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign richardcase for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jul 22 '24 19:07 k8s-ci-robot

As an update: I'll try to test this major feature. PR looks mostly fine, I think. I'm only unsure about the owner reference in the non-happy case.

AndiDog avatar Aug 13 '24 09:08 AndiDog

I got a working test environment, and the AWSMachine/Machine objects get created correctly 🍀.

First blocking issue I found: deletion by downscaling the ASG/AWSMachinePool leads to this permanent error rather than deleting the AWSMachine:

  failureMessage: EC2 instance state "terminated" is unexpected
  failureReason: UpdateError
  instanceState: terminated
  ready: false
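To illustrate the behavior I would have expected here, a minimal sketch, assuming a hypothetical `nextAction` helper and a hypothetical `ownedByPool` flag (neither exists in CAPA under these names, and this is not the PR's code). It only shows the idea that a terminated instance behind a pool-backed AWSMachine could lead to deletion of the object instead of a permanent failure, while the standalone branch mirrors the failure reported above:

```go
package main

import "fmt"

// InstanceState holds the EC2 instance lifecycle state names.
type InstanceState string

const (
	StatePending      InstanceState = "pending"
	StateRunning      InstanceState = "running"
	StateShuttingDown InstanceState = "shutting-down"
	StateTerminated   InstanceState = "terminated"
)

// Action is a hypothetical outcome of one reconcile pass (illustration only).
type Action string

const (
	ActionNone          Action = "none"
	ActionMarkFailed    Action = "mark-failed"
	ActionDeleteMachine Action = "delete-machine"
)

// nextAction is a hypothetical helper, not a CAPA function.
// For an instance that is going away, a pool-owned machine is deleted,
// because the ASG removed the instance on purpose; a standalone machine
// is marked failed, as in the status shown above.
func nextAction(state InstanceState, ownedByPool bool) Action {
	switch state {
	case StateShuttingDown, StateTerminated:
		if ownedByPool {
			return ActionDeleteMachine
		}
		return ActionMarkFailed
	default:
		return ActionNone
	}
}

func main() {
	fmt.Println(nextAction(StateTerminated, true))
	fmt.Println(nextAction(StateTerminated, false))
	fmt.Println(nextAction(StateRunning, true))
}
```

Whether deletion is the right reaction in every pool scenario is an open question; the sketch only separates the two ownership cases.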

The Machine/AWSMachine combo also can't be deleted due to:

capa_control… │ I0822 16:04:49.507270       1 awsmachine_controller.go:303] "Handling deleted AWSMachine"
capa_control… │ E0822 16:04:49.508449       1 awsmachine_controller.go:308] "unable to delete machine" err="failed to get raw userdata: error retrieving bootstrap data: linked Machine's bootstrap.dataSecretName is nil"

Other than that, the Node object went away and the cluster reacted according to the node deletion. @cnmcavoy what happened when you tested this? And which other cases do you think should be tested? I guess cluster-autoscaler scaling of ASGs may be a good external trigger that CAPA should have no problem with – I can test that.

AndiDog avatar Aug 22 '24 16:08 AndiDog

@AndiDog thanks for taking a look. I think that's slightly concerning, because there should be test cases covering that sort of condition.

I suspect this PR has been left too long and the code has rotted. Indeed doesn't use ASGs anymore for autoscaling, so I can't really justify the time investment it would take to cut a new PR and start this fresh.

cnmcavoy avatar Aug 27 '24 21:08 cnmcavoy