cluster-api-provider-aws 🐛: ec2/byoip: fix EIP leak when creating machine

What type of PR is this?

/kind bug

What this PR does / why we need it:

The instance creation flow is creating by default EIP to instances even if the BYO IP flow is set. BYO IPv4 creates and associates the EIP to instance after it is created, preventing paying for additional EIP (amazon-provided) when creating the instance when the BYO IPv4 Pool is defined to be used by the machine.

Furthermore, the fix provides additional checks to prevent duplicated EIP in the BYO IP reconciliation loop. The extra checks include running the EIP association many times, while the EIP is already associated, and failures in the log when running the EIP association prematurely - when the instance isn't ready, Eg ec2 in pending state.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #5038

Special notes for your reviewer:

Checklist:

[ ] squashed commits
[ ] includes documentation (N/A)
[x] includes emojis
[ ] adds unit tests
[ ] adds or updates e2e tests

Release note:

fix duplicated/leaked EIP when using BYO IPv4 on Machines.

Jun 27 '24 20:06 mtulio

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jun 27 '24 20:06 k8s-ci-robot

/ok-to-test

Jun 27 '24 20:06 nrb

/test ?

Jun 28 '24 04:06 mtulio

@mtulio: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-aws-build
/test pull-cluster-api-provider-aws-build-docker
/test pull-cluster-api-provider-aws-test
/test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-aws-apidiff-main
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-clusterclass
/test pull-cluster-api-provider-aws-e2e-conformance
/test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
/test pull-cluster-api-provider-aws-e2e-eks
/test pull-cluster-api-provider-aws-e2e-eks-gc
/test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-aws-apidiff-main
pull-cluster-api-provider-aws-build
pull-cluster-api-provider-aws-build-docker
pull-cluster-api-provider-aws-test
pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jun 28 '24 04:06 k8s-ci-robot

/test pull-cluster-api-provider-aws-e2e

Jun 28 '24 04:06 mtulio

Premature failure.

/test pull-cluster-api-provider-aws-e2e

Jun 28 '24 14:06 mtulio

/test pull-cluster-api-provider-aws-e2e

Jul 03 '24 15:07 mtulio

/test pull-cluster-api-provider-aws-e2e

Okay, previous test failures were flake. The latest run pass. OpenShift e2e BYOIP test is also passing install:

time="2024-06-28T19:16:32Z" level=debug msg="E0628 19:16:32.239704     331 
awsmachine_controller.go:544] \"failed to reconcile BYO Public IPv4\" 
err=\"unable to reconcile Elastic IP Pool to instance \\\"i-05af21b09d3552d0f\\\" with state: pending\""

[...]

time="2024-06-28T19:16:50Z" level=debug msg="I0628 19:16:50.432073     331 eip.go:44] 
\"machine is already associated with an Elastic IP with custom Public IPv4 pool\" 
controller=\"awsmachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" 
controllerKind=\"AWSMachine\" AWSMachine=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt-bootstrap\" namespace=\"openshift-cluster-api-guests\" 
name=\"ci-op-f04mmlsn-88881-49rgt-bootstrap\" 
reconcileID=\"825b335b-569b-4a16-891d-c6b7fb5a9db6\" 
machine=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt-bootstrap\" 
cluster=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt\" 
eip=\"eipalloc-0678f0da2c1ecd771\" eip-address=\"157.254.217.22\" 
eip-associationID=\"eipassoc-0ea51b9c68cdfb171\" eip-instance=\"i-05af21b09d3552d0f\""

This PR is ready for review. PTAL? /assign @Ankitasw @dlipovetsky cc @r4f4 @nrb

/test pull-cluster-api-provider-aws-e2e-eks

Jul 04 '24 17:07 mtulio

e2e EKS test is passing.

I am holding this PR to address the @r4f4 's feedback to decrease the warnings in the code, silent re-queuing expected states.

/hold

Jul 05 '24 00:07 mtulio

Looks like all jobs are green! :) Downstream/OpenShift jobs are passing too: 1828801695790927872, 1828558006120353792

wget -O /tmp/install.log https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_installer/8676/pull-ci-openshift-installer-master-altinfra-e2e-aws-ovn-public-ipv4-pool/1828801695790927872/artifacts/e2e-aws-ovn-public-ipv4-pool/ipi-install-install/artifacts/.openshift_install-1724860949.log

$ grep -c 'Reconciling machine with custom Public IPv4 Pool' /tmp/install.log
12

$ grep -c 'Found instance in pending state while reconciling publicIpv4Pool, requeue' /tmp/install.log
2

$ grep -c 'Failed to reconcile BYO Public IPv4' /tmp/install.log
0

$ grep -c 'Unable to reconcile Elastic IP Pool for instance' /tmp/install.log
2

$ grep -c 'error checking if addresses exists for Elastic IP Pool to machine' /tmp/install.log
0

$ grep -c 'unexpected number of EIPs allocated to the role. expected 1' /tmp/install.log
0

$ grep -c 'Machine is already associated with an Elastic IP with custom Public IPv4 pool' /tmp/install.log
9

$ grep -c 'failed to reconcile Elastic IP' /tmp/install.log
0

Starting the project e2e: /test pull-cluster-api-provider-aws-e2e

Aug 28 '24 18:08 mtulio

LGTM

Aug 28 '24 20:08 r4f4

/test pull-cluster-api-provider-aws-e2e

Aug 28 '24 20:08 mtulio

/test pull-cluster-api-provider-aws-e2e

still hitting infra issues/timeouts:

/test pull-cluster-api-provider-aws-e2e /test pull-cluster-api-provider-aws-e2e-eks

Aug 29 '24 03:08 mtulio

/retest

Will re-review today.

Aug 29 '24 14:08 nrb

/retest

Aug 29 '24 16:08 nrb

Still seeing e2e issues that seems to be unrelated. Thanks @nrb for re-triggering. Looks like e2e-eks is passing now.

I will trigger the regular e2e again:

/test pull-cluster-api-provider-aws-e2e

Downstream openshift installer jobs are triggered too: https://github.com/openshift/installer/pull/8676#issuecomment-2318667965

Aug 29 '24 19:08 mtulio

/test pull-cluster-api-provider-aws-e2e

Aug 30 '24 04:08 mtulio

/test ?

Aug 30 '24 20:08 mtulio

@mtulio: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-aws-build
/test pull-cluster-api-provider-aws-build-docker
/test pull-cluster-api-provider-aws-test
/test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-aws-apidiff-main
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-clusterclass
/test pull-cluster-api-provider-aws-e2e-conformance
/test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
/test pull-cluster-api-provider-aws-e2e-eks
/test pull-cluster-api-provider-aws-e2e-eks-gc
/test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-aws-apidiff-main
pull-cluster-api-provider-aws-build
pull-cluster-api-provider-aws-build-docker
pull-cluster-api-provider-aws-test
pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Aug 30 '24 20:08 k8s-ci-robot

/test pull-cluster-api-provider-aws-e2e-conformance

Aug 30 '24 20:08 mtulio

Hi @nrb , would you mind taking a review in this bug, please?

As you see in the last comments, we are struggling to run the job pull-cluster-api-provider-aws-e2e, but the e2e-eks, and ele-conformance are passing (not sure under the hood if it's all touching this change, but considering this change is changing Machine creation flow, it think it should).

Furthermore, in downstream/OpenShift we are running several presubmit jobs (Public IPv4 pool is default over aws jobs) across the PR https://github.com/openshift/installer/pull/8676 (vendoring this PR). I also introduced a new presubmit (https://github.com/openshift/release/pull/56114) to enforce to disable the pool in CAPA provisioning to test the non-pool flow and it is all passing: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8676/pull-ci-openshift-installer-master-e2e-aws-ovn-public-ipv4-pool-disabled/1829621227333881856

I will trigger again, but also open to hear from you if you could share your tougths of another job to validate it, or those presented is ok.

Looking forward to hear from you, thanks!

cc @r4f4 @patrickdillon

/test pull-cluster-api-provider-aws-e2e

Aug 31 '24 00:08 mtulio

/assign @nrb

Sep 02 '24 14:09 mtulio

/test pull-cluster-api-provider-aws-e2e

Sep 02 '24 14:09 mtulio

tl;dr: looks like pull-cluster-api-provider-aws-e2e is failing for unrelated change in this PR, it is trying to create a cluster with an AMI that does not exist in the test account.

After some investigation with @nrb, we are seeing the job e2e always failing the test [It] [unmanaged] [Cluster API Framework] Clusterctl Upgrade Spec [from latest v1beta1 release to v1beta2] Should create a management cluster and then upgrade all the providers .

We are seeing in the Control Plane spec that the AMI for kube 1.25 isn't available:

  - lastTransitionTime: "2024-09-02T15:00:37Z"
    message: 'failed to create AWSMachine instance: failed to find ami: found no AMIs
      with the name: "capa-ami-ubuntu-18.04-?1.25.0-*"'

Looks like the CAPI e2e[1] is setting the following variable:

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/abe918cfe1038c714792860b23632e38c7e0438f/test/e2e/data/e2e_conf.yaml#L207-L209

in the test spec: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/abe918cfe1038c714792860b23632e38c7e0438f/test/e2e/suites/unmanaged/unmanaged_CAPI_test.go#L112-L129

Causing the failures when looking up for an AMI that does not exists in the test account. (maybe had been pruned or 1.25 is not supported and dont need anymore?)

[1] https://github.com/kubernetes-sigs/cluster-api/blob/main/test/e2e/clusterctl_upgrade.go#L237C33-L237C58

Sep 03 '24 17:09 mtulio

For reviewers: this PR is general ready for review. Following my last comment, the failure is unrelated with this PR.

Sep 04 '24 19:09 mtulio

/override pull-cluster-api-provider-aws-e2e

This test is failing due to something unrelated right now.

Sep 04 '24 19:09 nrb

@nrb: nrb unauthorized: /override is restricted to Repo administrators.

In response to this:

/override pull-cluster-api-provider-aws-e2e

This test is failing due to something unrelated right now.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sep 04 '24 19:09 k8s-ci-robot

PR https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5118 merged, PR rebased to re-test the failed upgrade test. /test pull-cluster-api-provider-aws-e2e

Sep 09 '24 15:09 mtulio

I don't think the failure is related to this PR, nor the previous image issues we were seeing; the AWSMachine resources get to a Ready state and don't have errors themselves.

/retest

Sep 09 '24 18:09 nrb

@mtulio: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command pull-cluster-api-provider-aws-e2e 4626a6a link false /test pull-cluster-api-provider-aws-e2e Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

I am seeing a lot CloudFormation to provision required environment, I can't see if could be related.

/retest

Sep 10 '24 00:09 mtulio