🐛: ec2/byoip: fix EIP leak when creating machine
What type of PR is this?
/kind bug
What this PR does / why we need it:
The instance creation flow creates an Amazon-provided EIP for instances by default, even when the BYO IP flow is set. With BYO IPv4, the EIP is created from the custom pool and associated with the instance after it is created, so this fix prevents paying for an additional (Amazon-provided) EIP when the machine is configured to use a BYO IPv4 pool.
Furthermore, the fix adds checks to prevent duplicated EIPs in the BYO IP reconciliation loop. The extra checks cover running the EIP association multiple times while the EIP is already associated, and avoid logging failures when the EIP association runs prematurely, i.e. before the instance is ready (e.g. an EC2 instance still in the pending state).
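To make the intended behavior concrete, here is a minimal, hypothetical Go sketch of the two guards described above. The names and types are illustrative only, not the actual CAPA code: skip the default Amazon-provided EIP when a BYO IPv4 pool is set, silently requeue while the instance is pending, and no-op when an association already exists.

```go
package main

import "fmt"

// Instance is an illustrative type only; not the actual CAPA API.
type Instance struct {
	State            string // e.g. "pending", "running"
	EIPAssociationID string // non-empty when an EIP is already associated
}

// shouldAllocateAmazonEIP: with a BYO Public IPv4 pool configured, the
// default Amazon-provided EIP must NOT be created at launch time.
func shouldAllocateAmazonEIP(byoIPv4Pool string) bool {
	return byoIPv4Pool == ""
}

// reconcileBYOEIP applies the extra guards from the fix: requeue (without
// logging an error) while the instance is not ready, and skip association
// when the machine already has an EIP from the custom pool.
func reconcileBYOEIP(inst Instance) string {
	switch {
	case inst.State != "running":
		return "requeue" // expected transient state, e.g. EC2 pending
	case inst.EIPAssociationID != "":
		return "noop" // already associated: avoid a duplicate EIP
	default:
		return "associate" // allocate from the BYO pool and associate
	}
}

func main() {
	fmt.Println(shouldAllocateAmazonEIP("ipv4pool-ec2-0123456789abcdef0")) // false: skip default EIP
	fmt.Println(reconcileBYOEIP(Instance{State: "pending"}))               // requeue
	fmt.Println(reconcileBYOEIP(Instance{State: "running", EIPAssociationID: "eipassoc-0ea5"})) // noop
	fmt.Println(reconcileBYOEIP(Instance{State: "running"})) // associate
}
```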
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5038
Special notes for your reviewer:
Checklist:
- [ ] squashed commits
- [ ] includes documentation (N/A)
- [x] includes emojis
- [ ] adds unit tests
- [ ] adds or updates e2e tests
Release note:
Fix duplicated/leaked EIPs when using a BYO Public IPv4 pool on Machines.
Hi @mtulio. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
/test ?
@mtulio: The following commands are available to trigger required jobs:
- /test pull-cluster-api-provider-aws-build
- /test pull-cluster-api-provider-aws-build-docker
- /test pull-cluster-api-provider-aws-test
- /test pull-cluster-api-provider-aws-verify
The following commands are available to trigger optional jobs:
- /test pull-cluster-api-provider-aws-apidiff-main
- /test pull-cluster-api-provider-aws-e2e
- /test pull-cluster-api-provider-aws-e2e-blocking
- /test pull-cluster-api-provider-aws-e2e-clusterclass
- /test pull-cluster-api-provider-aws-e2e-conformance
- /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
- /test pull-cluster-api-provider-aws-e2e-eks
- /test pull-cluster-api-provider-aws-e2e-eks-gc
- /test pull-cluster-api-provider-aws-e2e-eks-testing
Use /test all to run the following jobs that were automatically triggered:
- pull-cluster-api-provider-aws-apidiff-main
- pull-cluster-api-provider-aws-build
- pull-cluster-api-provider-aws-build-docker
- pull-cluster-api-provider-aws-test
- pull-cluster-api-provider-aws-verify
In response to this:
/test ?
/test pull-cluster-api-provider-aws-e2e
Premature failure.
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e
Okay, the previous test failures were flakes. The latest run passes. The OpenShift e2e BYOIP test is also passing the install:
time="2024-06-28T19:16:32Z" level=debug msg="E0628 19:16:32.239704 331
awsmachine_controller.go:544] \"failed to reconcile BYO Public IPv4\"
err=\"unable to reconcile Elastic IP Pool to instance \\\"i-05af21b09d3552d0f\\\" with state: pending\""
[...]
time="2024-06-28T19:16:50Z" level=debug msg="I0628 19:16:50.432073 331 eip.go:44]
\"machine is already associated with an Elastic IP with custom Public IPv4 pool\"
controller=\"awsmachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\"
controllerKind=\"AWSMachine\" AWSMachine=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt-bootstrap\" namespace=\"openshift-cluster-api-guests\"
name=\"ci-op-f04mmlsn-88881-49rgt-bootstrap\"
reconcileID=\"825b335b-569b-4a16-891d-c6b7fb5a9db6\"
machine=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt-bootstrap\"
cluster=\"openshift-cluster-api-guests/ci-op-f04mmlsn-88881-49rgt\"
eip=\"eipalloc-0678f0da2c1ecd771\" eip-address=\"157.254.217.22\"
eip-associationID=\"eipassoc-0ea51b9c68cdfb171\" eip-instance=\"i-05af21b09d3552d0f\""
This PR is ready for review. PTAL? /assign @Ankitasw @dlipovetsky cc @r4f4 @nrb
/test pull-cluster-api-provider-aws-e2e-eks
e2e EKS test is passing.
I am holding this PR to address @r4f4's feedback to decrease the warnings in the code by silently re-queuing on expected states.
/hold
Looks like all jobs are green! :) Downstream/OpenShift jobs are passing too: 1828801695790927872, 1828558006120353792
wget -O /tmp/install.log https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_installer/8676/pull-ci-openshift-installer-master-altinfra-e2e-aws-ovn-public-ipv4-pool/1828801695790927872/artifacts/e2e-aws-ovn-public-ipv4-pool/ipi-install-install/artifacts/.openshift_install-1724860949.log
$ grep -c 'Reconciling machine with custom Public IPv4 Pool' /tmp/install.log
12
$ grep -c 'Found instance in pending state while reconciling publicIpv4Pool, requeue' /tmp/install.log
2
$ grep -c 'Failed to reconcile BYO Public IPv4' /tmp/install.log
0
$ grep -c 'Unable to reconcile Elastic IP Pool for instance' /tmp/install.log
2
$ grep -c 'error checking if addresses exists for Elastic IP Pool to machine' /tmp/install.log
0
$ grep -c 'unexpected number of EIPs allocated to the role. expected 1' /tmp/install.log
0
$ grep -c 'Machine is already associated with an Elastic IP with custom Public IPv4 pool' /tmp/install.log
9
$ grep -c 'failed to reconcile Elastic IP' /tmp/install.log
0
Starting the project e2e: /test pull-cluster-api-provider-aws-e2e
LGTM
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e
still hitting infra issues/timeouts:
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks
/retest
Will re-review today.
/retest
Still seeing e2e issues that seem to be unrelated. Thanks @nrb for re-triggering. Looks like e2e-eks is passing now.
I will trigger the regular e2e again:
/test pull-cluster-api-provider-aws-e2e
Downstream openshift installer jobs are triggered too: https://github.com/openshift/installer/pull/8676#issuecomment-2318667965
/test pull-cluster-api-provider-aws-e2e
/test ?
@mtulio: The following commands are available to trigger required jobs:
- /test pull-cluster-api-provider-aws-build
- /test pull-cluster-api-provider-aws-build-docker
- /test pull-cluster-api-provider-aws-test
- /test pull-cluster-api-provider-aws-verify
The following commands are available to trigger optional jobs:
- /test pull-cluster-api-provider-aws-apidiff-main
- /test pull-cluster-api-provider-aws-e2e
- /test pull-cluster-api-provider-aws-e2e-blocking
- /test pull-cluster-api-provider-aws-e2e-clusterclass
- /test pull-cluster-api-provider-aws-e2e-conformance
- /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
- /test pull-cluster-api-provider-aws-e2e-eks
- /test pull-cluster-api-provider-aws-e2e-eks-gc
- /test pull-cluster-api-provider-aws-e2e-eks-testing
Use /test all to run the following jobs that were automatically triggered:
- pull-cluster-api-provider-aws-apidiff-main
- pull-cluster-api-provider-aws-build
- pull-cluster-api-provider-aws-build-docker
- pull-cluster-api-provider-aws-test
- pull-cluster-api-provider-aws-verify
In response to this:
/test ?
/test pull-cluster-api-provider-aws-e2e-conformance
Hi @nrb, would you mind reviewing this bug fix, please?
As you can see in the last comments, we are struggling to get the pull-cluster-api-provider-aws-e2e job to pass, but e2e-eks and e2e-conformance are passing (I'm not sure whether all of them exercise this change under the hood, but since this change touches the Machine creation flow, I think they should).
Furthermore, in downstream/OpenShift we are running several presubmit jobs (the Public IPv4 pool is the default across AWS jobs) on the PR https://github.com/openshift/installer/pull/8676 (which vendors this PR). I also introduced a new presubmit (https://github.com/openshift/release/pull/56114) that disables the pool in CAPA provisioning to exercise the non-pool flow, and it is all passing: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8676/pull-ci-openshift-installer-master-e2e-aws-ovn-public-ipv4-pool-disabled/1829621227333881856
I will trigger again, but I am also open to hearing your thoughts on another job to validate it, or whether the ones presented are enough.
Looking forward to hearing from you, thanks!
cc @r4f4 @patrickdillon
/test pull-cluster-api-provider-aws-e2e
/assign @nrb
/test pull-cluster-api-provider-aws-e2e
tl;dr: pull-cluster-api-provider-aws-e2e looks to be failing for reasons unrelated to this PR: it is trying to create a cluster with an AMI that does not exist in the test account.
After some investigation with @nrb, we are seeing that the e2e job always fails the test [It] [unmanaged] [Cluster API Framework] Clusterctl Upgrade Spec [from latest v1beta1 release to v1beta2] Should create a management cluster and then upgrade all the providers.
We are seeing in the Control Plane spec that the AMI for kube 1.25 isn't available:
- lastTransitionTime: "2024-09-02T15:00:37Z"
message: 'failed to create AWSMachine instance: failed to find ami: found no AMIs
with the name: "capa-ami-ubuntu-18.04-?1.25.0-*"'
Looks like the CAPI e2e[1] is setting the following variable:
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/abe918cfe1038c714792860b23632e38c7e0438f/test/e2e/data/e2e_conf.yaml#L207-L209
in the test spec: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/abe918cfe1038c714792860b23632e38c7e0438f/test/e2e/suites/unmanaged/unmanaged_CAPI_test.go#L112-L129
This causes the failures when looking up an AMI that does not exist in the test account (maybe it has been pruned, or 1.25 is no longer supported and no longer needed?).
[1] https://github.com/kubernetes-sigs/cluster-api/blob/main/test/e2e/clusterctl_upgrade.go#L237C33-L237C58
For reviewers: this PR is generally ready for review. Per my last comment, the failure is unrelated to this PR.
/override pull-cluster-api-provider-aws-e2e
This test is failing due to something unrelated right now.
@nrb: nrb unauthorized: /override is restricted to Repo administrators.
In response to this:
/override pull-cluster-api-provider-aws-e2e
This test is failing due to something unrelated right now.
PR https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5118 merged; PR rebased to re-test the failed upgrade test.
/test pull-cluster-api-provider-aws-e2e
I don't think the failure is related to this PR, nor the previous image issues we were seeing; the AWSMachine resources get to a Ready state and don't have errors themselves.
/retest
@mtulio: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name: pull-cluster-api-provider-aws-e2e | Commit: 4626a6a | Details: link | Required: false | Rerun command: /test pull-cluster-api-provider-aws-e2e
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
I am seeing a lot of CloudFormation activity to provision the required environment; I can't tell if it could be related.
/retest