cluster-api-provider-aws
WIP: Do not allow overriding the `Name` tag so that the AWSMachine controller can reliably find its EC2 instance
/kind bug
What this PR does / why we need it:
This commit handles an edge case where an error occurs after EC2 instance creation but before the output of the successful creation is returned. This causes the AWSMachine controller to attempt to find the EC2 instance by a set of filters instead of being able to look it up explicitly by ID. Currently, the logic for finding an EC2 instance is not guaranteed to find the correct instance, because one of the filters is on the `Name` tag, which can be overridden by the user in the v1beta2 API.
In essence, this PR reverts #3991, because one of the assumptions in the backing issue, #3989:
> As far as I can see, that tag isn't used anywhere in the codebase for matching or looking up of subnets. It seem purely for display.

was true for subnets, but is not true for EC2 instances.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4629
Special notes for your reviewer: I'm not sure if this is the right way to solve this problem; two other possibilities I thought of are:
- Allowing users to override the `Name` tag for subnets, but not for EC2 instances
- Making `findInstances` more idempotent when overriding the `Name` tag: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/abd444c27083d5054c772c7bf5101266da10f76f/pkg/cloud/services/ec2/instances.go#L46-L53. This could be done by adding a new/unique CAPA-specific "name" tag so that `findInstances` can filter on that new tag instead of a `Name` tag that a user can overwrite, resulting in a unique EC2 instance match (sketched below)
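To illustrate the failure mode, here's a minimal, self-contained sketch, not the actual CAPA code: the `instance` struct and `findByNameTag` helper are hypothetical stand-ins for the filter-based fallback in `instances.go`. Once two machines carry the same user-set `Name` tag, the fallback lookup can no longer pick one:

```go
package main

import "fmt"

// instance is a hypothetical stand-in for an EC2 instance's tag set.
type instance struct {
	ID   string
	Tags map[string]string
}

// findByNameTag mimics the fallback lookup: filter instances by the Name tag
// instead of querying by instance ID.
func findByNameTag(instances []instance, name string) []instance {
	var matches []instance
	for _, i := range instances {
		if i.Tags["Name"] == name {
			matches = append(matches, i)
		}
	}
	return matches
}

func main() {
	// Two AWSMachines whose users both set additionalTags["Name"] = "web".
	fleet := []instance{
		{ID: "i-0aaa", Tags: map[string]string{"Name": "web"}},
		{ID: "i-0bbb", Tags: map[string]string{"Name": "web"}},
	}
	// Both instances match, so the controller cannot tell which one belongs
	// to the AWSMachine being reconciled.
	fmt.Println(len(findByNameTag(fleet, "web"))) // prints 2, not 1
}
```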
Checklist:
- [x] squashed commits
- [ ] includes documentation
- [ ] adds unit tests
- [ ] adds or updates e2e tests
Release note:
Disallow users from setting the `Name` tag on created AWS resources
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: *(no approvals yet)*. Once this PR has been reviewed and has the lgtm label, please assign fabriziopandini for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment
Thanks @mjlshen! This is awesome, but I don't think it handles all the edge cases where the providerID might still not be reliable, e.g. AWS API eventual consistency [1], or a process crash right after the API call. So the tags fallback must be trustworthy. The name of the replicas under a scalable resource is an implementation detail, and letting it be overridden is a bug. I think we should consider reverting the original change that enabled it to be overridden, or, if there's still justification for the network use case which drove that change, then we need to differentiate so this is not allowed for instances.
[1] https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency
@AndiDog my PR is essentially reverting your PR for #3989. Do you have a use case for setting the `Name` tag on subnets?
Users should be able to set the `Name` tag since, like all non-AWS-internal tags, it has no specific meaning apart from display purposes (e.g. the AWS Console). That goes for EC2 instances, subnets, and probably most other AWS resources.
If CAPA cannot find the instance correctly because it is tagged with `{AWSCluster,AWSMachine}.spec.additionalTags["Name"]`, as expected, instead of `scope.Name()` (which currently returns `AWSMachine.Name`), then that's the bug, and the find-by-name functionality needs to be fixed instead of forbidding overrides of the value. I think the tagging code is in `func (r *AWSMachineReconciler) ensureTags`, but I didn't read this complex function and assume it uses the additional tags map to set the Name and does not make use of `scope.Name()`. If that's correct, then `filter.EC2.Name(scope.Name())` is wrong and needs fixing, plus a unit test to cover the case where the name tag is overridden.
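To make the suspected mismatch concrete, here's a hedged, self-contained sketch; `applyTags` and `findInstanceByNameFilter` are hypothetical stand-ins for CAPA's `ensureTags` and `filter.EC2.Name`, not the real code. If tagging lets the additional-tags map win while the lookup filters on `scope.Name()`, the instance is simply never found:

```go
package main

import "fmt"

// instance is a hypothetical stand-in for an EC2 instance's tag set.
type instance struct {
	ID   string
	Tags map[string]string
}

// applyTags mimics tagging where user-supplied additional tags win over the
// default Name derived from the AWSMachine.
func applyTags(i *instance, machineName string, additional map[string]string) {
	i.Tags = map[string]string{"Name": machineName}
	for k, v := range additional {
		i.Tags[k] = v
	}
}

// findInstanceByNameFilter mimics a lookup that filters on scope.Name(), as
// filter.EC2.Name(scope.Name()) is suspected to do.
func findInstanceByNameFilter(instances []instance, machineName string) *instance {
	for idx := range instances {
		if instances[idx].Tags["Name"] == machineName {
			return &instances[idx]
		}
	}
	return nil
}

func main() {
	i := instance{ID: "i-0aaa"}
	applyTags(&i, "my-machine-abc123", map[string]string{"Name": "web"})
	// The filter looks for Name == "my-machine-abc123", but the tag is "web".
	fmt.Println(findInstanceByNameFilter([]instance{i}, "my-machine-abc123")) // <nil>
}
```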
Ah, the `Name` tag is getting set correctly on EC2 instances, but it's the same for every EC2 instance, so `findInstances` can no longer uniquely find an instance in certain circumstances.
I think this means we need to introduce a new tag to uniquely link an EC2 instance to an AWSMachine CR.
True, that's definitely a problem since we only reconcile a single object and can't reason about others (such as "3 AWSMachine objects with same name tag").
Can you point me to the code which sets the EC2 instance name if no `AdditionalTags["name"]` is given?
Sorry for the delayed response - yes!
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/b003e69199b3b598a1c0f26a5726466d0a331fb7/pkg/cloud/services/ec2/instances.go#L123-L131 sets the EC2 instance name on creation
Where `infrav1.Build()` is the same `Build` function that assembles tag precedence: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/abd444c27083d5054c772c7bf5101266da10f76f/api/v1beta2/tags.go#L243-L264 (in v1beta2, an additionally specified `Name` tag now takes precedence over the AWSMachine's `.metadata.name`).
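To make that precedence concrete, here's a hedged sketch of the behavior; `buildTags` is a hypothetical stand-in for `infrav1.Build`, not its real signature:

```go
package main

import "fmt"

// buildTags merges the default Name (from the AWSMachine's .metadata.name)
// with user-supplied additional tags, applying the additional tags last.
func buildTags(machineName string, additional map[string]string) map[string]string {
	tags := map[string]string{"Name": machineName}
	for k, v := range additional {
		tags[k] = v // additional tags win, so a user-set "Name" replaces the default
	}
	return tags
}

func main() {
	got := buildTags("my-machine-abc123", map[string]string{"Name": "web"})
	fmt.Println(got["Name"]) // prints "web": the machine name is silently replaced
}
```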
It would be good to get @AverageMarcus's input on this as well.
It's been quite a while since I've been deeply involved in CAPA code so I'm gonna keep this high level...
My general thought is that users should be able to configure any of the tags they like, including `Name`, so that they can adhere to any possible tag policies they have in their organisation. If we need to use tags for any logic-based decisions, we should always be using CAPA-specific tag keys: instead of the `Name` tag, we would use something like `sigs.k8s.io/cluster-api-provider-aws/MachineName`.
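As a rough sketch of that idea (the tag key and `findByMachineTag` helper below are illustrative, not CAPA's actual constants or code), keying the lookup on a CAPA-owned tag leaves `Name` free for users while keeping the match unique:

```go
package main

import "fmt"

// machineNameTagKey is a hypothetical CAPA-owned tag key; the real tag
// constants live in api/v1beta2/tags.go.
const machineNameTagKey = "sigs.k8s.io/cluster-api-provider-aws/MachineName"

type instance struct {
	ID   string
	Tags map[string]string
}

// findByMachineTag filters on the CAPA-specific tag, so a user-overridden
// Name tag no longer affects the lookup.
func findByMachineTag(instances []instance, machineName string) []instance {
	var matches []instance
	for _, i := range instances {
		if i.Tags[machineNameTagKey] == machineName {
			matches = append(matches, i)
		}
	}
	return matches
}

func main() {
	fleet := []instance{
		{ID: "i-0aaa", Tags: map[string]string{"Name": "web", machineNameTagKey: "md-0-abc"}},
		{ID: "i-0bbb", Tags: map[string]string{"Name": "web", machineNameTagKey: "md-0-def"}},
	}
	fmt.Println(findByMachineTag(fleet, "md-0-abc")[0].ID) // prints i-0aaa: unique match
}
```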
Some background on the initial introduction of the configurable Name tag:
- PR: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3991
- Issue: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/3989
The driving force behind the initial change for us (Giant Swarm) was that we wanted to allow for some complex network configurations where different machine pools could have different subnets, and we wanted control over how those subnets were named so they'd be identifiable. I forget exactly what the problem was with them all having the same name, but if I remember correctly it was related to configuring transit gateway attachments and not being able to pick the correct subnet.
Just in case my comment got lost on the bug report: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/4629#issuecomment-1815033857
I agree with @AverageMarcus
Yeah, a new tag makes sense to me too, given that we want to be able to override a `Name` tag. I will update this PR to go in a new direction and introduce a new CAPA-specific tag based on the AWSMachine name. Thanks for all the feedback!
@mjlshen: The following test failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cluster-api-provider-aws-build-docker | a93e910323614f58839858a8139adbbfb739b311 | link | true | /test pull-cluster-api-provider-aws-build-docker |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed
You can:
- Mark this PR as fresh with `/remove-lifecycle stale`
- Close this PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed
You can:
- Mark this PR as fresh with `/remove-lifecycle rotten`
- Close this PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed
You can:
- Reopen this PR with `/reopen`
- Mark this PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.