cluster-api-provider-aws icon indicating copy to clipboard operation
cluster-api-provider-aws copied to clipboard

WIP: Do not allow overriding the `Name` tag so that the AWSMachine controller can reliably find its EC2 instance

Open mjlshen opened this issue 2 years ago • 13 comments
trafficstars

/kind bug

What this PR does / why we need it: This commit handles an edge-case where an error occurs after EC2 instance creation and before the output of a successful EC2 instance creation is returned. This causes the AWSMachine controller to attempt to find the EC2 instance by a set of filters instead of being able to explicitly find an EC2 instance by ID. Currently, the logic for finding an EC2 instance is not guaranteed to find the correct instance because one of the filters is by the Name tag, which can be overwritten by the user in the v1beta2 API.

In essence, this PR reverts #3991 because one of the assumptions in the backing issue, #3989:

As far as I can see, that tag isn't used anywhere in the codebase for matching or looking up of subnets. It seem purely for display.

Was true for subnets, but is not true for EC2 instances.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #4629

Special notes for your reviewer: I'm not sure if this is the right way to solve this problem, two other possibilities I thought of are:

  • Allowing users to override the Name tag for subnets, but not for EC2 instances
  • Making findInstances more idempotent when overriding the Name tag https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/abd444c27083d5054c772c7bf5101266da10f76f/pkg/cloud/services/ec2/instances.go#L46-L53 this could be done by adding a new/unique CAPA-specific "name" tag so that findInstances can filter on that new tag instead of a Name tag that a user can overwrite, resulting in a unique EC2 instance match.

Checklist:

  • [x] squashed commits
  • [ ] includes documentation
  • [ ] adds unit tests
  • [ ] adds or updates e2e tests

Release note:

Disallow users from setting the `Name` tag on created AWS resources

mjlshen avatar Nov 13 '23 03:11 mjlshen

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign fabriziopandini for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Nov 13 '23 03:11 k8s-ci-robot

Thanks @mjlshen! This is awesome but I don't think it handles all the edge cases where the providerID might still not be reliable. E.g. aws API eventual consistency [1], process crash right after the API call... So the tags fallback must be trustable. The name of the replicas under a scalable resource is an implementation detail and letting it being overridden is a bug. I think we should consider reverting the original change that enabled it to be overridden, or if there's still justification for the network use case which drove that change, then we need to differentiate so this is not allowed for instances.

[1] https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency

enxebre avatar Nov 13 '23 08:11 enxebre

@AndiDog my PR is essentially reverting your PR for #3989 - do you have a use-case for setting the Name tag on subnets?

mjlshen avatar Nov 14 '23 03:11 mjlshen

Users should be able to set the Name tag since like all non AWS-internal tags, it has no specific meaning apart from display purposes (e.g. AWS Console). That goes for EC2 instances, subnets and probably most other AWS resources.

If CAPA cannot find the instance correctly because it is tagged with {AWSCluster,AWSMachine}.spec.additionalTags["Name"], as expected, instead of scope.Name() (which currently returns AWSMachine.Name), then that's the bug and the find-by-name functionality needs to be fixed instead of forbidding overrides of the value. I think the tagging code is in func (r *AWSMachineReconciler) ensureTags, but I didn't read this complex function and assume it uses the additional tags map to set the Name, and does not make use of scope.Name(). If that's correct, then filter.EC2.Name(scope.Name()), is wrong and needs fixing plus a unit test to cover the case where the name tag is overridden.

AndiDog avatar Nov 15 '23 08:11 AndiDog

Ah the Name tag is getting set correctly on EC2 instances, but it's the same for every EC2 instance, so findInstances can no longer uniquely find an instance in certain circumstances.

I think this means we need to introduce a new tag to uniquely link an EC2 instance to an AWS machine CR

mjlshen avatar Nov 15 '23 15:11 mjlshen

True, that's definitely a problem since we only reconcile a single object and can't reason about others (such as "3 AWSMachine objects with same name tag").

Can you point me to the code which sets the EC2 instance name if no AdditionalTags["name"] is given?

AndiDog avatar Nov 15 '23 18:11 AndiDog

Sorry for the delayed response - yes!

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/b003e69199b3b598a1c0f26a5726466d0a331fb7/pkg/cloud/services/ec2/instances.go#L123-L131 sets the EC2 instance name on creation

Where infrav1.Build() is the same Build function to assemble tag precedence: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/abd444c27083d5054c772c7bf5101266da10f76f/api/v1beta2/tags.go#L243-L264 (where an additionally specified Name tag now has precedence over the AWSMachine .metadata.name in v1beta2 now)

mjlshen avatar Nov 19 '23 20:11 mjlshen

It would be good to get @AverageMarcus input on this as well.

richardcase avatar Nov 20 '23 08:11 richardcase

It's been quite a while since I've been deeply involved in CAPA code so I'm gonna keep this high level...

My general thought is that the users should be able to configure any of the tags they like, including Name, so that they can adhere to any possible tag policies they have in their organisation. If we're needing to use tags for any logic based decisions we should always be using CAPA-specific tag keys so instead of using the Name tag we would instead use something like sigs.k8s.io/cluster-api-provider-aws/MachineName.

Some background on the initial introduction of the configurable Name tag:

  • PR: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3991
  • Issue: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/3989

The driving force behind the initial change for us (Giant Swarm) is we wanted to allow for some complex network configurations where different machine pools could have different subnets and wanted to have control over how those subnets were named so they'd be identifiable. I forget what exactly the problem was with them all having the same name but if I remember correctly it was related to configuring transit gateway attachments and not being able to pick the correct subnet.

AverageMarcus avatar Nov 20 '23 09:11 AverageMarcus

Just in case my comment got lost on the bug report: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/4629#issuecomment-1815033857

I agree with @AverageMarcus

cnmcavoy avatar Nov 20 '23 19:11 cnmcavoy

Yeah, a new tag makes sense to me too, given that we want to be able to override a Name tag. I will update this PR to go in a new direction and introduce a new CAPA-specific tag based on AWSMachine name, thanks for all the feedback!

mjlshen avatar Nov 20 '23 20:11 mjlshen

@mjlshen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-build-docker a93e910323614f58839858a8139adbbfb739b311 link true /test pull-cluster-api-provider-aws-build-docker

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Feb 29 '24 12:02 k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 24 '24 05:04 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 23 '24 06:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 22 '24 06:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Sep 21 '24 07:09 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Sep 21 '24 07:09 k8s-ci-robot