cluster-api-provider-aws icon indicating copy to clipboard operation
cluster-api-provider-aws copied to clipboard

CAPA applies AWS cloud-controller-manager tag to user-owned subnet, but does not delete the tag on cluster delete, leaving the subnet unusable by future clusters

Open dlipovetsky opened this issue 2 years ago • 19 comments

/kind bug

What steps did you take and what happened: The cloud-controller-manager (CCM) can only use tagged resources; if they're missing, it will not create the ELB to back the LoadBalancer-type Service.

The tag, which I'll refer to as the "CCM tag," is: kubernetes.io/cluster/<name>: [owned|shared]. Importantly, a resource may have exactly one such tag. Otherwise, the CCM will return an error.

CAPA applies the CCM tag to various resources that it owns. That's fine. However, with some recent changes on the main branch, CAPA applies this tag to some resources that the user owns. These resources outlive the cluster, and CAPA does not remove the CCM tag on cluster delete. Until the CCM tag is removed from them, these resources can't be used for other clusters. That's a problem.

Part of the reason CAPA does not remove the CCM tag from these resources is because it does not know whether it applied the CCM tag, or whether the user applied it themselves. Without storing some state (e.g. using yet another tag), CAPA cannot know this.

What did you expect to happen: CAPA should not apply the CCM tag to user-owned resources, or it should delete the CCM tag when reconciling the cluster delete.

Anything else you would like to add: CAPA begin to apply the CCM tag to a user-owned subnet in https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/2715.

Environment:

  • Cluster-api-provider-aws version: v1.5.1

https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/3854#tasklist-block-40624e9e-2c5e-4a87-bfe9-1b0cefc35b22

dlipovetsky avatar Nov 15 '22 20:11 dlipovetsky

Fixing this bug means changing tagging behavior that's been present since v1.0. The possible solutions:

(a) CAPA stops applying the CCM tag to user-owned subnets (b) CAPA applies (on cluster create) and removes (on cluster delete) the CCM tag to user-owned subnets, security groups, and route tables.

If we cannot do either (a) or (b) in a minor release, then I think we need to address this issue before the v2.0.0.

dlipovetsky avatar Nov 15 '22 20:11 dlipovetsky

Thanks for creating this for discussion @dlipovetsky. My thoughts are:

(a) we can't stop applying the tag to user-owned subnets because this was a specific feature request (#2584, from @scottslowe). Reading the original request, this seems like a reasonable feature to have imho. (b) If we add the CCM tag to anything, then we should be removing the added tag when we delete the cluster. And as you've pointed out, we'd need to know that CAPA added the tag and not the user (by using another tag?).

I think that (b) is the way to go out of the 2 options.

If we cannot do either (a) or (b) in a minor release, then I think we need to address this issue before the v2.0.0.

If we go with either option, then we are not changing the API and so from that point of view changing in a minor version is OK. The question is then are we changing the behaviour in a non-backwards compatible way? I think as long as we have a way to determine that we added the CCM tag and only act on delete when we see we added the tag, then i think the behaviour stays the same for existing clusters but changes for newly created clusters. If we then start adding/deleting ccm tags on sg and route tables this is new behaviour which we add in minor versions.

That's a long way of saying that I think we can do (b) in a minor version bump, so v2.1.0

What does everyone else think?

richardcase avatar Nov 16 '22 08:11 richardcase

And from the discussions on slack (https://kubernetes.slack.com/archives/CD6U2V71N/p1668469376275609 and https://kubernetes.slack.com/archives/CD6U2V71N/p1668587787725479):

  • should we even be adding the tag to BYO infra? (@MarcusNoble)

If we reverted the behaviour of adding the tags to the subnets then this is a change in behaviour that was requested in #2584.

richardcase avatar Nov 16 '22 09:11 richardcase

If we did want to remove the tagging of user managed resources we could introduce this via a feature flag?

richardcase avatar Nov 16 '22 09:11 richardcase

Even I am not in favor of option (a) because of the same reason listed by @richardcase. So option (b) seems to be the way forward to me as well.

should we even be adding the tag to BYO infra? (@MarcusNoble)

Since this is a supported behavior, IMO we should not change it at this point. And we have seen users benefitting from this, so -1 from me.

Ankitasw avatar Nov 16 '22 09:11 Ankitasw

I was unaware of the previous user request to support this. While I’d still prefer if user provided infra was wholly set up by the user, I can see it is beneficial to others already so I agree to go with option B.

AverageMarcus avatar Nov 16 '22 10:11 AverageMarcus

@MarcusNoble @Ankitasw - if we had an option (c) which was to gather details of usage of this current behaviour from people like @scottslowe and then consider option (a) in the future - if we give notice of deprecation of this behaviour to give time for users to change their usage and update any automation. Would this change your thoughts?

richardcase avatar Nov 16 '22 10:11 richardcase

Just catching up on the issue.

Regarding the PR that added CCM tags to BYOI subnets, I implemented them per a user request and it was only done for subnets per doc, meaning there were no mention of security groups or other resources.

Agree with @dlipovetsky - we should have deleted the tags CAPA added to the subnets - it was one of my early PRs in CAPA so haven't thought about it too carefully.

Now, regarding options suggested

(a) CAPA stops applying the CCM tag to user-owned subnets (b) CAPA applies (on cluster create) and removes (on cluster delete) the CCM tag to user-owned subnets, security groups, and route tables.

I am ok with either options as long as the pattern of applying/removing CCM tags are consistent across resources. The only thing is that I know some of our customers rely on (a) right now so we will need to give advance warning/time for them.

pydctw avatar Nov 16 '22 16:11 pydctw

catching up on the thread, I think (b) is reasonable, I don't think we can consider it as an API breaking change so I think 2.1.0 would make sense

yastij avatar Nov 16 '22 16:11 yastij

Also just catching up (I'm on holiday in Spain): Since my request seems to be the culprit that started a portion of this, I figured I should weigh in. 😄

The use case for me was (and is) that I use an infrastructure-as-code tool to stand up a set of shared infrastructure (VPC, subnets, route tables, gateways, etc.) that are then used/leveraged by multiple workload clusters. It's possible that I could modify my IaC code to better handle the tags, but it would be far better if CAPA could manage the tags as part of cluster lifecycle. For me, therefore, option B (removing tags on cluster delete) would be preferable/ideal.

I'm happy to provide additional information if needed.

scottslowe avatar Nov 16 '22 17:11 scottslowe

Thanks all for your thoughtful comments!

It sounds like we agree to go with

(b) CAPA applies (on cluster create) and removes (on cluster delete) the CCM tag to user-owned subnets, security groups, and route tables.

And we agree that the remaining work () can be addressed in a 2.x minor release.

  1. Apply the CCM tag (kubernetes.io/cluster/<name>: shared) to user-owned security groups, route tables, and any other resources that the CCM might use.
  2. Remove the CCM tag from user-owned resources when reconciling cluster delete.

I will create GH issues for these.

Update: I created #3860 for route tables, and amended #3481 for security groups.

dlipovetsky avatar Nov 16 '22 17:11 dlipovetsky

I’d still prefer if user provided infra was wholly set up by the user

@AverageMarcus I recognize your concern. But I'm not sure that's what users expect. For example, for every LoadBalancer-type Service, the CCM itself creates a rule in the user-owned security group. Users don't expect to create these rules on behalf of the CCM.

Moreover, without the CCM tag applied to user-owned resources, the CCM will fail to function. Since the CCM tag is specific to the cluster, it should be applied on cluster create, and removed on cluster delete. While we can make the user responsible for these two actions, it seems like CAPA is in a good position to make them on behalf of the user.

I would like to rely on the "principle of least surprise" here: Let's document why CAPA applies the CCM tag, namely, to ensure CCM can do its job with user-owned resources, and why CAPA removes this tag on cluster delete.

dlipovetsky avatar Nov 16 '22 18:11 dlipovetsky

/triage accepted /priority important-longterm

dlipovetsky avatar Nov 17 '22 20:11 dlipovetsky

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Jan 19 '24 07:01 k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 18 '24 08:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar May 18 '24 08:05 k8s-triage-robot

/remove-lifecycle rotten

scottslowe avatar May 18 '24 11:05 scottslowe

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 16 '24 12:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 15 '24 13:09 k8s-triage-robot