cluster-api-provider-aws icon indicating copy to clipboard operation
cluster-api-provider-aws copied to clipboard

AWSMachine controller leaks EC2 instances when specifying a Name tag in `additionalTags`

Open mjlshen opened this issue 1 year ago • 8 comments

/kind bug

What steps did you take and what happened: After applying two infrastructure.cluster.x-k8s.io/v1beta2 AWSMachines with the following additionalTags:

spec:
  additionalTags:
    Name: sample-name-tag

124 EC2 instances were created in quick succession for these AWSMachines before .spec.providerID was populated. All of the EC2 instances (leaked and not-leaked) do have the tags Name: sample-name-tag

❯ k get awsmachine                                                  
NAME                                    CLUSTER                            STATE     READY   INSTANCEID                              MACHINE
sample-name-tag-251d8688-cjd7r   ${ID}   running   true    aws:///us-east-2a/i-019c51ee0ded3e51a   sample-name-tag-workers-55b788fd4dx5n4zb-kn2h6
sample-name-tag-251d8688-hwhb4   ${ID}   running   true    aws:///us-east-2a/i-0c9169fb1781d2a8e   sample-name-tag-workers-55b788fd4dx5n4zb-8brbs

What did you expect to happen: One EC2 instance to be created per AWSMachine, with the tag Name: sample-name-tag

Anything else you would like to add: I think this occurs because:

  1. We try to find an instance during a normal reconcile loop - https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/56c9a39dd834640ee4a027e679ad2a5757098dfd/controllers/awsmachine_controller.go#L483-L489
  2. We look for the instance with these tags, including the name of the AWSMachine https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/abd444c27083d5054c772c7bf5101266da10f76f/pkg/cloud/services/ec2/instances.go#L46-L53
  3. If instance == nil, then a new EC2 instance is created https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/56c9a39dd834640ee4a027e679ad2a5757098dfd/controllers/awsmachine_controller.go#L517C29-L522
  4. However, this only happens when the instance cannot be found by tags (or an AWS not found/permission error). So the bug is with a change in v1beta2 to allow overwriting the Name tag: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/api/v1beta2/tags.go#L244-L264

vs v1beta1 https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/abd444c27083d5054c772c7bf5101266da10f76f/api/v1beta1/tags.go#L250-L269 where the AWSMachine's own .metadata.name had precedence

So I think we need to either not allow overwriting the Name tag or we need to be smarter about how we find the instance based off of tags. Even if we update the filtering logic to account for the additional Name tag, we don't want to find an instance that's already been claimed by another AWSMachine.

Environment:

  • Cluster-api-provider-aws version: v2.0.2
  • Kubernetes version: (use kubectl version):
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.6+f67aeb3
  • OS (e.g. from /etc/os-release):

mjlshen avatar Nov 10 '23 17:11 mjlshen