Managed security group leak after the annotation service.beta.kubernetes.io/aws-load-balancer-security-groups is added to an existing service
What happened:
Security group leak when the annotation service.beta.kubernetes.io/aws-load-balancer-security-groups (BYO SG) is added to an existing Service of type LoadBalancer (CLB): the BYO SG replaces the controller-created SG on the load balancer, but that original SG is never deleted.
What you expected to happen:
To be discussed with the community.
Options I can see so far:
- The security group created by the controller ("managed SG") is removed after the LB reconciliation that adds the BYO SG to the ELB/CLB (prior dependencies must be resolved first, such as removing the rules that reference the managed SG from the backend SGs); see the sketch after this list
- The controller does not allow updating the "managed SG" when the annotation is added to an existing Service (is there a validation webhook for the Service LoadBalancer?)
- Any other options?
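For illustration only, here is a minimal sketch of what the first option could look like once the BYO SG has replaced the controller-created SG on the ELB. This is not the controller's actual code: the helper name cleanupManagedSecurityGroup and the tag-based ownership check are assumptions of mine, written against aws-sdk-go v1.

```go
// Hypothetical sketch of option 1: after the BYO SG from the annotation has been
// attached to the ELB, delete the SG the controller originally created, but only
// if it is recognizably "managed" (cluster ownership tag set to "owned").
// Real code would first revoke the ingress rules on the backend/node SGs that
// reference this group, as the controller already does on Service deletion.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func cleanupManagedSecurityGroup(svc *ec2.EC2, oldSGID, clusterID string) error {
	out, err := svc.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
		GroupIds: []*string{aws.String(oldSGID)},
	})
	if err != nil || len(out.SecurityGroups) == 0 {
		return err
	}
	owned := false
	for _, tag := range out.SecurityGroups[0].Tags {
		if aws.StringValue(tag.Key) == "kubernetes.io/cluster/"+clusterID &&
			aws.StringValue(tag.Value) == "owned" {
			owned = true
		}
	}
	if !owned {
		// The previous SG was user-provided (BYO): leave it alone.
		return nil
	}
	// TODO: revoke rules in other SGs that still reference oldSGID before deleting.
	_, err = svc.DeleteSecurityGroup(&ec2.DeleteSecurityGroupInput{
		GroupId: aws.String(oldSGID),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	// The IDs below are the ones from the reproduction steps, used as placeholders.
	if err := cleanupManagedSecurityGroup(ec2.New(sess), "sg-022da65a3d6e1d9ba", "mrb-sg-xknls"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("managed SG cleaned up")
}
```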
How to reproduce it (as minimally and precisely as possible):
# Step 1: Create CLB service
SVC_NAME=$APP_NAME_BASE-clb-sg1
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: $SVC_NAME
  namespace: ${APP_NAMESPACE}
spec:
  selector:
    app: $APP_NAME_BASE
  ports:
  - name: http80
    port: 80
    targetPort: 8080
    protocol: TCP
  type: LoadBalancer
EOF
# Step 2: Check the CLB SG
LB_DNS=$(kubectl get svc $SVC_NAME -n ${APP_NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
$ aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").SecurityGroups"
[
"sg-022da65a3d6e1d9ba"
]
SG_ID_ORIGINAL=$(aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").SecurityGroups[0]")
# Step 3: Create a SG to be added to the BYO SG annotation
CLUSTER_ID=$(kubectl get infrastructure cluster -o jsonpath='{.status.infrastructureName}')
VPC_ID=$(aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").VPCId")
SG_NAME="${SVC_NAME}-byosg"
SG_ID=$(aws ec2 create-security-group \
--vpc-id="${VPC_ID}" \
--group-name="${SG_NAME}" \
--description="BYO SG sample for service ${SVC_NAME}" \
--tag-specifications "ResourceType=security-group,Tags=[{Key=Name,Value=${SG_NAME}},{Key=kubernetes.io/cluster/${CLUSTER_ID},Value=shared}]" \
| tee -a | jq -r .GroupId)
$ echo $SG_ID
sg-019ce8390024660f0
# Step 4: Patch the service with BYO SG
kubectl patch service ${SVC_NAME} -n ${APP_NAMESPACE} --type=merge \
--patch '{"metadata":{"annotations":{"service.beta.kubernetes.io/aws-load-balancer-security-groups":"'$SG_ID'"}}}'
# Step 5: Check SG has been added to CLB
$ kubectl get service ${SVC_NAME} -n ${APP_NAMESPACE} -o yaml | yq4 ea .metadata.annotations
service.beta.kubernetes.io/aws-load-balancer-security-groups: sg-019ce8390024660f0
$ aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").SecurityGroups"
[
"sg-019ce8390024660f0"
]
$ aws ec2 describe-network-interfaces \
--filters "Name=group-id,Values=$SG_ID" \
--output json | jq -r '.NetworkInterfaces[] | "\(.NetworkInterfaceId) - \(.Description) - \(.Status)"'
eni-01667c7da28262a73 - ELB a341df09f30f94b78b5c33371eec8bac - in-use
eni-0fbdb2e1df215869c - ELB a341df09f30f94b78b5c33371eec8bac - in-use
# Step 6: Check that the original SG has not been cleaned up (the leak)
## SG exists / not deleted
$ aws ec2 describe-security-groups --group-ids $SG_ID_ORIGINAL \
--query 'SecurityGroups[].{"name":GroupName, "tags":Tags}' \
--output json
[
{
"name": "k8s-elb-a341df09f30f94b78b5c33371eec8bac",
"tags": [
{
"Key": "KubernetesCluster",
"Value": "mrb-sg-xknls"
},
{
"Key": "kubernetes.io/cluster/mrb-sg-xknls",
"Value": "owned"
}
]
}
]
## SG not linked to any ENI
$ aws ec2 describe-network-interfaces \
--filters "Name=group-id,Values=$SG_ID_ORIGINAL" \
--output json \
| jq -r '.NetworkInterfaces[] | "\(.NetworkInterfaceId) - \(.Description) - \(.Status)"'
<empty>
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.33
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
/kind bug
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
cc @elmiko @joelspeed
@mtulio I'm going to leave this with the needs-triage label for now; we will discuss this issue at the SIG meeting on Wednesday.
IIUC, the ELB SG name is generated based on a hash of the name of the service; therefore, shouldn't we be able to check for the existence of an SG for the service independent of whether the annotation is present?
I guess at the moment, when the annotation is present, we skip the SG cleanup logic altogether? Perhaps that was an incorrect decision, if I am right.
> IIUC, the ELB SG name is generated based on a hash of the name of the service; therefore, shouldn't we be able to check for the existence of an SG for the service independent of whether the annotation is present?
> I guess at the moment, when the annotation is present, we skip the SG cleanup logic altogether? Perhaps that was an incorrect decision, if I am right.
Correct, the SG is replaced in place when the incoming (BYO) SG differs from the LB's actual SG [1]. IMHO the challenge in cleaning up the "managed SG" would be identifying that the actual SG was created by the controller (and not a BYO SG). I was thinking we could use the cluster tag with the owned value on the SG for that validation?
[1] https://github.com/kubernetes/cloud-provider-aws/blob/948e8f875773f6f5da072ea7ec1fa6726e94dff1/pkg/providers/v1/aws_loadbalancer.go#L1084-L1104
Cluster owned tag seems like a reasonably good assumption, but wouldn't we also check the name since that is predictable?
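A rough sketch of that combined check (a hypothetical helper, not existing controller code): treat an SG as controller-managed only when it has the generated k8s-elb- name prefix, as seen on the leaked SG in Step 6, and carries the kubernetes.io/cluster/<cluster-id> tag with value owned.

```go
// Hypothetical predicate combining both suggestions: the predictable generated
// name ("k8s-elb-...", as shown in Step 6) plus the cluster ownership tag.
package main

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func isControllerManagedSG(sg *ec2.SecurityGroup, clusterID string) bool {
	if !strings.HasPrefix(aws.StringValue(sg.GroupName), "k8s-elb-") {
		return false
	}
	for _, tag := range sg.Tags {
		if aws.StringValue(tag.Key) == "kubernetes.io/cluster/"+clusterID &&
			aws.StringValue(tag.Value) == "owned" {
			return true
		}
	}
	return false
}

func main() {
	// Values copied from the reproduction output above.
	sg := &ec2.SecurityGroup{
		GroupName: aws.String("k8s-elb-a341df09f30f94b78b5c33371eec8bac"),
		Tags: []*ec2.Tag{
			{Key: aws.String("kubernetes.io/cluster/mrb-sg-xknls"), Value: aws.String("owned")},
		},
	}
	fmt.Println(isControllerManagedSG(sg, "mrb-sg-xknls")) // true -> safe to clean up
}
```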
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale