cloud-provider-aws

Managed security group leak after annotation `service.beta.kubernetes.io/aws-load-balancer-security-groups` added to existing service

Open mtulio opened this issue 5 months ago • 8 comments

What happened:

Security group leak: when the annotation `service.beta.kubernetes.io/aws-load-balancer-security-groups` (bring-your-own security group, "BYO SG") is added to an existing Service of type LoadBalancer backed by a Classic Load Balancer (CLB), the security group originally created by the controller is detached from the CLB but never deleted.

What you expected to happen:

To be discussed with the community.

Options I can see so far:

  1. The security group created by the controller ("managed SG") is deleted after LB reconciliation attaches the BYO SG to the ELB/CLB (prior dependencies must first be resolved, such as removing the managed SG's rules from the backend nodes' security groups)
  2. The controller does not allow updating the "managed SG" when the annotation is added to an existing Service (is there a validation webhook for Service load balancers?)
  3. Any other option?
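Option 1 could be sketched roughly as follows (Go, with hypothetical function and parameter names, not the controller's actual API): when reconciling the CLB's security groups, any previously attached SG that the controller itself owns is scheduled for removal instead of being left behind.

```go
package main

import "fmt"

// reconcileSecurityGroups is a sketch of option 1: compute which SGs to
// attach to the load balancer and which previously attached SGs to clean
// up. Only SGs the controller created (ownedByCluster) are ever removed;
// user-provided (BYO) groups are never touched. All names are hypothetical.
func reconcileSecurityGroups(current, desired []string, ownedByCluster map[string]bool) (attach, remove []string) {
	want := make(map[string]bool)
	for _, sg := range desired {
		want[sg] = true
		attach = append(attach, sg)
	}
	for _, sg := range current {
		// Schedule cleanup only for controller-owned SGs that are no
		// longer desired on the load balancer.
		if !want[sg] && ownedByCluster[sg] {
			remove = append(remove, sg)
		}
	}
	return attach, remove
}

func main() {
	// IDs taken from the reproduction above.
	attach, remove := reconcileSecurityGroups(
		[]string{"sg-022da65a3d6e1d9ba"}, // managed SG currently on the CLB
		[]string{"sg-019ce8390024660f0"}, // BYO SG from the annotation
		map[string]bool{"sg-022da65a3d6e1d9ba": true},
	)
	fmt.Println("attach:", attach) // attach: [sg-019ce8390024660f0]
	fmt.Println("remove:", remove) // remove: [sg-022da65a3d6e1d9ba]
}
```

In this sketch the actual deletion would still need to wait until the SG's dependent rules (e.g. on backend node SGs) are removed, as noted in option 1.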

How to reproduce it (as minimally and precisely as possible):

# Step 1: Create CLB service
SVC_NAME=$APP_NAME_BASE-clb-sg1
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: $SVC_NAME
  namespace: ${APP_NAMESPACE}
spec:
  selector:
    app: $APP_NAME_BASE
  ports:
    - name: http80
      port: 80
      targetPort: 8080
      protocol: TCP
  type: LoadBalancer
EOF

# Step 2: Check the CLB SG
LB_DNS=$(kubectl get svc $SVC_NAME -n ${APP_NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

$ aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").SecurityGroups"
[
  "sg-022da65a3d6e1d9ba"
]

SG_ID_ORIGINAL=$(aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").SecurityGroups[0]")

# Step 3: Create a SG to be used in the BYO SG annotation
CLUSTER_ID=$(kubectl get infrastructure cluster -o jsonpath='{.status.infrastructureName}')

VPC_ID=$(aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").VPCId")

SG_NAME="${SVC_NAME}-byosg"
SG_ID=$(aws ec2 create-security-group \
--vpc-id="${VPC_ID}" \
--group-name="${SG_NAME}" \
--description="BYO SG sample for service ${SVC_NAME}" \
--tag-specifications "ResourceType=security-group,Tags=[{Key=Name,Value=${SG_NAME}},{Key=kubernetes.io/cluster/${CLUSTER_ID},Value=shared}]" \
| jq -r .GroupId)

$ echo $SG_ID
sg-019ce8390024660f0

# Step 4: Patch the service with BYO SG
kubectl patch service ${SVC_NAME} -n ${APP_NAMESPACE} --type=merge \
  --patch '{"metadata":{"annotations":{"service.beta.kubernetes.io/aws-load-balancer-security-groups":"'$SG_ID'"}}}'

# Step 5: Check SG has been added to CLB
$ kubectl get service ${SVC_NAME} -n ${APP_NAMESPACE} -o yaml | yq4 ea .metadata.annotations
service.beta.kubernetes.io/aws-load-balancer-security-groups: sg-019ce8390024660f0

$ aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").SecurityGroups"
[
  "sg-019ce8390024660f0"
]

$ aws ec2 describe-network-interfaces \
    --filters "Name=group-id,Values=$SG_ID" \
    --output json | jq -r '.NetworkInterfaces[] | "\(.NetworkInterfaceId) - \(.Description) - \(.Status)"'
eni-01667c7da28262a73 - ELB a341df09f30f94b78b5c33371eec8bac - in-use
eni-0fbdb2e1df215869c - ELB a341df09f30f94b78b5c33371eec8bac - in-use

# Step 6: Verify the original managed SG was not cleaned up (the leak)

## SG exists / not deleted
$ aws ec2 describe-security-groups --group-ids $SG_ID_ORIGINAL \
    --query 'SecurityGroups[].{"name":GroupName, "tags":Tags}' \
    --output json
[
    {
        "name": "k8s-elb-a341df09f30f94b78b5c33371eec8bac",
        "tags": [
            {
                "Key": "KubernetesCluster",
                "Value": "mrb-sg-xknls"
            },
            {
                "Key": "kubernetes.io/cluster/mrb-sg-xknls",
                "Value": "owned"
            }
        ]
    }
]

## SG not linked to any ENI

$ aws ec2 describe-network-interfaces \
--filters "Name=group-id,Values=$SG_ID_ORIGINAL" \
--output json \
| jq -r '.NetworkInterfaces[] | "\(.NetworkInterfaceId) - \(.Description) - \(.Status)"'
(empty output: the leaked SG is no longer attached to any ENI)

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.33
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/kind bug

mtulio avatar Jul 11 '25 23:07 mtulio

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 11 '25 23:07 k8s-ci-robot

cc @elmiko @joelspeed

mtulio avatar Jul 11 '25 23:07 mtulio

@mtulio i'm going to leave this with the needs-triage label for now, we will discuss this issue at the sig meeting on wednesday.

elmiko avatar Jul 14 '25 13:07 elmiko

IIUC, the ELB SG name is generated based on a hash of the service name, so shouldn't we be able to check for the existence of an SG for the service regardless of whether the annotation is present?

I guess at the moment, when the annotation is present, we skip the SG cleanup logic altogether? Perhaps that was an incorrect decision, if I am right.

JoelSpeed avatar Jul 14 '25 14:07 JoelSpeed

> IIUC, the ELB SG name is generated based on a hash of the service name, so shouldn't we be able to check for the existence of an SG for the service regardless of whether the annotation is present?
>
> I guess at the moment, when the annotation is present, we skip the SG cleanup logic altogether? Perhaps that was an incorrect decision, if I am right.

Correct, the SG is replaced in place when the incoming (BYO) SG differs from the LB's actual SG [1]. IMHO the challenge in cleaning up the "managed SG" would be identifying that the current SG was created by the controller (and not brought by the user). Could we use the cluster tag with the `owned` value on the SG for that validation?

[1] https://github.com/kubernetes/cloud-provider-aws/blob/948e8f875773f6f5da072ea7ec1fa6726e94dff1/pkg/providers/v1/aws_loadbalancer.go#L1084-L1104

mtulio avatar Jul 14 '25 15:07 mtulio

The cluster `owned` tag seems like a reasonably good signal, but wouldn't we also check the name, since that is predictable?
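Combining the two signals discussed above (the predictable `k8s-elb-` name prefix and the cluster ownership tag), a managed-SG check might look like the sketch below. This is Go with hypothetical helper names, not the controller's actual code; the real reconciliation lives in pkg/providers/v1/aws_loadbalancer.go.

```go
package main

import (
	"fmt"
	"strings"
)

// isManagedSecurityGroup reports whether an SG looks like one the legacy
// cloud provider created for a Service load balancer: its name follows the
// predictable "k8s-elb-<generated LB name>" pattern AND it carries the
// cluster ownership tag with value "owned". A BYO SG tagged "shared" by
// the user (as in the reproduction above) would not match. The helper and
// its signature are hypothetical.
func isManagedSecurityGroup(name, clusterID string, tags map[string]string) bool {
	if !strings.HasPrefix(name, "k8s-elb-") {
		return false
	}
	return tags["kubernetes.io/cluster/"+clusterID] == "owned"
}

func main() {
	// Leaked managed SG from step 6 of the reproduction.
	leaked := map[string]string{
		"KubernetesCluster":                  "mrb-sg-xknls",
		"kubernetes.io/cluster/mrb-sg-xknls": "owned",
	}
	// BYO SG created in step 3, tagged "shared" by the user.
	byo := map[string]string{
		"kubernetes.io/cluster/mrb-sg-xknls": "shared",
	}
	fmt.Println(isManagedSecurityGroup("k8s-elb-a341df09f30f94b78b5c33371eec8bac", "mrb-sg-xknls", leaked)) // true
	fmt.Println(isManagedSecurityGroup("mrb-sg-xknls-clb-sg1-byosg", "mrb-sg-xknls", byo))                 // false
}
```

Requiring both signals keeps the check conservative: a user-created SG that happens to carry the `owned` tag but not the generated name would still be left alone.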

JoelSpeed avatar Jul 15 '25 10:07 JoelSpeed

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 13 '25 11:10 k8s-triage-robot

/remove-lifecycle stale

elmiko avatar Oct 16 '25 20:10 elmiko