Managed security group leak after the annotation service.beta.kubernetes.io/aws-load-balancer-security-groups is added to an existing service
What happened:
Security group leak when the annotation service.beta.kubernetes.io/aws-load-balancer-security-groups (BYO SG) is added to an existing Service of type LoadBalancer (CLB): the BYO SG replaces the controller-created SG on the load balancer, but that original SG is never deleted.
What you expected to happen:
To be discussed with the community.
Options I can see so far:
- The security group created by the controller ("managed SG") is removed after the LB reconciliation that adds the BYO SG to the ELB/CLB (prior dependencies must be resolved first, such as removing the rules that reference the managed SG from the backend SGs); see the sketch after this list
- The controller does not allow updating the "managed SG" when the annotation is added to an existing Service (is there a validation webhook for the Service LoadBalancer?)
- Any other options?
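For illustration only, here is a minimal sketch of what the first option could look like once the BYO SG has replaced the controller-created SG on the ELB. This is not the controller's actual code: the helper name cleanupManagedSecurityGroup and the tag-based ownership check are assumptions of mine, written against aws-sdk-go v1.

```go
// Hypothetical sketch of option 1: after the BYO SG from the annotation has been
// attached to the ELB, delete the SG the controller originally created, but only
// if it is recognizably "managed" (cluster ownership tag set to "owned").
// Real code would first revoke the ingress rules on the backend/node SGs that
// reference this group, as the controller already does on Service deletion.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func cleanupManagedSecurityGroup(svc *ec2.EC2, oldSGID, clusterID string) error {
	out, err := svc.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
		GroupIds: []*string{aws.String(oldSGID)},
	})
	if err != nil || len(out.SecurityGroups) == 0 {
		return err
	}
	owned := false
	for _, tag := range out.SecurityGroups[0].Tags {
		if aws.StringValue(tag.Key) == "kubernetes.io/cluster/"+clusterID &&
			aws.StringValue(tag.Value) == "owned" {
			owned = true
		}
	}
	if !owned {
		// The previous SG was user-provided (BYO): leave it alone.
		return nil
	}
	// TODO: revoke rules in other SGs that still reference oldSGID before deleting.
	_, err = svc.DeleteSecurityGroup(&ec2.DeleteSecurityGroupInput{
		GroupId: aws.String(oldSGID),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	// The IDs below are the ones from the reproduction steps, used as placeholders.
	if err := cleanupManagedSecurityGroup(ec2.New(sess), "sg-022da65a3d6e1d9ba", "mrb-sg-xknls"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("managed SG cleaned up")
}
```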
How to reproduce it (as minimally and precisely as possible):
# Step 1: Create CLB service
SVC_NAME=$APP_NAME_BASE-clb-sg1
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: $SVC_NAME
  namespace: ${APP_NAMESPACE}
spec:
  selector:
    app: $APP_NAME_BASE
  ports:
  - name: http80
    port: 80
    targetPort: 8080
    protocol: TCP
  type: LoadBalancer
EOF
# Step 2: Check the CLB SG
LB_DNS=$(kubectl get svc $SVC_NAME -n ${APP_NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
$ aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").SecurityGroups"
[
"sg-022da65a3d6e1d9ba"
]
SG_ID_ORIGINAL=$(aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").SecurityGroups[0]")
# Step 3: Create a SG to be added to the BYO SG annotation
CLUSTER_ID=$(kubectl get infrastructure cluster -o jsonpath='{.status.infrastructureName}')
VPC_ID=$(aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").VPCId")
SG_NAME="${SVC_NAME}-byosg"
SG_ID=$(aws ec2 create-security-group \
--vpc-id="${VPC_ID}" \
--group-name="${SG_NAME}" \
--description="BYO SG sample for service ${SVC_NAME}" \
--tag-specifications "ResourceType=security-group,Tags=[{Key=Name,Value=${SG_NAME}},{Key=kubernetes.io/cluster/${CLUSTER_ID},Value=shared}]" \
| tee -a | jq -r .GroupId)
$ echo $SG_ID
sg-019ce8390024660f0
# Step 4: Patch the service with BYO SG
kubectl patch service ${SVC_NAME} -n ${APP_NAMESPACE} --type=merge \
--patch '{"metadata":{"annotations":{"service.beta.kubernetes.io/aws-load-balancer-security-groups":"'$SG_ID'"}}}'
# Step 5: Check SG has been added to CLB
$ kubectl get service ${SVC_NAME} -n ${APP_NAMESPACE} -o yaml | yq4 ea .metadata.annotations
service.beta.kubernetes.io/aws-load-balancer-security-groups: sg-019ce8390024660f0
$ aws elb describe-load-balancers | jq -r ".LoadBalancerDescriptions[] | select(.DNSName==\"$LB_DNS\").SecurityGroups"
[
"sg-019ce8390024660f0"
]
$ aws ec2 describe-network-interfaces \
--filters "Name=group-id,Values=$SG_ID" \
--output json | jq -r '.NetworkInterfaces[] | "\(.NetworkInterfaceId) - \(.Description) - \(.Status)"'
eni-01667c7da28262a73 - ELB a341df09f30f94b78b5c33371eec8bac - in-use
eni-0fbdb2e1df215869c - ELB a341df09f30f94b78b5c33371eec8bac - in-use
# Step 6: Check that the original SG has not been cleaned up (the leak)
## SG exists / not deleted
$ aws ec2 describe-security-groups --group-ids $SG_ID_ORIGINAL \
--query 'SecurityGroups[].{"name":GroupName, "tags":Tags}' \
--output json
[
{
"name": "k8s-elb-a341df09f30f94b78b5c33371eec8bac",
"tags": [
{
"Key": "KubernetesCluster",
"Value": "mrb-sg-xknls"
},
{
"Key": "kubernetes.io/cluster/mrb-sg-xknls",
"Value": "owned"
}
]
}
]
## SG not linked to any ENI
$ aws ec2 describe-network-interfaces \
--filters "Name=group-id,Values=$SG_ID_ORIGINAL" \
--output json \
| jq -r '.NetworkInterfaces[] | "\(.NetworkInterfaceId) - \(.Description) - \(.Status)"'
<empty>
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.33
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
/kind bug
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
cc @elmiko @joelspeed
@mtulio I'm going to leave this with the needs-triage label for now; we will discuss this issue at the SIG meeting on Wednesday.
IIUC, the ELB SG name is generated based on a hash of the name of the service; therefore, shouldn't we be able to check for the existence of an SG for the service independent of whether the annotation is present?
I guess at the moment, when the annotation is present, we skip the SG cleanup logic altogether? Perhaps that was an incorrect decision, if I am right.
> IIUC, the ELB SG name is generated based on a hash of the name of the service; therefore, shouldn't we be able to check for the existence of an SG for the service independent of whether the annotation is present?
> I guess at the moment, when the annotation is present, we skip the SG cleanup logic altogether? Perhaps that was an incorrect decision, if I am right.
Correct, the SG is replaced in place when the incoming (BYO) SG differs from the LB's actual SG [1]. IMHO the challenge in cleaning up the "managed SG" would be identifying that the actual SG was created by the controller (and not a BYO SG). I was thinking we could use the cluster tag with the owned value on the SG for that validation?
[1] https://github.com/kubernetes/cloud-provider-aws/blob/948e8f875773f6f5da072ea7ec1fa6726e94dff1/pkg/providers/v1/aws_loadbalancer.go#L1084-L1104
Cluster owned tag seems like a reasonably good assumption, but wouldn't we also check the name since that is predictable?
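A rough sketch of that combined check (a hypothetical helper, not existing controller code): treat an SG as controller-managed only when it has the generated k8s-elb- name prefix, as seen on the leaked SG in Step 6, and carries the kubernetes.io/cluster/<cluster-id> tag with value owned.

```go
// Hypothetical predicate combining both suggestions: the predictable generated
// name ("k8s-elb-...", as shown in Step 6) plus the cluster ownership tag.
package main

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func isControllerManagedSG(sg *ec2.SecurityGroup, clusterID string) bool {
	if !strings.HasPrefix(aws.StringValue(sg.GroupName), "k8s-elb-") {
		return false
	}
	for _, tag := range sg.Tags {
		if aws.StringValue(tag.Key) == "kubernetes.io/cluster/"+clusterID &&
			aws.StringValue(tag.Value) == "owned" {
			return true
		}
	}
	return false
}

func main() {
	// Values copied from the reproduction output above.
	sg := &ec2.SecurityGroup{
		GroupName: aws.String("k8s-elb-a341df09f30f94b78b5c33371eec8bac"),
		Tags: []*ec2.Tag{
			{Key: aws.String("kubernetes.io/cluster/mrb-sg-xknls"), Value: aws.String("owned")},
		},
	}
	fmt.Println(isControllerManagedSG(sg, "mrb-sg-xknls")) // true -> safe to clean up
}
```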
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale