aws-load-balancer-controller
Load balancers getting deleted randomly on deletion of ingress records with a completely different group name
Describe the bug
This has happened twice now, and the sequence of events was the same both times.
We had a few Helm releases in namespace=namespace1 in which we were creating ingress records. The group name attached to those ingress records was group1.
There were a few other Helm releases in namespace=namespace2 in which we were creating ingress records. The group name attached to these ingress records was group2.
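For context, a minimal sketch of how the two groups were assigned (placeholder names, hosts, and services, not our actual manifests):

```yaml
# Minimal sketch of the setup (placeholder names/hosts): each ingress joins an
# ALB "ingress group" via the group.name annotation; all members of a group
# share one load balancer managed by the controller.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-a
  namespace: namespace1
  annotations:
    alb.ingress.kubernetes.io/group.name: group1   # group1 -> its own ALB
spec:
  ingressClassName: alb
  rules:
    - host: app-a.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-a
                port:
                  number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-b
  namespace: namespace2
  annotations:
    alb.ingress.kubernetes.io/group.name: group2   # group2 -> a separate ALB
spec:
  ingressClassName: alb
  rules:
    - host: app-b.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-b
                port:
                  number: 80
```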
Then the ingress records in namespace2 started hitting an error and were not able to reconcile, failing with the following:
{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}
We never paid attention to this until today when we saw the logs.
Since the Helm releases in namespace1 were stale, we went ahead and deleted all of them, which deleted all of their ingress records (assuming this also triggers reconciliation of the ingress records in the controller). This resulted in the ingress records in namespace2 getting deleted as well (we don't know how or why).
While debugging I found an audit log entry where the ALB controller patches these ingresses, setting their finalizers to null (not pasting it here right now, let me know if it's needed).
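For reference, a group member ingress normally carries a controller-managed finalizer roughly like the sketch below. The exact finalizer name here is what I'd expect for an explicit group rather than something copied from our cluster, so treat it as an assumption; the audit log patch set this finalizers list to null.

```yaml
# Rough shape (assumed, not copied from our cluster) of a group member ingress
# and the finalizer the controller manages on it; the observed patch from the
# controller emptied this finalizers list.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-b
  namespace: namespace2
  finalizers:
    - group.ingress.k8s.aws/group2   # group finalizer; removing it allows the Ingress to be deleted
  annotations:
    alb.ingress.kubernetes.io/group.name: group2
```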
From the ALB controller logs I found the following lines:
{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}
{"level":"info","ts":"2023-08-02T08:34:23Z","logger":"controllers.ingress","msg":"successfully built model","model":"{\"id\":\"group2\",\"resources\":{}}"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleting loadBalancer","arn":"arn:aws:elasticloadbalancing:us-east-1:9999999999999999loadbalancer/app/k8s-group2-awdawdawdaw/awdawdawdaw"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleted securityGroup","securityGroupID":"sg-054d"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"successfully deployed model","ingressGroup":"group1"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleted loadBalancer","arn":"arn:aws:elasticloadbalancing:us-east-1:9999999999999999loadbalancer/app/k8s-group2-awdawdawdaw/awdawdawdaw"}
Steps to reproduce
Mentioned above ^
Expected outcome
Ingresses / load balancers of group2 should not get deleted when deletion is triggered for group1.
Environment
production
- AWS Load Balancer controller version: 2.5.2
- Kubernetes version: 1.23
- Using EKS (yes/no), if so version? yes
Additional Context:
From what I found after going through the code https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/d1b8fbb0bc6b8b7639e0be06cfd6693751ac604d/pkg/deploy/elbv2/load_balancer_synthesizer.go#L190
// isSDKLoadBalancerRequiresReplacement checks whether the existing ALB must be
// recreated to match the desired spec: Type and Scheme are immutable in the ELB
// API, so a mismatch on either forces a delete-and-recreate.
func isSDKLoadBalancerRequiresReplacement(sdkLB LoadBalancerWithTags, resLB *elbv2model.LoadBalancer) bool {
    if string(resLB.Spec.Type) != awssdk.StringValue(sdkLB.LoadBalancer.Type) {
        return true
    }
    if resLB.Spec.Scheme != nil && string(*resLB.Spec.Scheme) != awssdk.StringValue(sdkLB.LoadBalancer.Scheme) {
        return true
    }
    return false
}
This piece of code can mark the load balancer for replacement/deletion when the spec does not match. My hunch is that whatever caused the log line below pushed the spec out of sync, making this function return true.
{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}
@someshkoli, Hi, do you have multiple controllers in different namespaces?
@oliviassss negative, only one controller
@someshkoli
- Did the Ingress objects in your namespace2 get deleted, or just the ALB for group2? If the Ingress objects themselves were deleted, then something must have triggered that from your end, since our controller doesn't delete Ingress objects. You can check the audit logs to see which user/component triggered the Ingress deletion.
- As for the "minimum field value of 1, CreateTargetGroupInput.Port" error, this is unexpected. Could you post more logs, especially the "successfully built model" entries that contain the large JSON-encoded model?
- As for the code you mentioned, the replacement logic only triggers when you change the Scheme or the load balancer type; since these fields are immutable in the ELB APIs, we have to recreate a replacement load balancer. However, that's not the case per your logs, since the model is empty ("model":"{\"id\":\"group2\",\"resources\":{}}"), which means all Ingresses in group2 are in a "deleting" state.

BTW, in general each Ingress group is reconciled independently; changing one Ingress group shouldn't impact another.
@M00nF1sh Hey,
- I'm not entirely sure what exactly happened; the ingress record was missing, and from the audit log all I found was a patch request from the controller to the ingress record setting finalizers=null.
- It may be. I had seen that error earlier at random, but seeing it right before the deletion line made me curious. (PS: I saw it last time as well, when the exact same situation happened.) Here's the JSON model I found in the logs:
{"level":"info","ts":"2023-08-02T08:34:19Z","logger":"controllers.ingress","msg":"successfully built model","model":{"id":"test-ingress","resources":{"AWS::EC2::SecurityGroup":{"ManagedLBSecurityGroup":{"spec":{"groupName":"k8s-group2-907ce91fe3","description":"[k8s] Managed SecurityGroup for LoadBalancer","ingress":[{"ipProtocol":"tcp","fromPort":443,"toPort":443,"ipRanges":[{"cidrIP":"0.0.0.0/0"}]},{"ipProtocol":"tcp","fromPort":80,"toPort":80,"ipRanges":[{"cidrIP":"0.0.0.0/0"}]}]}}},"AWS::ElasticLoadBalancingV2::Listener":{"80":{"spec":{"loadBalancerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::LoadBalancer/LoadBalancer/status/loadBalancerARN"},"port":80,"protocol":"HTTP","defaultActions":[{"type":"fixed-response","fixedResponseConfig":{"contentType":"text/plain","statusCode":"404"}}]}},"443":{"spec":{"loadBalancerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::LoadBalancer/LoadBalancer/status/loadBalancerARN"},"port":443,"protocol":"HTTPS","defaultActions":[{"type":"fixed-response","fixedResponseConfig":{"contentType":"text/plain","statusCode":"404"}}],"certificates":[{"certificateARN":"arn:aws:acm:us-east-1:999499138329:certificate/60f46466-c2f4-43e7-a30f-fa201b99f8ba"}],"sslPolicy":"ELBSecurityPolicy-2016-08"}}},"AWS::ElasticLoadBalancingV2::ListenerRule":{"443:1":{"spec":{"listenerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::Listener/443/status/listenerARN"},"priority":1,"actions":[{"type":"forward","forwardConfig":{"targetGroups":[{"targetGroupARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::TargetGroup/helm/app1:8081/status/targetGroupARN"}}]}}],"conditions":[{"field":"host-header","hostHeaderConfig":{"values":["app1.test-domain.com"]}},{"field":"path-pattern","pathPatternConfig":{"values":["*"]}}]}},"80:1":{"spec":{"listenerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::Listener/80/status/listenerARN"},"priority":1,"actions":[{"type":"redirect","redirectConfig":{"port":"443","protocol":"HTTPS","statusCode":"HTTP_301"}}],"conditions":[{"field":"host-header","hostHeaderConfig":{"values":["app1.test-domain.com"]}},{"field":"path-pattern","pathPatternConfig":{"values":["*"]}}]}}},"AWS::ElasticLoadBalancingV2::LoadBalancer":{"LoadBalancer":{"spec":{"name":"k8s-group1-997d1c003f","type":"application","scheme":"internet-facing","ipAddressType":"ipv4","subnetMapping":[{"subnetID":"subnet-00000000000000000"},{"subnetID":"subnet-00000000000000000"}],"securityGroups":[{"$ref":"#/resources/AWS::EC2::SecurityGroup/ManagedLBSecurityGroup/status/groupID"},"sg-00000000000000000"]}}},"AWS::ElasticLoadBalancingV2::TargetGroup":{"helm/app1:8081":{"spec":{"name":"k8s-helm-app1-c423627cdc","targetType":"instance","port":0,"protocol":"HTTP","protocolVersion":"HTTP1","ipAddressType":"ipv4","healthCheckConfig":{"port":"traffic-port","protocol":"HTTP","path":"/","matcher":{"httpCode":"200"},"intervalSeconds":15,"timeoutSeconds":5,"healthyThresholdCount":2,"unhealthyThresholdCount":2}}}},"K8S::ElasticLoadBalancingV2::TargetGroupBinding":{"helm/app1:8081":{"spec":{"template":{"metadata":{"name":"k8s-helm-app1-c423627cdc","namespace":"helm","creationTimestamp":null},"spec":{"targetGroupARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::TargetGroup/helm/app1:8081/status/targetGroupARN"},"targetType":"instance","serviceRef":{"name":"app1","port":8081},"networking":{"ingress":[{"from":[{"securityGroup":{"groupID":"sg-00000000000000000"}}],"ports":[{"protocol":"TCP","port":0}]}]},"ipAddressType":"ipv4"}}}}}}}}
However, that's not the case per your logs, since the model is empty ("model":"{\"id\":\"group2\",\"resources\":{}}"), which means all Ingresses in group2 are in a "deleting" state.
Yes, exactly my concern: how did this happen in the first place? I'm assuming this is what caused the LB to get marked for deletion -> causing the controller to send the finalizer=null patch to the ingress record -> which then queued the ingress for deletion. I haven't looked into the code yet, but is it possible that the broken model (the error I sent you) is somehow getting applied, setting the entire resource model to empty {} and making the scheme condition put it into a deleting state?
BTW, in general each Ingress group is reconciled independently; changing one Ingress group shouldn't impact another.
That is how it's supposed to behave; I've no idea why this happened (twice).
I'm having trouble figuring out where that line could be logged from.
@johngmyers which one?
minimum field value of 1, CreateTargetGroupInput.Port.\n"}
This one? I found that this error pops up when you have alb.ingress.kubernetes.io/target-type: instance and the underlying Service type is ClusterIP, which is fair, but the error is a bit misleading. It is well documented though.
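For anyone else hitting it, here's a minimal sketch of the combination that produces the error. The names are taken from the model dump above; everything else is assumed. With target-type instance the controller registers nodes on the Service's NodePort, and a ClusterIP Service has none, so the target group gets modeled with port 0 and the ELB API rejects it.

```yaml
# Minimal sketch (assumed manifests) of the combination behind
# "minimum field value of 1, CreateTargetGroupInput.Port":
# instance targets need a NodePort, but this Service is ClusterIP,
# so the target group is modeled with port 0.
apiVersion: v1
kind: Service
metadata:
  name: app1
  namespace: helm
spec:
  type: ClusterIP            # should be NodePort (or LoadBalancer) for instance targets
  selector:
    app: app1
  ports:
    - port: 8081
      targetPort: 8081
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app1
  namespace: helm
  annotations:
    alb.ingress.kubernetes.io/group.name: group2
    alb.ingress.kubernetes.io/target-type: instance   # mismatch with the ClusterIP Service above
spec:
  ingressClassName: alb
  rules:
    - host: app1.test-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app1
                port:
                  number: 8081
```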
Another interesting thing I found while trying to replicate this whole scenario (PS: this is a whole new thing, I might raise a separate issue for it): apply a faulty ingress (i1) with group g1 and host entry h1 -> reconciliation fails -> no ALB is allocated -> then apply another faulty ingress (i2) with group g1 and host entry h2.
You will notice that ingress record i2 now has host entry h1. I thought this was a reconciliation issue that would get fixed once the fault in the ingress was fixed, but after fixing the fault it kept host h1 on ingress i2 :skull:
PS: by "fault" above I mean setting alb.ingress.kubernetes.io/target-type: instance on a ClusterIP service.
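To make that repro concrete, a rough sketch of the two faulty ingresses (i1/i2, group g1, and the hosts/services are placeholders from the description above, nothing copied from a real cluster):

```yaml
# Rough sketch of the repro (placeholder names): two ingresses in the same group,
# both "faulty" because they request instance targets while their backing Services
# are ClusterIP. Per the report above, after i1 failed to reconcile, i2 ended up
# carrying i1's host entry (h1) instead of h2.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: i1
  annotations:
    alb.ingress.kubernetes.io/group.name: g1
    alb.ingress.kubernetes.io/target-type: instance   # fault: svc1 below is ClusterIP
spec:
  ingressClassName: alb
  rules:
    - host: h1.example.com          # host entry "h1"
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: svc1          # hypothetical ClusterIP Service
                port:
                  number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: i2
  annotations:
    alb.ingress.kubernetes.io/group.name: g1
    alb.ingress.kubernetes.io/target-type: instance   # same fault
spec:
  ingressClassName: alb
  rules:
    - host: h2.example.com          # host entry "h2"
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: svc2          # hypothetical ClusterIP Service
                port:
                  number: 80
```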
We also encountered this issue. An Ingress resource was deleted in namespace1 and the LoadBalancers for 3 ingresses in namespace2 were deleted. This caused an outage for 3 services; the Ingress resources for these 3 didn't change other than the LB hostname status field eventually going blank.
Annotations in use for the 3 ingresses that had their LoadBalancers incorrectly deleted:
alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-west-2:xxxxxxxxxxxxx:certificate/xxxxxxxxx
alb.ingress.kubernetes.io/healthcheck-path: /
alb.ingress.kubernetes.io/healthcheck-port: "80"
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: ip
kubernetes.io/ingress.class: alb
The Service type is ClusterIP in all cases.
This occurred on v2.4.5 (image version), helm chart v1.4.6
It is extremely alarming that this can happen.
Finally, someone who can relate. We've had this outage twice, and since there was no update on this conversation I had started thinking that I might have somehow deleted it by mistake.
I tried reproducing it but couldn't. What about you?
Controller logs for the sync that did the inappropriate deletion would be helpful.
Unfortunately logs for this controller weren't being shipped at the time and the pods were restarted during troubleshooting so we lost them. I do have the CloudTrail events that show that the IRSA role the controller was using is what did the deletion, but not much other than that.
I have the container logs, let me know if you want me to send them over. Hopefully they don't contain any sensitive information. PS: they're the native logs, I haven't touched them.
Oh also, I should note that deleting and recreating the Ingress resources fixed it immediately. I've been testing v2.6.1 in a dev cluster: I manually deleted the AWS LoadBalancer resources, and the controller started throwing 403 IAM errors like this:
{"level":"error","ts":"2023-09-14T17:52:06Z","msg":"Reconciler error","controller":"ingress","object":{"name":"cd-demo-frontend","namespace":"development"},"namespace":"development","name":"cd-demo-frontend","reconcileID":"4ecd3e0a-5acd-47ab-8127-6f4fdd1fc6d6","error":"AccessDenied: User: arn:aws:sts::XXXXXXX:assumed-role/alb-ingress-irsa-role/XXXXX is not authorized to perform: elasticloadbalancing:AddTags on resource: arn:aws:elasticloadbalancing:us-west-2:XXXXXX:targetgroup/k8s-developm-cddemofr-d994612186/* because no identity-based policy allows the elasticloadbalancing:AddTags action\n\tstatus code: 403, request id: 2a3686f4-d682-4fe4-b3b3-54e1e7be32ec"}
I waited > 10 hours for the default --sync-period but it didn't recreate them.
@blakebarnett, this is a separate issue, see: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/3383#issuecomment-1718066437
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.