
Load balancers getting deleted randomly on deletion of ingress records with a completely different group name

Open someshkoli opened this issue 1 year ago • 17 comments

Describe the bug This has happened twice now, and everything that happened the first time happened again this time.

We had a few Helm releases in namespace=namespace1 in which we were creating ingress records. The group name attached to those ingress records was group1.

There were a few other Helm releases in namespace=namespace2 in which we were creating ingress records. The group name attached to these ingress records was group2.
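
For context, the ingress records looked roughly like this (a minimal sketch; names, hosts, and backing services are illustrative, the real ones came from the Helm releases):

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: app-a
      namespace: namespace1
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/group.name: group1   # shared by the namespace1 ingresses
    spec:
      rules:
        - host: app-a.example.com                      # illustrative host
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: app-a
                    port:
                      number: 80
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: app-b
      namespace: namespace2
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/group.name: group2   # shared by the namespace2 ingresses
    spec:
      rules:
        - host: app-b.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: app-b
                    port:
                      number: 80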

Now, the ingress records in namespace2 were not able to reconcile due to the following error.

{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}

We never paid attention to this until today when we saw the logs.

Since the Helm releases in namespace1 were stale, we went ahead and deleted all of them, which caused all of their ingress records to be deleted as well (assuming this also triggers reconciliation of the ingress records in the controller).

This resulted in the ingress records in namespace2 getting deleted (I don't know how or why). While debugging I found an audit log entry where the ALB controller set the finalizers for these ingresses to null (not pasting it here right now, let me know if it's needed). In the ALB controller logs I found the following lines:

{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}
{"level":"info","ts":"2023-08-02T08:34:23Z","logger":"controllers.ingress","msg":"successfully built model","model":"{\"id\":\"group2\",\"resources\":{}}"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleting loadBalancer","arn":"arn:aws:elasticloadbalancing:us-east-1:9999999999999999loadbalancer/app/k8s-group2-awdawdawdaw/awdawdawdaw"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleted securityGroup","securityGroupID":"sg-054d"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"successfully deployed model","ingressGroup":"group1"}
{"level":"info","ts":"2023-08-02T08:34:25Z","logger":"controllers.ingress","msg":"deleted loadBalancer","arn":"arn:aws:elasticloadbalancing:us-east-1:9999999999999999loadbalancer/app/k8s-group2-awdawdawdaw/awdawdawdaw"}

Steps to reproduce Mentioned above ^

Expected outcome Ingresses / load balancers of group2 should not get deleted when deletion is triggered for group1.

Environment production

  • AWS Load Balancer controller version: 2.5.2
  • Kubernetes version: 1.23
  • Using EKS (yes/no), if so version? yes

Additional Context:
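
For context on the finalizer patch mentioned above: ingresses that belong to an explicit group normally carry a group finalizer owned by the controller, and the audit log showed that finalizer being cleared right before the deletion. A rough sketch of a group member's metadata (the finalizer name format here is my assumption of how the controller names explicit-group finalizers):

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: some-app                     # illustrative name
      namespace: namespace2
      annotations:
        alb.ingress.kubernetes.io/group.name: group2
      finalizers:
        - group.ingress.k8s.aws/group2   # this is what the audit log showed being patched to null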

someshkoli avatar Aug 02 '23 13:08 someshkoli

Here is what I found after going through the code: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/d1b8fbb0bc6b8b7639e0be06cfd6693751ac604d/pkg/deploy/elbv2/load_balancer_synthesizer.go#L190

// isSDKLoadBalancerRequiresReplacement returns true when an immutable field
// (load balancer type or scheme) on the existing ALB differs from the desired
// spec, forcing the controller to delete and recreate the load balancer.
func isSDKLoadBalancerRequiresReplacement(sdkLB LoadBalancerWithTags, resLB *elbv2model.LoadBalancer) bool {
	if string(resLB.Spec.Type) != awssdk.StringValue(sdkLB.LoadBalancer.Type) {
		return true
	}
	if resLB.Spec.Scheme != nil && string(*resLB.Spec.Scheme) != awssdk.StringValue(sdkLB.LoadBalancer.Scheme) {
		return true
	}
	return false
}

This piece of code marks the load balancer for replacement / deletion when the spec does not match. My hunch is that whatever caused the log line below pushed the spec out of sync, making this function return true.

{"level":"error","ts":"2023-08-02T08:34:21Z","msg":"Reconciler error","controller":"ingress","object":{"name":"group2"},"namespace":"","name":"group2","reconcileID":"82e1580e","error":"InvalidParameter: 1 validation error(s) found.\n- minimum field value of 1, CreateTargetGroupInput.Port.\n"}

someshkoli avatar Aug 02 '23 13:08 someshkoli

@someshkoli, Hi, do you have multiple controllers in different namespaces?

oliviassss avatar Aug 02 '23 20:08 oliviassss

@someshkoli, Hi, do you have multiple controllers in different namespaces?

@oliviassss no, only one controller

someshkoli avatar Aug 02 '23 20:08 someshkoli

@someshkoli

  1. Did the "Ingress" objects in your namespace2 got deleted or just the ALB for the "group2" got deleted? If it's the "Ingress" objects got deleted, then it must be something triggered from your end since our controller didn't delete Ingress objects. You can refer the audit logs to see which user/component trigger the Ingress deletion.

  2. As for the minimum field value of 1, CreateTargetGroupInput.Port error, this is unexpected. Could you post more logs, especially the ones with "successfully built model", where there is a large JSON-encoded model?

  3. As for the code you mentioned, the replacement logic only triggers when you change the "Scheme" or the load balancer type; since these fields are immutable in the ELB APIs, we have to recreate a replacement. However, that's not the case per your logs, since the model is empty ("model":"{\"id\":\"group2\",\"resources\":{}}"), which means all Ingresses in group2 are in the "deleting" state.

BTW, in general each Ingress group is reconciled independently; changing one Ingress group shouldn't impact another.

M00nF1sh avatar Aug 02 '23 22:08 M00nF1sh

@M00nF1sh Hey,

  1. I'm not entirely sure what exactly happened; the ingress record was just missing, and from the audit log all I found was a patch request from the controller to the ingress record setting finalizers=null.

  2. It may be; I had seen it earlier at random, but seeing it right before the deletion line made me curious. (PS: I had seen this last time as well, when the exact same situation happened.) Here's the JSON model that I found in the logs:

{"level":"info","ts":"2023-08-02T08:34:19Z","logger":"controllers.ingress","msg":"successfully built model","model":{"id":"test-ingress","resources":{"AWS::EC2::SecurityGroup":{"ManagedLBSecurityGroup":{"spec":{"groupName":"k8s-group2-907ce91fe3","description":"[k8s] Managed SecurityGroup for LoadBalancer","ingress":[{"ipProtocol":"tcp","fromPort":443,"toPort":443,"ipRanges":[{"cidrIP":"0.0.0.0/0"}]},{"ipProtocol":"tcp","fromPort":80,"toPort":80,"ipRanges":[{"cidrIP":"0.0.0.0/0"}]}]}}},"AWS::ElasticLoadBalancingV2::Listener":{"80":{"spec":{"loadBalancerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::LoadBalancer/LoadBalancer/status/loadBalancerARN"},"port":80,"protocol":"HTTP","defaultActions":[{"type":"fixed-response","fixedResponseConfig":{"contentType":"text/plain","statusCode":"404"}}]}},"443":{"spec":{"loadBalancerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::LoadBalancer/LoadBalancer/status/loadBalancerARN"},"port":443,"protocol":"HTTPS","defaultActions":[{"type":"fixed-response","fixedResponseConfig":{"contentType":"text/plain","statusCode":"404"}}],"certificates":[{"certificateARN":"arn:aws:acm:us-east-1:999499138329:certificate/60f46466-c2f4-43e7-a30f-fa201b99f8ba"}],"sslPolicy":"ELBSecurityPolicy-2016-08"}}},"AWS::ElasticLoadBalancingV2::ListenerRule":{"443:1":{"spec":{"listenerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::Listener/443/status/listenerARN"},"priority":1,"actions":[{"type":"forward","forwardConfig":{"targetGroups":[{"targetGroupARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::TargetGroup/helm/app1:8081/status/targetGroupARN"}}]}}],"conditions":[{"field":"host-header","hostHeaderConfig":{"values":["app1.test-domain.com"]}},{"field":"path-pattern","pathPatternConfig":{"values":["*"]}}]}},"80:1":{"spec":{"listenerARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::Listener/80/status/listenerARN"},"priority":1,"actions":[{"type":"redirect","redirectConfig":{"port":"443","protocol":"HTTPS","statusCode":"HTTP_301"}}],"conditions":[{"field":"host-header","hostHeaderConfig":{"values":["app1.test-domain.com"]}},{"field":"path-pattern","pathPatternConfig":{"values":["*"]}}]}}},"AWS::ElasticLoadBalancingV2::LoadBalancer":{"LoadBalancer":{"spec":{"name":"k8s-group1-997d1c003f","type":"application","scheme":"internet-facing","ipAddressType":"ipv4","subnetMapping":[{"subnetID":"subnet-00000000000000000"},{"subnetID":"subnet-00000000000000000"}],"securityGroups":[{"$ref":"#/resources/AWS::EC2::SecurityGroup/ManagedLBSecurityGroup/status/groupID"},"sg-00000000000000000"]}}},"AWS::ElasticLoadBalancingV2::TargetGroup":{"helm/app1:8081":{"spec":{"name":"k8s-helm-app1-c423627cdc","targetType":"instance","port":0,"protocol":"HTTP","protocolVersion":"HTTP1","ipAddressType":"ipv4","healthCheckConfig":{"port":"traffic-port","protocol":"HTTP","path":"/","matcher":{"httpCode":"200"},"intervalSeconds":15,"timeoutSeconds":5,"healthyThresholdCount":2,"unhealthyThresholdCount":2}}}},"K8S::ElasticLoadBalancingV2::TargetGroupBinding":{"helm/app1:8081":{"spec":{"template":{"metadata":{"name":"k8s-helm-app1-c423627cdc","namespace":"helm","creationTimestamp":null},"spec":{"targetGroupARN":{"$ref":"#/resources/AWS::ElasticLoadBalancingV2::TargetGroup/helm/app1:8081/status/targetGroupARN"},"targetType":"instance","serviceRef":{"name":"app1","port":8081},"networking":{"ingress":[{"from":[{"securityGroup":{"groupID":"sg-00000000000000000"}}],"ports":[{"protocol":"TCP","port":0}]}]},"ipAddressType":"ipv4"}}}}}}}}

However, that's not the case per your logs, since the model is empty ("model":"{\"id\":\"group2\",\"resources\":{}}"), which means all Ingresses in group2 are in the "deleting" state.

Yes, exactly my concern: how did this happen in the first place? I'm assuming this is what caused the LB to get marked for deletion -> causing the controller to patch the finalizer to null on the ingress record -> which then queued the ingress for deletion. I haven't looked into the code yet, but is it possible that the broken model (the error I sent you) is somehow getting applied, setting the entire resource model to empty {} and pushing the scheme check into the deletion path?

BTW, In general, each Ingress group is reconciled independently, changing one Ingress group shouldn't impact another.

That is how it's supposed to behave; I've no idea why this happened (twice).

someshkoli avatar Aug 02 '23 22:08 someshkoli

I'm having trouble figuring out where that line could be logged from.

johngmyers avatar Aug 06 '23 21:08 johngmyers

@johngmyers which one ?

minimum field value of 1, CreateTargetGroupInput.Port.

This one? I found that this error pops up when you have alb.ingress.kubernetes.io/target-type: instance and the underlying Service type is ClusterIP, which is fair enough, but the error is a bit misleading. It is well documented, though.
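
In other words, a combination roughly like the following reproduces that error: with target-type: instance the controller registers node ports, and a ClusterIP Service has none, so the target group port resolves to 0 (sketch only; names are taken from the model above, the Service shape and group name are assumed):

    apiVersion: v1
    kind: Service
    metadata:
      name: app1
      namespace: helm
    spec:
      type: ClusterIP                    # no nodePort allocated, so instance mode has nothing to register
      selector:
        app: app1
      ports:
        - port: 8081
          targetPort: 8081
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: app1
      namespace: helm
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/group.name: group2
        alb.ingress.kubernetes.io/target-type: instance   # mismatch with the ClusterIP Service above
    spec:
      rules:
        - host: app1.test-domain.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: app1
                    port:
                      number: 8081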

someshkoli avatar Aug 08 '23 20:08 someshkoli

Another interesting thing that I found while trying to replicate this entire thing

PS: this is a whole new thing, I might raise a new issue for it

When a faulty ingress (i1) is applied with group g1 and host entry h1 -> reconciliation fails -> ALB is not allocated -> apply another faulty ingress (i2) with group g1 and host entry h2.

You will notice that ingress record i2 now has h1 as its host entry. I thought this was a reconciliation issue that might get fixed once the fault in the ingress was fixed, but after fixing the fault it kept host h1 in ingress i2 :skull:

PS: by "fault" above I mean setting alb.ingress.kubernetes.io/target-type: instance on a ClusterIP Service, as sketched below.
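
To make the repro concrete, the two ingresses looked roughly like this (sketch; i1/i2, g1, and h1/h2 are the placeholders from above, and both backends are ClusterIP Services so the fault persists):

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: i1
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/group.name: g1
        alb.ingress.kubernetes.io/target-type: instance   # fault: backing Service is ClusterIP
    spec:
      rules:
        - host: h1.example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: svc1
                    port:
                      number: 80
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: i2
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/group.name: g1
        alb.ingress.kubernetes.io/target-type: instance   # same fault
    spec:
      rules:
        - host: h2.example.com                            # applied as h2, but ended up showing h1 after the failed reconciles
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: svc2
                    port:
                      number: 80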

someshkoli avatar Aug 08 '23 20:08 someshkoli

We also encountered this issue. An Ingress resource was deleted in namespace1 and the load balancers for 3 ingresses in namespace2 were deleted. This caused an outage for 3 services; the Ingress resources for these 3 didn't change, other than the LB hostname status field eventually going blank.

Annotations in use for the 3 ingresses that had their LoadBalancers incorrectly deleted:

    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-west-2:xxxxxxxxxxxxx:certificate/xxxxxxxxx
    alb.ingress.kubernetes.io/healthcheck-path: /
    alb.ingress.kubernetes.io/healthcheck-port: "80"
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    kubernetes.io/ingress.class: alb

The Service type is ClusterIP in all cases.

This occurred on v2.4.5 (image version), helm chart v1.4.6

It is extremely alarming that this can happen.

blakebarnett avatar Sep 14 '23 17:09 blakebarnett

We also encountered this issue. An Ingress resource was deleted in namespace1 and the load balancers for 3 ingresses in namespace2 were deleted. This caused an outage for 3 services; the Ingress resources for these 3 didn't change, other than the LB hostname status field eventually going blank.


Finally, someone who can relate. We've had such an outage twice, and since there was no update on this conversation I had started thinking that I might've deleted it by mistake (somehow, randomly).

I tried reproducing it but couldn't. What about you?

someshkoli avatar Sep 14 '23 17:09 someshkoli

Controller logs for the sync that did the inappropriate deletion would be helpful.

johngmyers avatar Sep 14 '23 18:09 johngmyers

Unfortunately, logs for this controller weren't being shipped at the time, and the pods were restarted during troubleshooting, so we lost them. I do have the CloudTrail events showing that the IRSA role the controller was using is what did the deletion, but not much other than that.

blakebarnett avatar Sep 14 '23 18:09 blakebarnett

I have the container logs, let me know if you want me to send them. Hopefully they don't contain any sensitive information. PS: these are the raw logs, I haven't touched them.

someshkoli avatar Sep 14 '23 18:09 someshkoli

Oh also, I should note that deleting and recreating the Ingress resources fixed it immediately. I've been testing v2.6.1 in a dev cluster; I manually deleted the AWS load balancer resources, and the controller started throwing 403 IAM errors like this:

{"level":"error","ts":"2023-09-14T17:52:06Z","msg":"Reconciler error","controller":"ingress","object":{"name":"cd-demo-frontend","namespace":"development"},"namespace":"development","name":"cd-demo-frontend","reconcileID":"4ecd3e0a-5acd-47ab-8127-6f4fdd1fc6d6","error":"AccessDenied: User: arn:aws:sts::XXXXXXX:assumed-role/alb-ingress-irsa-role/XXXXX is not authorized to perform: elasticloadbalancing:AddTags on resource: arn:aws:elasticloadbalancing:us-west-2:XXXXXX:targetgroup/k8s-developm-cddemofr-d994612186/* because no identity-based policy allows the elasticloadbalancing:AddTags action\n\tstatus code: 403, request id: 2a3686f4-d682-4fe4-b3b3-54e1e7be32ec"}

I waited > 10 hours for the default --sync-period but it didn't recreate them.

blakebarnett avatar Sep 14 '23 18:09 blakebarnett

@blakebarnett, this is a separate issue, see: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/3383#issuecomment-1718066437

oliviassss avatar Sep 14 '23 18:09 oliviassss

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 28 '24 10:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 27 '24 11:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 28 '24 12:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 28 '24 12:03 k8s-ci-robot