kube-ingress-aws-controller icon indicating copy to clipboard operation
kube-ingress-aws-controller copied to clipboard

Is there a way to migrate from 0.11.14 through 0.12.16 to 0.12.17+ without downtime? Target group creating empty at first

Open jbilliau-rcd opened this issue 2 years ago • 23 comments

Hi, we have been using this ingress controller for years, and are currently on 0.11.14. We have been testing upgrade to 0.12.16 which work flawlessly and, while the Cloudformation stack is updated for some reason, the actual resources don't change; there's no downtime.

If we try going to 0.12.17 however (not using the new experimental CNI), the controller comes up, updates our CF stacks, and ends up creating and attaching a new TG, while deleting the old one. The problem? This new TG has no instances in it for 30 seconds, at which point it does. This would incur a 30 seconds downtime across all of production if we went with this, not sure if we can get away with that.

Are there any flags or workarounds to prevent this? Any idea what 0.12.17 is doing differently that it needs to recreate the TG entirely, and empty at first no-less?

Screenshot of it happening right after the CF update:

image

Logs of controller as it comes up on new 0.12.17 image (from 0.11.14)

time="2022-05-12T17:38:11Z" level=info msg="starting /kube-ingress-aws-controller v0.12.17"
--
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Kubernetes API server: "
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Cluster ID: jason-dev-9999999-us-east-2"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="VPC ID: vpc-9999999999"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Instance ID: i-05ded1d34e4817610"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Security group ID: sg-018f425e420f2db38"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Internal subnet IDs: [subnet-046134d61fe120cbc subnet-09fa2d9414fe47482 subnet-0f876b2913eb84f4b]"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Public subnet IDs: [subnet-06f1a762865b7ad3c subnet-086df96fe15ae9076 subnet-013c830b5b448157e]"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="EC2 filters: tag:kubernetes.io/cluster/jason-dev-418023852230-us-east-2=owned tag-key=k8s.io/role/node"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Certificates per ALB: 24 (SNI: true)"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Blacklisted Certificate ARNs (0): "
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Ingress class filters: alb-istio"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="ALB Logging S3 Bucket: "
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="ALB Logging S3 Prefix: "
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="CloudWatch Alarm ConfigMap: "
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Default LoadBalancer type: application"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Target access mode: HostPort"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=warning msg="Disabling RouteGroup support because listing RouteGroups failed: resource not found, to get more information https://opensource.zalando.com/skipper/kubernetes/routegroups/#routegroups"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Found 3 owned auto scaling group(s)"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Found 3 targeted auto scaling group(s)"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Found 0 single instance(s)"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Found 0 EC2 instance(s)"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Found 16 certificate(s)"
Thu, May 12 2022 1:40:17 pm | time="2022-05-12T17:40:17Z" level=info msg="Found 0 cloudwatch alarm configuration(s)"
Thu, May 12 2022 1:40:21 pm | time="2022-05-12T17:40:21Z" level=info msg="Updating \"internet-facing\" stack for 1 certificates / 1 ingresses"
Thu, May 12 2022 1:40:21 pm | time="2022-05-12T17:40:21Z" level=info msg="Stack \"arn:aws:cloudformation:us-east-2:999999999:stack/kube-ingress-aws-controller-jason-dev-99999999-us-east-2-a173ffb1-df89-4dfc-86ef-d3e3c8495134/d0bb7440-d208-11ec-86f2-028a53c58cfe\" for certificate map[\"arn:aws:acm:us-east-2:99999999:certificate/eeb15f56-192c-41e5-be87-f1edfb9d9c57\":\"0001-01-01 00:00:00 +0000 UTC\"] updated"
Thu, May 12 2022 1:40:21 pm | time="2022-05-12T17:40:21Z" level=info msg="Updating \"internet-facing\" stack for 1 certificates / 1 ingresses"
Thu, May 12 2022 1:40:21 pm | time="2022-05-12T17:40:21Z" level=info msg="Stack \"arn:aws:cloudformation:us-east-2:999999999:stack/kube-ingress-aws-controller-jason-dev-999999999-us-east-2-c8b74d37-e541-42da-805e-026509a41812/11bd7aa0-b1da-11ec-85d9-062a1dd6a318\" for certificate map[\"arn:aws:acm:us-east-2:99999999:certificate/eeb15f56-192c-41e5-be87-f1edfb9d9c57\":\"0001-01-01 00:00:00 +0000 UTC\"] updated"
Thu, May 12 2022 1:40:51 pm | time="2022-05-12T17:40:51Z" level=info msg="Found 3 owned auto scaling group(s)"
Thu, May 12 2022 1:40:51 pm | time="2022-05-12T17:40:51Z" level=info msg="Found 3 targeted auto scaling group(s)"
Thu, May 12 2022 1:40:51 pm | time="2022-05-12T17:40:51Z" level=info msg="Found 0 single instance(s)"
Thu, May 12 2022 1:40:51 pm | time="2022-05-12T17:40:51Z" level=info msg="Found 0 EC2 instance(s)"
Thu, May 12 2022 1:40:51 pm | time="2022-05-12T17:40:51Z" level=info msg="Found 16 certificate(s)"
Thu, May 12 2022 1:40:51 pm | time="2022-05-12T17:40:51Z" level=info msg="Found 0 cloudwatch alarm configuration(s)"


jbilliau-rcd avatar May 12 '22 17:05 jbilliau-rcd

Hello. Thank you for bringing this up. The problem is the change in CF template we've introduced in >=0.12.17. It sets previously absent TargetType to the default value elbv2.TargetTypeEnumInstance and triggers target group re-creation.

https://github.com/zalando-incubator/kube-ingress-aws-controller/compare/v0.12.16...v0.12.17

git --no-pager diff v0.12.16..v0.12.17 -- aws/cf_template.go
diff --git a/aws/cf_template.go b/aws/cf_template.go
index c5b0af0..75a2b7b 100644
--- a/aws/cf_template.go
+++ b/aws/cf_template.go
@@ -8,6 +8,7 @@ import (
        "crypto/sha256"
        "sort"
 
+       "github.com/aws/aws-sdk-go/service/elbv2"
        cloudformation "github.com/mweagle/go-cloudformation"
 )
 
@@ -444,6 +445,10 @@ func generateDenyInternalTrafficRule(listenerName string, rulePriority int64, in
 }
 
 func newTargetGroup(spec *stackSpec, targetPortParameter string) *cloudformation.ElasticLoadBalancingV2TargetGroup {
+       targetType := elbv2.TargetTypeEnumInstance
+       if spec.targetAccessModeCNI {
+               targetType = elbv2.TargetTypeEnumIp
+       }
        protocol := "HTTP"
        healthCheckProtocol := "HTTP"
        healthyThresholdCount, unhealthyThresholdCount := spec.albHealthyThresholdCount, spec.albUnhealthyThresholdCount
@@ -472,6 +477,7 @@ func newTargetGroup(spec *stackSpec, targetPortParameter string) *cloudformation
                UnhealthyThresholdCount:    cloudformation.Integer(int64(unhealthyThresholdCount)),
                Port:                       cloudformation.Ref(targetPortParameter).Integer(),
                Protocol:                   cloudformation.String(protocol),
+               TargetType:                 cloudformation.String(targetType),
                VPCID:                      cloudformation.Ref(parameterTargetGroupVPCIDParameter).String(),
        }

Are there any flags or workarounds to prevent this?

Currently there is no workaround. The prospective fix should take into account controller deployments that have gone through the update already and for them it should not trigger downtime once again.

AlexanderYastrebov avatar May 13 '22 09:05 AlexanderYastrebov

@AlexanderYastrebov ah ok, I was wondering why it was recreating the target group, that explains it. Is there a way for the controller to "wait" for the target group to be populated before swapping it in? It throws it in and rips out the healthy one without any validation. Just a thought....if not, it is what it is I suppose.

jbilliau-rcd avatar May 13 '22 13:05 jbilliau-rcd

so I'm trying out the new experimental CNI mode, since if im taking downtime, i might as well try converting to this. The controller keeps endlessly updating my CF stack for some reason:

image image

Is that intentional?

jbilliau-rcd avatar May 13 '22 15:05 jbilliau-rcd

@universam1 might have an idea

szuecs avatar May 13 '22 18:05 szuecs

Haven't seen this ever - granted we are running it already in dozens of production clusters. Is that reproducible, can you post your config?

universam1 avatar May 15 '22 11:05 universam1

Regarding a workaround I could think of a tri-state such by changing the parameter to a string, so that the cfn value is only set when the parameter is supplied. I can prepare that if you wish

universam1 avatar May 16 '22 09:05 universam1

You mean for my original issue, of the target group being destroyed and reattached completely empty for a period of time? Yes, if that's fixable via a workaround, that would be amazing. We have to upgrade to 0.12.17+ due to this BoundServiceAccountToken issue with K8 1.21, and doing it without having a global 30 second downtime would be ideal!

jbilliau-rcd avatar May 16 '22 14:05 jbilliau-rcd

@jbilliau-rcd can you share the cli flags you start the controller with? I think this is what @universam1 wants to understand in https://github.com/zalando-incubator/kube-ingress-aws-controller/issues/507#issuecomment-1126912769

szuecs avatar May 19 '22 18:05 szuecs

ah ok, we run with these arguments:

args:
    - --health-check-path=/healthz/ready
    - --health-check-port=31370
    - --target-port=31380
    - --ingress-class-filter=alb-istio
    - --controller-id=istio-gw
    - --redirect-http-to-https
    - --deregistration-delay-timeout=60s
    - --idle-connection-timeout=310s

Though in testing our EKS 1.22, are also needing to add this as well, until it's fixed/defaulted in a future release:

- --ingress-api-version=networking.k8s.io/v1

jbilliau-rcd avatar May 19 '22 20:05 jbilliau-rcd

@universam1 any luck yet? Just checking in.

jbilliau-rcd avatar May 26 '22 13:05 jbilliau-rcd

@jbilliau-rcd I can see the CF update triggered without changes every couple of days. Honestly I don't know if that is desired or not, @szuecs will know better! Could it be related to the controller pod restarts?

Btw. in order to enable the AWSCNI mode you'll have to define following args:

--target-access-mode=AWSCNI
--target-cni-namespace=kube-system
--target-cni-pod-labelselector=application=skipper-ingress

Judging your command lines, you haven't actually enabled the CNI mode!?

universam1 avatar May 26 '22 13:05 universam1

@universam1 sorry, maybe I misunderstood. When you wrote this:

Regarding a workaround I could think of a tri-state such by changing the parameter to a string, so that the cfn value is only set when the parameter is supplied. I can prepare that if you wish

I assumed you meant you had a way of fixing the usual/normal way of attaching target groups with ec2's in them so it wouldnt create a new target group, making upgrading to the latest version of this controller a downtime-inducing change.

If you were talking about the constant looping when I was trying out CNI mode, that was using these arguments:

- --health-check-path=/healthz/ready
- --health-check-port=15021
- --target-port=8080
- --ingress-class-filter=alb-istio
- --controller-id=istio-gw
- --redirect-http-to-https
- --deregistration-delay-timeout=60s
- --idle-connection-timeout=310s
- --target-access-mode=AWSCNI
- --target-cni-namespace=istio-system
- --target-cni-pod-labelselector=app=istio-ingressgateway
- --alb-healthy-threshold-count=2

jbilliau-rcd avatar May 26 '22 13:05 jbilliau-rcd

Regarding a workaround I could think of a tri-state such by changing the parameter to a string, so that the cfn value is only set when the parameter is supplied. I can prepare that if you wish

I assumed you meant you had a way of fixing the usual/normal way of attaching target groups with ec2's in them so it wouldnt create a new target group, making upgrading to the latest version of this controller a downtime-inducing change.

I think the idea here is to add a workaround flag, say -default-target-type equal to "instance" by default and use its value to initialize targetType variable from https://github.com/zalando-incubator/kube-ingress-aws-controller/issues/507#issuecomment-1125857477 The logic would also change to set TargetType field only if the value is not empty.

To upgrade from <0.12.17 users would need to set this flag to empty and keep it forever (because before 0.12.17 TargetType filed was never set)

Alternatively the default value could be set to empty - then all the deployments that have already updated to >=0.12.17 with downtime would need to set its value to "instance" and keep forever to prevent a second downtime.

AlexanderYastrebov avatar May 26 '22 23:05 AlexanderYastrebov

Third idea that I can think of could be extending the use of the existing --target-access-mode flag like this

  • --target-access-mode="" or undefined (default) - for <0.12.17 users which allows them a seamless upgrade
  • --target-access-mode=instance for >=0.12.17 users to prevent a template change
  • --target-access-mode=AWSCNI as before

This way only the instance users that have already gone through the downtime need to set this flag to prevent further changes. The advantage is, that these users can already as of today set this value to be safe, they do not need to wait or sync up with the release of the fix.

universam1 avatar May 27 '22 09:05 universam1

@AlexanderYastrebov @universam1 loving both ideas, whatever allows me to use the most recent version and not have my target groups recreated. We are wanting to upgrade due to the 1.21 changes to BoundServiceAccountToken, EKS currently gives you a 90 day window before the tokens expire so across our 78 clusters, this controller randomly starts getting unauthorized errors until we redeploy it. The newer code now supports auto-rotation of this due to the newer code SDK it's using.

So @universam1 with your idea, I could just add --target-access-mode=instance to my args, go to the latest release, deploy to all my clusters, and be running the newest release with no target group changes? If so, would you guys mind making that change? i'm not a developer, otherwise I would; just a humble infrastructure engineer trying to get by, haha.

jbilliau-rcd avatar May 27 '22 14:05 jbilliau-rcd

@jbilliau-rcd regarding option #3 https://github.com/zalando-incubator/kube-ingress-aws-controller/issues/507#issuecomment-1139446142 it would be the other way around, in your case you would omit --target-access-mode or set it to "" for a seamless migration. As soon as one would set this parameter, the template would change.

But I'm not a stakeholder of this project, so @szuecs @AlexanderYastrebov need to do the call I guess.

universam1 avatar May 27 '22 14:05 universam1

@universam1 so are you saying --target-access-mode, prior to 0.12.17, used to not be set, and now has a default in 0.12.17+ that is recreating the target group? So if I explicitly set it to "" in my container arguments, I would "replicating" what it used to be in code and thus, my TG wouldn't change?

jbilliau-rcd avatar Jun 11 '22 15:06 jbilliau-rcd

That is a suggestion for a potential implementation - but that all depends on the maintainer if they are accepting that or not 🤷🏻

universam1 avatar Jun 13 '22 15:06 universam1

@AlexanderYastrebov I am fine with either solution you or @universam1 explained. I would suggest just to increase the minor version again to tell users what to do in order to prevent the incident. What do you think?

szuecs avatar Jun 13 '22 17:06 szuecs

@AlexanderYastrebov @universam1 any thoughts on this? It seems like allowing an argument of --target-access-node="" would allow for a seamless upgrade without rebuilding of a target group. This would be a nice option for those of us who want to upgrade to the latest but cant take downtime.

jbilliau-rcd avatar Jun 27 '22 18:06 jbilliau-rcd

I don't like the empty value, I would rather go for explicit --target-access-mode=pre0.12.17 and add upgrade notes.

I think we can also make target-access-mode mandatory to force upgrading users to choose the pill:

  • upgrade from pre 0.12.17 - set --target-access-mode=pre0.12.17 - this way we also support those who are not yet aware of this issue and do not test for downtime before upgrading
  • otherwise (recent deployment or upgraded from pre 0.12.17 with downtime ) set --target-access-mode=HostPort

AlexanderYastrebov avatar Jun 29 '22 17:06 AlexanderYastrebov

works for me; I don't care what you make the field, I just want a way to upgrade without replacing my target groups :)

jbilliau-rcd avatar Jun 29 '22 17:06 jbilliau-rcd

I have created a PR https://github.com/zalando-incubator/kube-ingress-aws-controller/pull/523, please have a look.

AlexanderYastrebov avatar Jul 04 '22 21:07 AlexanderYastrebov

Hold on with update until we test it

AlexanderYastrebov avatar Nov 07 '22 16:11 AlexanderYastrebov

Hold on with update until we test it

It should work though, please report if you test yourself

AlexanderYastrebov avatar Nov 07 '22 17:11 AlexanderYastrebov

Yep, just did. Have 3 existing ingresses, went to v0.14.0 from v0.12.16, as well as adding --target-access-mode=Legacy to my container arguments. It came up, said was was "Updating "internet-facing" stack for 1 certificates / 1 ingresses" but didn't actually do anything I could see, and started polling. I then created a new ingress and it properly attached an existing ALB to it. (beta-whale)

Mon, Nov 7 2022 2:28:55 pm | time="2022-11-07T19:28:55Z" level=info msg="starting /kube-ingress-aws-controller v0.14.0"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Kubernetes API server: "
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Cluster ID: final-dev-418023852230-us-east-1"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="VPC ID: vpc-xxxxxxxx"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Instance ID: i-xxxxxxxx"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Security group ID: sg-xx"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Internal subnet IDs: [subnet-xx subnet-xxx subnet-xxxx]"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Public subnet IDs: [subnet-xxx subnet-xx subnet-xx]"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="EC2 filters: tag:kubernetes.io/cluster/final-dev-xxxxx-us-east-1=owned tag-key=k8s.io/role/node"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Certificates per ALB: 24 (SNI: true)"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Blacklisted Certificate ARNs (0): "
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Ingress class filters: alb-istio-public"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="ALB Logging S3 Bucket: "
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="ALB Logging S3 Prefix: "
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="CloudWatch Alarm ConfigMap: "
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Default LoadBalancer type: application"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Target access mode: Legacy"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Run secrets path refresher every 1m0s, but update once first"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
Mon, Nov 7 2022 2:31:00 pm | time="2022-11-07T19:31:00Z" level=warning msg="Disabling RouteGroup support because listing RouteGroups failed: resource not found, to get more information https://opensource.zalando.com/skipper/kubernetes/routegroups/#routegroups"
Mon, Nov 7 2022 2:31:02 pm | time="2022-11-07T19:31:02Z" level=info msg="Updating \"internet-facing\" stack for 1 certificates / 1 ingresses"
Mon, Nov 7 2022 2:32:00 pm | time="2022-11-07T19:32:00Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
Mon, Nov 7 2022 2:33:00 pm | time="2022-11-07T19:33:00Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
Mon, Nov 7 2022 2:34:00 pm | time="2022-11-07T19:34:00Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
Mon, Nov 7 2022 2:34:13 pm | time="2022-11-07T19:34:13Z" level=info msg="Updated ingress beta-whale/beta-whale with DNS name kube-ing-lb-3lglgmyp0843-1862683373.us-east-1.elb.amazonaws.com"
Mon, Nov 7 2022 2:35:00 pm | time="2022-11-07T19:35:00Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
Mon, Nov 7 2022 2:36:00 pm | time="2022-11-07T19:36:00Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
Mon, Nov 7 2022 2:37:00 pm | time="2022-11-07T19:37:00Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
Mon, Nov 7 2022 2:38:00 pm | time="2022-11-07T19:38:00Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"
Mon, Nov 7 2022 2:39:00 pm | time="2022-11-07T19:39:00Z" level=info msg="Updated secret file: /var/run/secrets/kubernetes.io/serviceaccount/token"

Looks good!

jbilliau-rcd avatar Nov 07 '22 19:11 jbilliau-rcd

Then we can close this issue, thanks for the report @jbilliau-rcd !

szuecs avatar Nov 08 '22 20:11 szuecs