
RollingUpgrade not consistent

Open: preflightsiren opened this issue 3 years ago • 18 comments

Is this a BUG REPORT or FEATURE REQUEST?: Bug

What happened: A RollingUpgrade created after a modification to launch templates via instance-manager did not detect any nodes that needed to be recycled. To resolve this, the RollingUpgrade is deleted and instance-manager is restarted, which recreates the RollingUpgrade.

What you expected to happen: The node is detected as being out-of-sync and replaced.

How to reproduce it (as minimally and precisely as possible): I can't consistently reproduce this, but it happens often during our monthly patching cycle.

Anything else we need to know?:

Environment:

  • rolling-upgrade-controller version: 1.0.2
  • Kubernetes version: 1.19.7 | 1.20.10
$ kubectl version -o yaml

Other debugging information (if applicable):

  • RollingUpgrade status:
$ kubectl describe rollingupgrade <rollingupgrade-name>

The rollingupgrade has already been replaced. I did see that the state was "completed".

  • controller logs:
$ kubectl logs <rolling-upgrade-controller pod>
2021-10-12T00:16:48.561Z	INFO	controllers.RollingUpgrade	admitted new rolling upgrade	{"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647}, "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:16:55.636Z	INFO	controllers.RollingUpgrade	scaling group details	{"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "desiredInstances": 1, "launchConfig": "", "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:16:55.736Z	INFO	controllers.RollingUpgrade	checking if rolling upgrade is completed	{"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:16:55.736Z	INFO	controllers.RollingUpgrade	no drift in scaling group	{"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:16:55.836Z	INFO	controllers.RollingUpgrade	rolling upgrade ended	{"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6", "status": "completed"}
2021-10-12T00:17:25.837Z	INFO	controllers.RollingUpgrade	rolling upgrade ended	{"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6", "status": "completed"}
2021-10-12T00:17:47.935Z	INFO	controllers.RollingUpgrade	rolling upgrade ended	{"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6", "status": "completed"}

***** RollingUpgrade is deleted and instance-manager restarted *****

2021-10-12T00:20:59.590Z	INFO	controllers.RollingUpgrade	rolling upgrade resource not found, deleted object from admission map	{"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:21:59.133Z	INFO	controllers.RollingUpgrade	admitted new rolling upgrade	{"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647}, "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	scaling group details	{"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "desiredInstances": 1, "launchConfig": "", "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	checking if rolling upgrade is completed	{"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	drift detected in scaling group	{"driftedInstancesCount/DesiredInstancesCount": "(1/1)", "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	selecting batch for rotation	{"batch size": 1, "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	rotating batch	{"instances": ["i-0017fd066bbfd0e32"], "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	setting instances to in-progress	{"batch": ["i-0017fd066bbfd0e32"], "instances(InService)": ["i-0017fd066bbfd0e32"], "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:02.032Z	INFO	controllers.RollingUpgrade	setting instances to stand-by	{"batch": ["i-0017fd066bbfd0e32"], "instances(InService)": ["i-0017fd066bbfd0e32"], "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:02.433Z	INFO	controllers.RollingUpgrade	operating on existing rolling upgrade	{"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647}, "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:05.033Z	INFO	controllers.RollingUpgrade	scaling group details	{"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "desiredInstances": 1, "launchConfig": "", "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:05.033Z	INFO	controllers.RollingUpgrade	checking if rolling upgrade is completed	{"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:05.033Z	INFO	controllers.RollingUpgrade	drift detected in scaling group	{"driftedInstancesCount/DesiredInstancesCount": "(1/1)", "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}

preflightsiren commented Oct 12 '21

I have two ideas about where the fault could be:

  1. In IsScalingGroupDrifted(): awsprovider.SelectScalingGroup(r.RollingUpgrade.ScalingGroupName(), r.Cloud.ScalingGroups) might return an empty &autoscaling.Group{}, causing the for _, instance := range scalingGroup.Instances loop to exit without doing any work (see the sketch after this comment).
  2. IsInstanceDrifted() might not detect any changes due to an error in the AWS request or caching.

Unfortunately there's no more verbose logging I can enable to validate. I'm going to try setting forceRefresh: true in the instanceGroups, and this might help diagnose which is more likely.

preflightsiren commented Oct 12 '21
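
To make idea 1 concrete, here is a minimal, hypothetical Go sketch of that failure mode. It does not use upgrade-manager's real types or functions; it only assumes, as the idea above suggests, that a cache miss yields an empty scaling group rather than an error, in which case the drift loop never runs and the check reports no drift.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the types referenced above.
type Instance struct {
	InstanceID   string
	LaunchConfig string // config/template version the instance was launched with
}

type ScalingGroup struct {
	Name      string
	Instances []Instance
}

// selectScalingGroup mirrors the idea behind awsprovider.SelectScalingGroup:
// look up the ASG by name in the cached list, returning an empty value
// (not an error) when it is missing.
func selectScalingGroup(name string, groups []ScalingGroup) ScalingGroup {
	for _, g := range groups {
		if g.Name == name {
			return g
		}
	}
	return ScalingGroup{}
}

// isScalingGroupDrifted sketches failure mode 1: with an empty group the
// loop body never executes and the function reports "no drift".
func isScalingGroupDrifted(name, wantConfig string, cache []ScalingGroup) bool {
	scalingGroup := selectScalingGroup(name, cache)
	for _, instance := range scalingGroup.Instances {
		if instance.LaunchConfig != wantConfig {
			return true
		}
	}
	return false // also reached when scalingGroup.Instances is empty
}

func main() {
	// Stale cache that no longer contains the target ASG.
	cache := []ScalingGroup{{Name: "some-other-asg"}}
	fmt.Println(isScalingGroupDrifted("uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "lt-v7", cache)) // false
}
```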

This can be due to caching. Does the modification to the launch template happen outside of instance-manager, i.e. manually, or through some other process? When instance-manager makes a change (e.g. you change some configuration), caching is not in the picture, since that write invalidates the cached item.

upgrade-manager is flushing the cache before every upgrade, so when a CR is received the cache is clean as far as I know.

eytan-avisror commented Oct 12 '21

@preflightsiren can you add your instance-group YAML and the resulting rolling-upgrade YAML as well?

eytan-avisror commented Oct 12 '21

Interesting. The changes are via instance-manager. We just changed it to use forceRefresh: true, which tells me there's something inside IsInstanceDrifted() that is acting weird.

I'll try and get more details when we have a failure.

preflightsiren commented Oct 13 '21

Had this happen today during patching. I've attached an example of a working and an affected InstanceGroup and RollingUpgrade YAML: 309.zip

I think more interesting are the logs for the broken RollingUpgrade:



2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade checking if rolling upgrade is completed {"name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-005a537c368db9bc8", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-052105961e93c27df", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-056a89fcf2f036606", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0631bae2cdbaded68", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0692b82e4cd26c1d8", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0a6180cc3f1658bb9", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0f7155d519873057f", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade no drift in scaling group {"name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}


preflightsiren commented Oct 18 '21

Looks like upgrade.go#457 is being hit, meaning kubeprovider.SelectNodeByInstanceID(instanceID, r.Cloud.ClusterNodes) is failing.

preflightsiren commented Oct 18 '21
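
For reference, the kind of lookup a helper like kubeprovider.SelectNodeByInstanceID has to perform can be sketched as follows. This is a hypothetical, simplified version (the real helper may differ): it matches each cached node's spec.providerID suffix against the EC2 instance ID, so an instance whose node is missing from the cached ClusterNodes slice comes back as "not found" and is skipped, which matches the log lines above.

```go
package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// selectNodeByInstanceID is a simplified stand-in for the controller's helper.
// A node's spec.providerID looks like "aws:///us-west-2b/i-00b789cfd28cd5942",
// so matching on the trailing path segment recovers the EC2 instance ID.
func selectNodeByInstanceID(instanceID string, nodes []corev1.Node) (corev1.Node, bool) {
	for _, node := range nodes {
		parts := strings.Split(node.Spec.ProviderID, "/")
		if parts[len(parts)-1] == instanceID {
			return node, true
		}
	}
	// Reached when the cached node list is stale or incomplete: the caller
	// would log "node object not found in clusterNodes" and skip the instance.
	return corev1.Node{}, false
}

func main() {
	cached := []corev1.Node{
		{Spec: corev1.NodeSpec{ProviderID: "aws:///us-west-2a/i-0017fd066bbfd0e32"}},
	}
	_, ok := selectNodeByInstanceID("i-00b789cfd28cd5942", cached)
	fmt.Println("found:", ok) // false: drift for this instance is never evaluated
}
```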

Finally got around to spending some time debugging; it looks like the r.Cloud state becomes corrupt / out of date. The raw log:

{"cluster":"uw2p-akp-b3","kubernetes":{"container_id":"docker://97face6b823c2bb6efac2b7a5c3f44f6401031f6327a7d99dc5c90aab93c65a3","container_image":"493203180918.dkr.ecr.us-west-2.amazonaws.com/akp-mirror/docker.io/keikoproj/rolling-upgrade-controller:v1.0.2-seb","container_name":"rolling-upgrade-controller","host":"ip-172-24-138-175.us-west-2.compute.internal","labels":{"app":"rolling-upgrade-controller","pod-template-hash":"76cd7b4ccd"},"namespace_name":"upgrade-manager","pod_id":"bee06a61-5ae5-4141-b27d-dadd0bae0eeb","pod_ip":"100.65.225.107","pod_name":"rolling-upgrade-controller-76cd7b4ccd-rm6tj","pod_owner":"ReplicaSet/rolling-upgrade-controller-76cd7b4ccd"},"log":"2022-02-14T01:18:54.252Z\tINFO\tcontrollers.RollingUpgrade\tNil Node details\t{\"instanceID\": \"i-00b789cfd28cd5942\", \"clusterNodes\": [\"ip-172-24-137-169.us-west-2.compute.internal\", \"ip-172-24-138-30.us-west-2.compute.internal\", \"ip-172-24-137-205.us-west-2.compute.internal\", \"ip-172-24-139-28.us-west-2.compute.internal\", \"ip-172-24-138-231.us-west-2.compute.internal\", \"ip-172-24-139-241.us-west-2.compute.internal\", \"ip-172-24-138-206.us-west-2.compute.internal\", \"ip-172-24-136-136.us-west-2.compute.internal\", \"ip-172-24-141-90.us-west-2.compute.internal\", \"ip-172-24-140-216.us-west-2.compute.internal\", \"ip-172-24-136-181.us-west-2.compute.internal\", \"ip-172-24-137-217.us-west-2.compute.internal\", \"ip-172-24-139-39.us-west-2.compute.internal\", \"ip-172-24-136-125.us-west-2.compute.internal\", \"ip-172-24-137-230.us-west-2.compute.internal\", \"ip-172-24-136-128.us-west-2.compute.internal\"]}","stream":"stderr"}

The node being rotated: ip-172-24-138-175.us-west-2.compute.internal, providerID: aws:///us-west-2b/i-00b789cfd28cd5942

The list contains 15 nodes; the cluster has 145 nodes present, 21 uncordoned.

I could continue to dig, but it may just be worth periodically restarting upgrade-manager like it's an NT2k program :)

preflightsiren commented Feb 14 '22

@preflightsiren can you check out #317 and see if that fixes the issue? I believe it should

eytan-avisror commented Feb 19 '22

Maybe. It seems unlikely, as the cache contains the list of nodes in the cluster, not information from EC2. Is there a connection I'm not aware of? I'll pull this patch into my debug branch and see if we still see the error.

preflightsiren commented Feb 19 '22

The cache contains the launch template / versions; if a new version is created and the controller is unaware of it, it would essentially skip the upgrade.

eytan-avisror commented Feb 19 '22
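
A hedged sketch of the scenario described above, using invented, simplified types rather than the controller's actual code: if the cached "latest" launch-template version is stale, an instance launched from the old version compares equal and the upgrade is skipped.

```go
package main

import "fmt"

// cachedTemplate is a hypothetical cache entry; "drift" here means the
// instance was launched from a version other than the cached latest one.
type cachedTemplate struct {
	ID            string
	LatestVersion int64
}

func isInstanceDrifted(instanceVersion int64, templateID string, cache map[string]cachedTemplate) bool {
	lt, ok := cache[templateID]
	if !ok {
		return false // unknown template: nothing to compare against
	}
	return instanceVersion != lt.LatestVersion
}

func main() {
	// instance-manager has just published version 8, but the controller's
	// cache still holds version 7 from before the change.
	staleCache := map[string]cachedTemplate{"lt-0abc": {ID: "lt-0abc", LatestVersion: 7}}
	fmt.Println(isInstanceDrifted(7, "lt-0abc", staleCache)) // false: no drift seen, upgrade skipped
}
```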

OK, I've deployed the patched version to our environment. 🤞 We'll know with more certainty in about 2-3 weeks.

preflightsiren commented Feb 21 '22

Even with #317 we still have the issue. I added some debug logging and can see:

Nil Node details	{"instanceID": "i-05d7694376d812689", "clusterNodes": []}

the r.Cloud.ClusterNodes is empty

preflightsiren commented Apr 12 '22

Not sure how this is the case. @shreyas-badiger, any idea? Could this be a bug with how we use Event Filters here:

https://github.com/keikoproj/upgrade-manager/blob/8e0f67db323ec5b7bea0fd4d9f96d23f499e7e66/controllers/rollingupgrade_controller.go#L197-L232

Do we ever do an initial list of nodes when the controller starts up? If we don't, it will take the controller 30-60 seconds to fill up the cache from node events initially.

@preflightsiren is it possible that this happens when the controller restarts? Or is it spun up right when an upgrade is submitted?

eytan-avisror commented Apr 12 '22

I just checked and that's not the case; even with event filters, the watch lists the nodes immediately when it's registered. Let me dig further and see if I can understand why this is happening.

eytan-avisror commented Apr 12 '22
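
Not upgrade-manager's actual wiring, but a generic client-go illustration of that point: a shared informer performs an initial List, delivers Add events for every existing object, and only then relies on the Watch stream, regardless of how downstream handlers filter. If r.Cloud.ClusterNodes is still empty or partial long after startup, the problem would more likely be in how that slice is refreshed from the informer's store than in the event filters themselves.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (illustration only).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	// Even if handlers filter aggressively, the informer's initial List fills
	// its store with all existing nodes before watch events start flowing.
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			fmt.Println("add event for node:", obj.(*corev1.Node).Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// Blocks until the initial List has been processed; afterwards the store
	// already contains every node in the cluster.
	cache.WaitForCacheSync(stop, nodeInformer.HasSynced)
	for _, obj := range nodeInformer.GetStore().List() {
		fmt.Println("cached node:", obj.(*corev1.Node).Name)
	}
}
```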

"Is it possible that this happens when the controller restarts?" No, the controller was not restarted during the upgrade.

I have an example of a node that joined the cluster long before the RollingUpgrade was created, was not found in the list of ClusterNodes, and the ClusterNodes list was not nil (it contained 143 of the cluster's nodes).

hope this adds some more flavour :)

preflightsiren commented Apr 12 '22

Even with #317 we still have the issue. I added some debug logging and can see:

Nil Node details	{"instanceID": "i-05d7694376d812689", "clusterNodes": []}

the r.Cloud.ClusterNodes is empty

@preflightsiren can you confirm if the nodes were "Ready" or "Not Ready" at this point in time? Interested to know whether nodes were already part of the cluster or not.

shreyas-badiger commented Apr 15 '22

Sorry for the delay in the response; we've been busy debugging during the upgrade process. During an upgrade there's always a mix of ready and unready nodes, but there are always ready nodes in the cluster servicing requests.

The one patch we've applied that seems to have the best result is setting --max-parallel=1; since we applied that, every environment and region has completed without issue.

preflightsiren commented Sep 06 '22

Bump. I've run into this when the new node coming up never joins the cluster (bad AMI update, etc.) but the EC2 instance is a valid member of the ASG. In my case I reverted the bad AMI, but the RollingUpgrade is stuck since the new node never joins, so it can't continue: the cache map of kube nodes and instances is out of whack. Force-removing the instances from the ASG, recycling the controller (to clear the cache map), and redeploying the CRDs works. Love the project btw. The issue happens every now and then.

CAR6807 commented Dec 14 '22