
ASG Warmpool instances join before Lifecycle hook is in effect

Open · jim-barber-he opened this issue 9 months ago · 12 comments

/kind bug

1. What kops version are you running? The command kops version will display this information.

Client version: 1.28.5 (git-v1.28.5)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Server Version: v1.28.9

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Create a number of new instance groups, some of which have spec.warmPool defined with spec.warmPool.enableLifecycleHook: true set. In my case I created 9 instance groups, 3 of which had warm pools enabled. All instance groups were created by passing their manifests to kops create -f; once that was done for all of them, a kops update cluster --admin --yes command was run to put them into effect.
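For illustration, a minimal sketch of the workflow described above, assuming a hypothetical cluster my.example.com and made-up instance group names, machine types, and sizes (not the actual manifests used):

```shell
# One of several instance group manifests, this one with a warm pool and
# the lifecycle hook enabled. Repeated for each instance group.
kops create -f - <<'EOF'
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my.example.com
  name: web-1
spec:
  role: Node
  machineType: m5.large
  minSize: 1
  maxSize: 3
  subnets:
    - us-east-1a
  warmPool:
    minSize: 1
    maxSize: 2
    enableLifecycleHook: true
EOF

# Once all instance groups exist, apply everything in one pass.
kops update cluster --name my.example.com --admin --yes
```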

5. What happened after the commands executed?

All the instance groups were created fine, and everything appeared to be without issue...

But about an hour later, when some of the instances that were in the warm pool were pressed into service, they joined the cluster as expected and started serving web traffic for our application, until the ASG terminated them after they had been in service for 10 minutes. Investigation via the AWS CloudTrail logs showed that these EC2 instances had tried to call the CompleteLifecycleAction API a number of times when they joined the warm pool, with every attempt producing a ValidationException error with the message No active Lifecycle Action found with instance ID i-xxxxxxxxxxxxxxxxx. So they had come up before the lifecycle hook was in place.
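If it helps anyone confirm the same failure mode, the failed calls can be dug out of CloudTrail with something like the following (the region is a made-up example):

```shell
# Find recent CompleteLifecycleAction calls and inspect their error details.
aws cloudtrail lookup-events \
  --region us-east-1 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CompleteLifecycleAction \
  --query 'Events[].CloudTrailEvent' \
  --output text
# The failing events contain "errorCode": "ValidationException" and
# "errorMessage": "No active Lifecycle Action found with instance ID i-...".
```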

Then later, when they were put into service, the lifecycle hook was in place on the ASG, but these instances did not try to notify it, so the ASG assumed they were unhealthy after 10 minutes and terminated them. In our case it killed our web servers, causing a complete outage of our application for a while.
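As a stopgap until the root cause is fixed, the hook's presence can be verified and an affected instance kept in service by completing the lifecycle action on its behalf; the ASG name and hook name below are placeholders for whatever kops created:

```shell
# Verify the lifecycle hook now exists on the ASG.
aws autoscaling describe-lifecycle-hooks \
  --auto-scaling-group-name web-1.my.example.com

# Complete the lifecycle action manually for an instance that failed to
# notify the hook, before the heartbeat timeout expires.
aws autoscaling complete-lifecycle-action \
  --auto-scaling-group-name web-1.my.example.com \
  --lifecycle-hook-name <hook-name-from-the-describe-call> \
  --instance-id i-xxxxxxxxxxxxxxxxx \
  --lifecycle-action-result CONTINUE
```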

6. What did you expect to happen?

Instances that have joined the warm pool should properly execute the ASG lifecycle hook notification when pressed into service, so that the ASG won't terminate them after 10 minutes.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

I am unable to reproduce this problem if I create just one new instance group at a time. In that case the lifecycle hook is in effect before any new instances need it.

But I am able to reliably reproduce it when I create a number of instance groups at once (9 of them, since I was reproducing the problem we had in production). My guess is that you need a lot of them to trigger some AWS API throttling or something similar that delays creation of the hooks.

I have a Slack topic detailing the issue: https://kubernetes.slack.com/archives/C3QUFP0QM/p1715733980241579

The problem would be fixed if instances were not added to the ASG (or its warm pool) until all of its lifecycle hooks had been created.
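Expressed against the raw AWS Auto Scaling API, the ordering being suggested would look roughly like the sketch below: create the launch hook first, and only attach the warm pool afterwards, so no instance can launch before the hook exists. All names and timeouts are hypothetical, and this sketches the suggestion, not what kops currently does.

```shell
# 1. Create the lifecycle hook on the ASG first.
aws autoscaling put-lifecycle-hook \
  --auto-scaling-group-name web-1.my.example.com \
  --lifecycle-hook-name kops-warmpool-hook \
  --lifecycle-transition autoscaling:EC2_INSTANCE_LAUNCHING \
  --heartbeat-timeout 600 \
  --default-result ABANDON

# 2. Only then attach the warm pool, so its instances launch with the
#    hook already active.
aws autoscaling put-warm-pool \
  --auto-scaling-group-name web-1.my.example.com \
  --min-size 1 \
  --max-group-prepared-capacity 2
```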

jim-barber-he · May 21 '24 04:05