
Allow update operator to update ASG launch configuration after update

Open rothgar opened this issue 3 years ago • 5 comments

Image I'm using: 328549459982.dkr.ecr.us-west-2.amazonaws.com/bottlerocket-update-operator:v0.1.4

Issue or Feature Request: If I deploy an ASG with Bottlerocket 1.0.0 and the update operator updates instances to 1.1.0, I may run into problems as the ASG scales up.

New instances would have Bottlerocket 1.0.0 and, if tagged automatically, will be rebooted by the operator once they join the cluster. There should be a way to avoid upgrading/rebooting new instances by updating the launch template once a user-specified threshold of the ASG has been upgraded (e.g. 50% or 100%).

With in-place upgrades it may also be hard to track which version of the OS is deployed, because describing the instance via the AWS API will show an old AMI ID even though the running OS version is up to date.

rothgar avatar Aug 31 '20 21:08 rothgar

This is definitely concerning to us. We're trying to adopt Bottlerocket but I'm a bit worried about the interaction with cluster-autoscaler as it currently stands:

  1. Autoscaler requests a new node.
  2. Node starts up on older bottlerocket AMI.
  3. Pods begin to schedule onto the node.
  4. Update operator sees node is behind and initiates update.
  5. Update operator drains pods from the node and reboots it. Pods have to reschedule somewhere, but may have trouble finding a node since we don't have excess capacity (thus why the cluster was autoscaling).
  6. Node comes back online and pods are now able to schedule to it.

This could be disruptive, especially in our dev environment where we have a high rate of node churn.

jessebye avatar Feb 04 '21 21:02 jessebye

Our use case is basically the same as @jessebye's, pretty standard.

We have an EKS cluster, set up with eksctl. It's not under heavy use yet, we're getting there.

The cluster has two node groups at this point. They are managed by eksctl, so basically each group of nodes is an Auto Scaling group.

We have cluster-autoscaler running there, which spins nodes up and down as demand for resources fluctuates. It does that by adjusting the Desired capacity of the respective Auto Scaling group when it determines that more (or fewer) k8s nodes are needed to run the pending pods. I'm pretty sure everyone knows this, but I'm writing it anyway.

So there would be three "smart" things adjusting the number of nodes in a group.

  • the Auto Scaling group, but it doesn't do much. I believe it doesn't do any scaling unless a node stops responding to EC2 health checks or the cluster-autoscaler changes the group's config.
  • the cluster-autoscaler, which looks at K8s pod usage and starts/stops nodes as needed.
  • the Bottlerocket update operator, which looks at node OS version and can reboot nodes for updates (right?)

The first two work fine together. I'm worried about what will happen if the updater kicks in at the wrong time. There are two high-level interactions that I think might cause trouble:

  • Can the update operator cause an EC2 instance to stop responding to EC2 health checks and cause the Auto Scaling group to attempt to replace the instance?
  • What happens when/if the updater decides to update an instance at the same time as the cluster autoscaler kicks in to add or remove nodes? The most likely scenario is the one @jessebye described, but also: what happens if the updater initiates an update of an instance that is already being drained and stopped by the autoscaler?

So, basically, do the Bottlerocket update operator and cluster autoscaler play nice together?

If the actual cluster config helps, we can probably arrange that.

bgdnlp avatar Feb 06 '21 20:02 bgdnlp

Thanks for raising this issue, and sorry for the latency on this response. The Bottlerocket team has been discussing this issue and we’ve decided that it doesn’t seem quite right for the update-operator to directly modify an AutoScaling group, especially when that ASG may be managed by external automation like CloudFormation. Despite that, we’re still working on providing an alternative method for resolving this problem and will update this issue to provide more details in the future.


> Can the update operator cause an EC2 instance to stop responding to EC2 health checks and cause the Auto Scaling group to attempt to replace the instance?

Based on this article, it does seem that reboots could trigger ASGs to take action by terminating hosts slated for updates. I haven’t tested yet, but I could see this being prone to thrashing in cases where the node being updated is freshly launched by autoscaling. One way we could tackle this is by allowing the update operator to be optionally configured to interact with the EC2 AutoScaling APIs and place Bottlerocket nodes into Standby prior to issuing updates. We don’t currently interact with EC2 APIs, so we’ll need to look into an appropriate design for optionally integrating in this way.
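For anyone wanting to experiment with this flow by hand before any operator integration exists, the Standby approach described above can be sketched with the AWS CLI. The instance ID and ASG name below are placeholders, and brupop does not do any of this today:

```shell
# Hypothetical sketch: move a node's instance into Standby before a
# Bottlerocket update, then return it to service afterwards.
INSTANCE_ID="i-0123456789abcdef0"   # placeholder instance ID
ASG_NAME="my-bottlerocket-asg"      # placeholder ASG name

# Standby keeps the ASG from replacing the instance when the
# reboot causes it to fail EC2 health checks.
aws autoscaling enter-standby \
  --instance-ids "$INSTANCE_ID" \
  --auto-scaling-group-name "$ASG_NAME" \
  --should-decrement-desired-capacity

# ... perform the Bottlerocket update and reboot here ...

# Return the instance to InService once it is healthy again.
aws autoscaling exit-standby \
  --instance-ids "$INSTANCE_ID" \
  --auto-scaling-group-name "$ASG_NAME"
```

Note that `--should-decrement-desired-capacity` avoids the ASG launching a replacement while the node is in Standby; without it, the ASG would try to keep the desired count by adding a new (possibly out-of-date) instance.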

cbgbt avatar Aug 26 '21 16:08 cbgbt

There's nothing new to report on this yet, but this is still something that we're looking into improving. Reshuffling the issue to the backlog now that brupop 0.1.x has been replaced.

cbgbt avatar Feb 21 '22 21:02 cbgbt

> There's nothing new to report on this yet, but this is still something that we're looking into improving. Reshuffling the issue to the backlog now that brupop 0.1.x has been replaced.

Hey there team! Just wanted to chime in on this as we're also running into this problem and would love to know if there's any updates since Feb on where this is in the scheme of things?

WilboMo avatar May 26 '22 17:05 WilboMo

Great news for a longstanding issue: EC2 Autoscaling has introduced a capability to use SSM parameters directly in launch templates:

  • https://aws.amazon.com/about-aws/whats-new/2023/01/amazon-ec2-launch-templates-aws-systems-manager-parameters-amis/
  • https://docs.aws.amazon.com/autoscaling/ec2/userguide/using-systems-manager-parameters.html

Using this in conjunction with Brupop should resolve this. I'll try it out just to be sure, but I believe we may be able to close this one.
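For reference, you can check what AMI the public Bottlerocket SSM parameter currently resolves to before wiring it into a launch template (adjust the Kubernetes version and architecture in the path as needed):

```shell
# Resolve the latest Bottlerocket AMI ID for aws-k8s-1.25 on x86_64.
# The parameter path follows Bottlerocket's documented public SSM scheme.
aws ssm get-parameter \
  --name /aws/service/bottlerocket/aws-k8s-1.25/x86_64/latest/image_id \
  --query Parameter.Value \
  --output text
```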

cbgbt avatar Jan 20 '23 01:01 cbgbt

I have confirmed that this works as expected by modifying the launch template of an existing nodegroup in my cluster to use resolve:ssm:. I've also opened an issue against eksctl to add support for specifying SSM parameters during nodegroup creation.
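The launch template change described above can be sketched with the AWS CLI roughly as follows. The launch template ID is a placeholder, and this assumes an unmanaged nodegroup whose template you control directly:

```shell
# Hypothetical sketch: point an existing launch template at the SSM
# parameter so newly launched nodes always get the latest Bottlerocket AMI.
aws ec2 create-launch-template-version \
  --launch-template-id lt-0123456789abcdef0 \
  --source-version '$Latest' \
  --launch-template-data \
    '{"ImageId":"resolve:ssm:/aws/service/bottlerocket/aws-k8s-1.25/x86_64/latest/image_id"}'

# Make the new version the default so the ASG uses it for future launches.
aws ec2 modify-launch-template \
  --launch-template-id lt-0123456789abcdef0 \
  --default-version '$Latest'
```

With this in place, the `resolve:ssm:` reference is re-evaluated at instance launch time, so scale-ups pick up the current AMI rather than the one that was pinned when the template was created.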

cbgbt avatar Jan 20 '23 21:01 cbgbt

@cbgbt Your ticket in eksctl was just re-opened (https://github.com/weaveworks/eksctl/issues/6174), hopefully this leads to a patch in eksctl to allow for defining this at nodegroup/cluster provisioning.

I tried updating existing nodegroups, but it appears that this change is not allowed on managed nodegroups. Specifically, setting the "resolve:ssm:/aws/service/bottlerocket/aws-k8s-1.25/x86_64/latest/image_id" param seems to be interpreted as defining a custom AMI which is not supported in managed node groups.

When you ran the testing on your end, did you have any luck updating managed node groups to use ssm param?

kitsirota avatar Apr 25 '23 16:04 kitsirota

@kitsirota thanks for chiming in on that issue.

I hadn't tested using managed node groups -- my thinking there was that the AMI launches and lifecycle were intended to be controlled using the mechanisms put in place by EKS. My test cases used unmanaged nodegroups wherein I modified the launch templates after cluster creation.

Do you mind saying a bit more about your use-case? Are there features of managed nodegroups that you're drawn to, even if you were to carve your own path with the update mechanism?

cbgbt avatar Apr 25 '23 19:04 cbgbt

@cbgbt It's actually an interesting point. We were originally on managed nodegroups using Amazon Linux 2 AMIs, and the ease of managing those node groups with eksctl established the process we are on now. I have not tested unmanaged node group deployments, and outside of ASGs I am not sure exactly what the limitations are for our use-case (which includes the autoscaler controller).

I'll ping back once we've had a chance to rebuild an environment with unmanaged node groups to determine if this is still applicable to our deployments.

The goal with all of this has been to move to Bottlerocket for the very clear case of being optimized for our workloads with a minimal attack surface. The update operator is the last piece of the equation to ensure all of our nodes are consistently up to date and there is an audit trail to ensure we keep up with these updates.

Without brupop, we're back to the same issues we had with AL2 where we had to manually push out updates across our nodegroups. In an ideal case, this project would help us resolve that issue moving forward.

Thank you again for all your help on this!

kitsirota avatar Apr 26 '23 22:04 kitsirota