
ECS tasks re-balancing on autoscaling.

Open parabolic opened this issue 5 years ago • 21 comments

There isn't any automated or easy way to re-balance tasks inside an ECS cluster when scaling events happen (usually scaling down). It would be a very nice feature to have. There's already a 3-year-old issue open for this feature: https://github.com/aws/amazon-ecs-agent/issues/225

parabolic avatar Dec 12 '18 12:12 parabolic

Yep, we have to do the same thing. We listen to the scaling event triggers using Lambda, then tweak the scaling rules from there.

Orchestrating the orchestration.

skyzyx avatar Dec 12 '18 18:12 skyzyx

Thank you for the feedback. The ECS team is aware of this issue, and it is under active consideration. +1's and additional details on use cases are always appreciated and will help inform our work moving forward.

@parabolic and @skyzyx what are the criteria you use to rebalance tasks? Are you aiming to binpack on as few instances as possible, or spread evenly, or something else?

coultn avatar Dec 12 '18 23:12 coultn

@coultn: We try to spread evenly.

We treat our servers as cattle (not pets), and use Terraform for Infrastructure as Code. Occasionally, we will need to log into the Console, drain connections on a node, terminate it, and let auto-scaling kick in to replace the node.

One thing that I've noticed (although the Lambda has been working so well that I haven't tried without it for about 6 months) is that after draining connections off a node, it doesn't actually move the containers over to the remaining hosts. Nor does it move them back when the replacement host comes back up.

So we have a Lambda function configured to listen for the scale events, then trigger the rebalancing that way. It's based on the premise in this blog post: https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/
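
For anyone curious, the core of that blog post's pattern boils down to roughly the following (a minimal sketch in Python/boto3, not the post's exact code; it assumes the ASG lifecycle hook publishes to an SNS topic that triggers the Lambda, and the cluster name and retry handling are placeholders):

import json
import boto3

ecs = boto3.client("ecs")
asg = boto3.client("autoscaling")

CLUSTER = "my-cluster"  # placeholder cluster name

def handler(event, context):
    # Lifecycle hook notification delivered via SNS.
    msg = json.loads(event["Records"][0]["Sns"]["Message"])
    instance_id = msg["EC2InstanceId"]

    # Find the ECS container instance backing the terminating EC2 instance.
    arns = ecs.list_container_instances(
        cluster=CLUSTER,
        filter=f"ec2InstanceId == '{instance_id}'",
    )["containerInstanceArns"]

    if arns:
        # Drain it so the service scheduler restarts its tasks elsewhere.
        ecs.update_container_instances_state(
            cluster=CLUSTER, containerInstances=arns, status="DRAINING"
        )
        still_running = ecs.describe_container_instances(
            cluster=CLUSTER, containerInstances=arns
        )["containerInstances"][0]["runningTasksCount"]
        if still_running > 0:
            # Tasks are still draining; retry later (e.g. re-publish to the topic)
            # before letting the ASG proceed.
            return {"status": "DRAINING", "runningTasks": still_running}

    # Nothing left on the instance: let the ASG finish terminating it.
    asg.complete_lifecycle_action(
        LifecycleHookName=msg["LifecycleHookName"],
        AutoScalingGroupName=msg["AutoScalingGroupName"],
        LifecycleActionToken=msg["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return {"status": "TERMINATING"}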

skyzyx avatar Dec 12 '18 23:12 skyzyx

@coultn Thanks for the answer. As for the details, I use the following placement strategies:

type  = "spread"
field = "instanceId"

type  = "binpack"
field = "cpu"

And ideally the tasks should be spread evenly across nodes, e.g. 1 task per node. This setup should be rebalanced or preserved after scaling events without any intermediate logic.
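
For reference, the boto3 equivalent of those strategies is roughly the following (a sketch only; the cluster, service, and task definition names are placeholders):

import boto3

ecs = boto3.client("ecs")

# Rough boto3 equivalent of the placement strategies above.
ecs.create_service(
    cluster="my-cluster",
    serviceName="my-service",
    taskDefinition="my-task:1",
    desiredCount=4,
    placementStrategy=[
        {"type": "spread", "field": "instanceId"},  # aim for one task per node
        {"type": "binpack", "field": "cpu"},        # then pack remaining tasks by CPU
    ],
)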

parabolic avatar Dec 13 '18 09:12 parabolic

Our team has a similar issue. We have alerts set up for our support rotation to notify us when memory is too high on a particular instance. It occasionally happens that we are paged because the alert threshold is breached when the instances are in fact not spread evenly, i.e. one instance has several ECS tasks running and another has few or none.

The policy on the ASG is to maintain MemoryReservation at ~70%.

The placement strategy is

"Field": "attribute:ecs.availability-zone",
"Type": "spread"
"Field": "instanceId",
"Type": "spread"

peterpod-c1 avatar Jun 12 '19 19:06 peterpod-c1

I'm confused/surprised this is not already an automatic feature. Why would I need to set up a Lambda if ECS already knows what's happening?

dfuentes77 avatar Sep 05 '19 21:09 dfuentes77

Do Capacity Providers help with this?

gabegorelick avatar Dec 05 '19 23:12 gabegorelick

Do Capacity Providers help with this?

Yes, if you use managed scaling, capacity providers can help. Here’s how: if you use a target capacity of 100 for your ASG, then ECS will only scale your ASG out in the event that there are tasks you want to run that can’t be placed on the existing instances. So the instances that get added will usually have tasks placed on them immediately - unlike before, where new instances would sometimes sit empty for some time. This isn’t the only rebalancing scenario discussed on this thread, so we still have more work to do.
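
For anyone wondering how managed scaling makes that decision, the driving metric works roughly like this (a simplified sketch of the documented behavior, not the actual service logic):

# CapacityProviderReservation is roughly 100 * M / N, where N is the number of
# instances currently in the capacity provider's ASG and M is the number needed
# to place every desired task. With targetCapacity=100, the ASG scales out only
# when M > N, i.e. when there are tasks that cannot be placed.
def capacity_provider_reservation(instances_needed: int, instances_running: int) -> float:
    return 100.0 * instances_needed / instances_running

print(capacity_provider_reservation(12, 10))  # 120.0 -> scale out toward 12 instances
print(capacity_provider_reservation(7, 10))   # 70.0  -> scale in is considered, but
                                              # instances running non-daemon tasks are
                                              # protected from termination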

coultn avatar Dec 06 '19 02:12 coultn

@peterpod-c1

It occasionally happens that we are paged because the alert threshold is breached when the instances are in fact not spread evenly, i.e. one instance has several ECS tasks running and another has few or none.

We see the same thing regularly. The reason, we find, is that if many tasks are started concurrently (esp. during deployment of new code), a lot of them get dumped onto the same instance, the one with the lowest utilization. By the time the tasks are up and running, the instance that had the lowest utilization is now the one with the highest utilization, and sometimes over-utilized, resulting in memory pressure that is too high. The workaround is to start new tasks "slowly", but that conflicts with the goal of having the services up and running as fast as possible.

I have to add here that we see this issue only in environments where no rolling upgrades of the services are performed, i.e., those where "min healthy percent" is 0. In the environments where "min healthy percent" is greater than 0 this issue does not occur.

lawrencepit avatar Dec 29 '19 04:12 lawrencepit

Do Capacity Providers help with this?

Not quite.

In my tests I started with 4 EC2 instances, each of which can take a max of 4 tasks. The distribution at the start was 4,4,4,1. After putting some load on it, it scaled to 9 EC2 instances running 22 tasks for 15+ minutes, i.e. on average it ran only 2.4 tasks per EC2 instance instead of the expected 3.5+. Running 6 EC2 instances should have been enough. After dropping the load and all the scale-in actions finished, we ended up with 6 EC2 instances running 4,4,2,1,1,1 tasks, instead of the expected 4,4,4,1. So in the end it is running at the additional cost of 2 EC2 instances.

What I expect is that scaling in works in reverse order to scaling out, LIFO style. That way the capacity provider would remove the 2 extra EC2 instances.

The capacity provider used:

            "managedScaling": {
              "status": "ENABLED",
              "targetCapacity": 100,
              "minimumScalingStepSize": 1,
              "maximumScalingStepSize": 100
            },
            "managedTerminationProtection": "ENABLED"

The placement strategy used:

          {
            "field": "attribute:ecs.availability-zone",
            "type": "spread"
          },
          {
            "field": "instanceId",
            "type": "spread"
          }

lawrencepit avatar Jan 03 '20 03:01 lawrencepit

@lawrencepit Thanks for the detailed example. In this specific case, there are two implicit goals you have that are in conflict with each other. Goal 1, expressed through the instance spread placement strategy, is that you want your tasks to be spread across all available instances. Goal 2, not explicitly expressed in your configuration, is that you want to use as few instances as possible. As you point out, ECS cluster auto scaling with capacity providers cannot solve for both of these simultaneously. In fact, goal 2 is generally impossible to achieve optimally (it's known as the binpacking problem and there are no optimal solvers that can run in any reasonable amount of time). However, if you want to at least do a better job of meeting goal 2, you could remove goal 1 - don't use instance spread, but use binpack on CPU or memory instead. Binpack placement strategy isn't optimal (for reasons stated previously) but it will generally use fewer instances than instance spread.

My last comment is regarding LIFO. Placement strategies don't work that way - instead, they try to maintain the intent of the placement strategy as the service scales in and out. Using LIFO for scaling in would actually cause the tasks to be spread across fewer instances, which is not the intent of the instance spread strategy.

coultn avatar Jan 03 '20 17:01 coultn

@coultn While achieving goal 2 "optimally" may not be possible, we've achieved reasonably good results by having our scaler Lambda drain least-utilized instances when it detects that there is more than one instance worth of excess capacity in the cluster. The argument could be made that this goes against goal 1, but I don't think that's the case. Your definition of that goal is that the tasks should be spread across available instances. If an instance that's surplus to resource requirements is no longer considered "available", then goal 1 is still achieved. Considering that placement strategies are set at the service level and seem to be best-effort, while resource availability is more of a cluster-level concept, that strategy seems to be compatible with both goals.
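
For illustration, the selection logic for that kind of scaler could look roughly like this (a simplified sketch, not the actual Lambda; it only considers CPU and ignores memory and placement constraints):

import boto3

ecs = boto3.client("ecs")

def cpu(resources):
    # CPU is reported as an integer resource on container instances.
    return next(r["integerValue"] for r in resources if r["name"] == "CPU")

def pick_instance_to_drain(cluster):
    arns = ecs.list_container_instances(cluster=cluster, status="ACTIVE")[
        "containerInstanceArns"
    ]
    if not arns:
        return None
    instances = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )["containerInstances"]

    spare = sum(cpu(i["remainingResources"]) for i in instances)
    largest = max(cpu(i["registeredResources"]) for i in instances)

    # More than one instance worth of headroom: drain the least-utilized
    # instance and let the scheduler replace its tasks on the others.
    if spare > largest:
        least_utilized = max(instances, key=lambda i: cpu(i["remainingResources"]))
        return least_utilized["containerInstanceArn"]
    return None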

idubinskiy avatar Mar 10 '20 19:03 idubinskiy

One doubt about binpack-on-memory task distribution for services with capacity provider auto scaling: does it stop and start tasks that are in a running state? I have some tasks that run cron jobs, so does it restart them in the middle of an execution? I have enabled managed termination on the capacity provider (and my ASG).

ankit-sheth avatar Sep 02 '20 04:09 ankit-sheth

This would be very much appreciated.

We have a setup that roughly represents what's described in https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch_alarm_autoscaling.html and in order to replace tasks before instance termination we too use the approach described in the blog post mentioned by @skyzyx https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/. We use the default autoscaling termination policy for determining instances to terminate.

Because we do not wish to maintain this intricate system, it was our intention to start using ECS cluster auto scaling. However, "Design goal #2: CAS should scale in (removing instances) only if it can be done without disrupting any tasks (other than daemon tasks)" is totally off point for our use case.

I can just see the case arising where there are, e.g., 20 container instances running 20 tasks at a CPU and memory utilization of, say, 10%, as would almost certainly happen to services that have a placement strategy of type spread on the field instanceId.

What we would like is more of a best-effort spread (kind of what's described by @idubinskiy) and the ability to choose metrics other than CapacityProviderReservation for scaling, such as CPUReservation thresholds.

AletaMP avatar Nov 10 '20 19:11 AletaMP

Surely the ECS managed scaling capacity provider should decide how many instances are required (given the utilisation target, resource requirements and placement requirements) and remove all unnecessary nodes (starting with the least utilised). The placement strategy defined should work within the constraint of the number of available nodes - it's a placement strategy, NOT a requirement, after all.

At the moment, the problem is that the ECS auto scaling logic will not consider any node that has a running task on it. That should be optional - with another option to define whether the tasks should be recreated elsewhere first (similar to the logic here).

calummoore avatar Jul 21 '21 14:07 calummoore

Hi everyone, I'm a scientist working at AWS on long term plans for offering task rebalancing on ECS. I'm commenting here with some of our potential implementation ideas to get early feedback.

Currently, based on simulations, it looks like the most straightforward thing we can do is to make it possible to 'opt in' a service to be rebalanced. This opt-in would mean that the tasks in the service don't get the same termination protection they currently do when usage/utilization falls below the target. In ECS right now, an instance won't be scaled out of the cluster while even one service task is still running on it. Under this change, if the utilization of the service falls, an under-utilized instance (or instances) could be terminated, and any service tasks dropped with the instance that are still needed would be restarted on another instance. This would definitely only work with stateless types of applications (i.e. you wouldn't want a task/job to be restarted if state matters).

Simulations suggest that this type of approach will lead to better utilization of your cluster (with associated cost savings). The main tradeoff is that you will get more 'heat' or usage on the remaining instances, which could impact availability if you have a very high utilization target. An additional benefit to this approach is that you can maintain better balance across instances/AZs and have some automated handling of service weighting during AZ failures (and rebalancing after the AZ is restored).

Some questions I have about your goals as our customers:

  • Are we ok restricting this to stateless types of container applications only? Do we need to look at using something like the docker checkpoint command to move running stateful workloads around?

  • My assumption is that most people will want to balance cost and availability with a feature like this (i.e. the goal is to ensure that things are spread out and available without dropping the utilization too much below the target). Is this correct? Is there anybody that just wants to pack things as tightly as possible with the max possible utilization?

nathangeology avatar Aug 09 '22 15:08 nathangeology

Great to hear that someone is looking into this again from AWS' side. I've been looking at our cluster reservation rates, where we are extremely off-target due to the capacity provider rule that it will never scale in if doing so would disrupt even a single task, and was starting to come to the conclusion that I would need to build a similar kind of "rebalancing" Lambda as many others here. My general thought at this point would be to listen for container instance state changes and terminate any container instance running at least 1 non-daemon task if that task could fit on another container instance. A lifecycle hook would drain the container instance, giving the scheduler a chance to reevaluate where to place each task.
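
For what it's worth, the "could that task fit on another container instance" check could be sketched roughly like this (illustrative only; it compares task-level CPU/memory reservations against remaining resources and ignores ports, GPUs, and placement constraints):

import boto3

ecs = boto3.client("ecs")

def reserved(resources, name):
    return next((r["integerValue"] for r in resources if r["name"] == name), 0)

def fits_elsewhere(cluster, task, source_instance_arn):
    # `task` is a dict as returned by describe_tasks; task-level cpu/memory may be
    # unset for the EC2 launch type, so treat missing values as 0 in this sketch.
    td = ecs.describe_task_definition(
        taskDefinition=task["taskDefinitionArn"]
    )["taskDefinition"]
    need_cpu = int(td.get("cpu") or 0)
    need_mem = int(td.get("memory") or 0)

    arns = ecs.list_container_instances(cluster=cluster, status="ACTIVE")[
        "containerInstanceArns"
    ]
    others = [a for a in arns if a != source_instance_arn]
    if not others:
        return False
    instances = ecs.describe_container_instances(
        cluster=cluster, containerInstances=others
    )["containerInstances"]
    return any(
        reserved(i["remainingResources"], "CPU") >= need_cpu
        and reserved(i["remainingResources"], "MEMORY") >= need_mem
        for i in instances
    )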

Are we ok restricting this to stateless types of container applications only? Do we need to look at using something like the docker checkpoint command to move running stateful workloads around?

I can only speak for myself, but in our case we would be more than happy to disrupt running tasks if it meant more reasonable packing. Like many others in this thread, we use a combination of AZ spread, spread on instanceId and then binpack, with the intention being that we want to spread the tasks of a service over as many of the available instances as possible, while trying to make sure that the container instances have as high a reservation rate as possible.

Trying to move state around is in my opinion a fool's errand, and is better solved with a termination lifecycle hook and sigterm handling in the applications. If you haven't already, I would suggest looking at Descheduler. Even with stateless applications, I might still want to have some amount of control to make sure that the same service doesn't get unlucky and end up being excessively interrupted, rather than a binary opt-in, but I wouldn't consider it a must if it makes the feature much more complicated.

My assumption is that most people will want to balance cost and availability with a feature like this (i.e. the goal is to ensure that things are spread out and available without dropping the utilization too much below the target). Is this correct?

In my case that would be correct, but this is already expressed through the placement strategies. Someone using only binpack without any spread strategy would be trying to maximize reservation with no regard for availability (within the bounds of the minimum/maximum healthy percentage service configuration). I would consider doing this in testing or non-production environments if the reservation rate was significantly worse than when combined with the spread strategy.

If you or the team would like to speak to a customer that is extremely keen on this, please reach out through the TAMs for Klarna, and I'd be happy to make the time.

Nevon avatar Aug 14 '22 09:08 Nevon

Thanks for the feedback Nevon, looking forward to discussing this more. The K8s Descheduler is definitely on our radar here, and I've been talking with our Kubernetes service teams about pod/task rebalancing as well (particularly AWS Karpenter), since this is a general container application optimization that applies to both orchestrators.

In my case that would be correct, but this is already expressed through the placement strategies. Someone using only binpack without any spread strategy would be trying to maximize reservation with no regard for availability (within the bounds of the minimum/maximum healthy percentage service configuration).

Definitely agree that we can leverage expressed placement strategies as a part of re-balancing. There's always a bit of tension between adding lots of options/customization and making it easy to set up a container service/app well with fewer and simpler options. This is an area where feedback really helps us, because we can bucket the controls for task re-balancing into:

  1. This really needs to be an option that everyone sets
  2. This is an advanced menu sort of option
  3. This is something we should just do and not bother people with worrying about

My general inclination for ECS is to take care of as much as possible because this should just work well for people with more entry level expertise in containers, but we definitely don't want to leave large customers and power users without any tools to optimize things beyond what simpler options can offer.

Trying to move state around is in my opinion a fool's errand, and is better solved with a termination lifecycle hook and sigterm handling in the applications.

Using termination lifecycle hooks (ala the way spot instances communicate with applications before terminating) and the like is definitely the easier and more straightforward approach. If someone has a need for balancing stateful applications it still might be good to get the engineers thinking about it and assessing what it would take to make it work, but if no one needs that I'm definitely happy to keep it simple hehe.

I might still want to have some amount of control to make sure that the same service doesn't get unlucky and end up being excessively interrupted, rather than a binary opt-in, but I wouldn't consider it a must if it makes the feature much more complicated.

This is going to be a big focus area in the design phase with the engineers, because we want rebalancing to help with keeping services balanced across instances/AZs, but we need to do it in a way that doesn't cause a service to fail because the rebalancer happened to be in an AZ that went down (no introducing cross-AZ dependencies if at all possible!).

Thanks again for the feedback, looking forward to our discussions. Also, anyone else with feedback, questions, etc please feel free to join in. If you’d like to reach out to me directly my alias at AWS is jonesflp.

nathangeology avatar Aug 19 '22 17:08 nathangeology

+1

srknc avatar Jan 20 '23 22:01 srknc

Hi, is there any progress with this issue? Anything in the works on the AWS side?

farbanas avatar Nov 24 '23 15:11 farbanas

+1

taras-mrtn avatar May 10 '24 14:05 taras-mrtn