
[service] [request]: Customer defined capacity unit

davesade opened this issue 1 year ago • 0 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request Customers could have the option to define their own "capacity unit" (CU), sized freely to fit their unique needs, for example 1 CU = 1 CPU + 2 GB RAM. Then, at the autoscaling group level, one defines a list of possible instance types with various amounts of CPU and memory, and the customer assigns a capacity to each instance type in terms of that unit. For example: t2.small = 1 CU, t2.medium = 2 CU, c5.xlarge = 3 CU, c5.2xlarge = 4 CU, etc. The rule is simple: select instances based on a common denominator, which can of course be anything.

When new ECS tasks are about to start, they report how much capacity is actually needed, and the autoscaling group selects the closest instance type from the list of instance type overrides to satisfy the needs of the new tasks.
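A minimal sketch of the proposed selection rule, using the example CU ratings above. The instance names and ratings come from the request; the function name and "pick the smallest type whose rating covers the need" policy are illustrative assumptions, not an actual ECS API:

```python
# Customer-defined CU ratings per instance type (from the example above,
# assuming 1 CU = 1 CPU + 2 GB RAM).
INSTANCE_CUS = {
    "t2.small": 1,
    "t2.medium": 2,
    "c5.xlarge": 3,
    "c5.2xlarge": 4,
}

def pick_instance(needed_cus: int) -> str:
    """Return the smallest-rated instance type that covers the needed CUs.

    If no single type is large enough, fall back to the largest type;
    the remaining CUs would trigger another scaling round.
    """
    candidates = sorted(INSTANCE_CUS.items(), key=lambda kv: kv[1])
    for name, cus in candidates:
        if cus >= needed_cus:
            return name
    return candidates[-1][0]

print(pick_instance(1))  # t2.small
print(pick_instance(3))  # c5.xlarge
```

With such a rule, a pending 1-CU task would pull a t2.small rather than whatever happens to be first in the overrides list.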

Which service(s) is this request for? ECS, Capacity Providers

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? We understand that a Capacity Provider for an ECS cluster is ideal for mostly uniform workloads, i.e. when tasks have similar CPU and memory requirements. If your task definitions span a wider range of requirements (starting at 0.25 CPU / 128 MB RAM and ending at 8 CPU / 16 GB RAM), autoscaling becomes inefficient.

We observed the following behaviour (and AWS Support confirmed it is designed that way): our autoscaling group for the ECS capacity provider had two instance types defined, small and large. We had various task requirements: a small task can run on a small instance, and, for example, four small tasks can run on a single large instance. However, when a large task was about to run, the capacity provider requested a small instance first, as it was first in the list of overrides. Because a small instance couldn't satisfy the needs of the large task, that request was rejected, and the process repeated up to 4 times. If after 4 attempts no large instance was found, the alarm on Capacity Provider Reservation went silent, attempts to request a new instance were suspended, and the large task remained in the PROVISIONING state forever.

If one removes the small instance type from the autoscaling group overrides, the large task runs as expected. However, in that case, when a single small task is about to run, the autoscaling group delivers a large instance, which is heavily underutilised. That is expensive and inefficient.
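The stuck-provisioning behaviour described above can be sketched as a small simulation. The two instance sizes and the 4-attempt limit are taken from the description; the override-order selection and all names are illustrative assumptions, not ECS internals:

```python
# Observed behaviour (per the report): with multiple overrides, the capacity
# provider keeps requesting the first instance type in the list, so a task
# too large for that type stalls once the retry budget is exhausted.
OVERRIDES = ["small", "large"]            # order matters: "small" is tried first
INSTANCE_FITS = {                         # task sizes each instance type can host
    "small": {"small"},
    "large": {"small", "large"},
}
MAX_ATTEMPTS = 4                          # retry limit from the description

def provision(task_size: str) -> str:
    for _attempt in range(MAX_ATTEMPTS):
        requested = OVERRIDES[0]          # always the first override, per the report
        if task_size in INSTANCE_FITS[requested]:
            return f"RUNNING on {requested}"
        # request rejected: instance cannot fit the task; try again
    return "PROVISIONING (stuck)"         # alarm goes quiet, no further requests

print(provision("small"))  # RUNNING on small
print(provision("large"))  # PROVISIONING (stuck)
```

Retrying the same first override can never succeed for the large task, which is exactly why a CU-aware selection rule would help.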

Having an option to manually define a capacity unit at the task and instance type level would improve the quality of Capacity Providers.

Are you currently working around this issue? The smallest instance type in the autoscaling group overrides must be capable of satisfying the largest task.

davesade avatar Jul 28 '22 09:07 davesade