
[ECS] [request]: Control which containers are terminated on scale in

Open lox opened this issue 5 years ago • 84 comments

We use ECS for auto-scaling build agents for buildkite.com. We are using custom metrics and several Lambdas to scale the ECS service that runs our agent based on pending CI jobs. Presently, when we scale in the DesiredCount on the service, it seems random which running containers get killed. It would be great to have more control over this, either a customizable timeframe to wait for containers to stop after being signaled or something similar to EC2 Lifecycle Hooks.

We're presently working around this by handling termination as gracefully as possible, but it often means cancelling an in-flight CI build, which we'd prefer not to do if other idle containers could be selected.

lox avatar Jan 21 '19 01:01 lox

What we do is use StepScaling (instead of SimpleScaling), because once the ASG triggers the termination process, it does not block any further scaling activities. In addition, we have a lifecycle hook which sets the instance to DRAINING (in ECS) and waits until all tasks are gone (or the timeout expires). It's based on this blog post: https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/
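
Roughly, the drain step of that hook looks like the sketch below (boto3, with placeholder cluster/ASG/hook names; the real Lambda also polls until the task list is empty before completing the lifecycle action):

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

def handle_termination(instance_id, cluster="my-cluster",
                       asg_name="my-asg", hook_name="drain-hook"):
    """Drain the ECS container instance, then let the ASG continue termination."""
    # Map the EC2 instance id to its ECS container instance ARN.
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    described = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    ci_arn = next(ci["containerInstanceArn"]
                  for ci in described["containerInstances"]
                  if ci["ec2InstanceId"] == instance_id)

    # Put the instance into DRAINING so the service reschedules its tasks elsewhere.
    ecs.update_container_instances_state(
        cluster=cluster, containerInstances=[ci_arn], status="DRAINING")

    # Once no tasks remain (in practice, poll list_tasks until empty),
    # tell the lifecycle hook to proceed with termination.
    if not ecs.list_tasks(cluster=cluster, containerInstance=ci_arn)["taskArns"]:
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=hook_name, AutoScalingGroupName=asg_name,
            LifecycleActionResult="CONTINUE", InstanceId=instance_id)
```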

pgarbe avatar Jan 21 '19 15:01 pgarbe

Thanks @pgarbe, but I'm not sure how that helps! I'm talking about scaling in ECS tasks when an ECS service gets a decreased DesiredCount.

lox avatar Jan 25 '19 06:01 lox

What I understood is that you want to keep the EC2 hosts running as long as some tasks are still running on them, right? Even when the instance is marked for termination by the Auto Scaling group. You can't really control which EC2 instance gets terminated, but with the lifecycle hook I mentioned above, you can delay the termination until all tasks are gone.

pgarbe avatar Jan 29 '19 09:01 pgarbe

Apologies if I've done a bad job of explaining myself @pgarbe, that is not at all what I mean. The autoscaling I am talking about is the autoscaling of ECS Tasks in an ECS Service, not the EC2 instances underneath them. As you say, there are a heap of tools for controlling the scale in and out of the underlying instances, but what I'm after are similar mechanisms for the ECS services.

Imagine you have 100 "jobs" that need processing, and you run "agents" to process those jobs as ECS tasks in an ECS service whose DesiredCount is controlled by auto-scaling. The specific problem I am trying to solve is how to intelligently scale in the ECS tasks that aren't running jobs. Currently, setting DesiredCount on the ECS service seems to basically pick tasks at random to kill. I would like some control (like lifecycle hooks provide for EC2) to make sure that tasks finish their work before being randomly terminated.

lox avatar Feb 01 '19 21:02 lox

Ok, got it. Unfortunately, in that case, I can't help you much.

pgarbe avatar Feb 04 '19 08:02 pgarbe

I have this same issue. I am using Target Tracking as my scaling policy, tracking CPU utilization. So whenever it does a scale-in, it kills tasks for that service even if there are clients connected to them. I would love to know if there's a way to implement some kind of lifecycle hook or a draining status so it will only kill a task once all of its connections are drained.
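
For context, the policy I'm describing is plain target tracking on service CPU, along these lines (a sketch with placeholder cluster and service names):

```python
import boto3

aas = boto3.client("application-autoscaling")

# Register the service's DesiredCount as a scalable target,
# then track average CPU utilization at 50%.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",  # placeholder names
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1, MaxCapacity=10)

aas.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
        "ScaleInCooldown": 60, "ScaleOutCooldown": 60})
```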

travis-south avatar Mar 27 '19 07:03 travis-south

I think there are two things in ECS which can help with connection/job draining before an ECS task is stopped.

  • ELB connection draining: If the ECS service is attached to an ELB target group, ECS will ensure the target is drained in the ELB before stopping the task.
  • Task stopTimeout: ECS won't hard-kill the container directly. Instead, it sends a stop signal and waits a configurable amount of time before forcefully killing it. The application can gracefully drain in-flight jobs during shutdown (see the sketch below).
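
For illustration, a minimal sketch of setting stopTimeout per container when registering a task definition (family and image names are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Give the agent container up to 10 minutes after SIGTERM to finish
# in-flight work before it is force-killed.
ecs.register_task_definition(
    family="ci-agent",  # placeholder
    requiresCompatibilities=["EC2"],
    containerDefinitions=[{
        "name": "agent",
        "image": "my-org/ci-agent:latest",  # placeholder
        "memory": 512,
        "essential": True,
        "stopTimeout": 600,  # seconds between SIGTERM and SIGKILL
    }])
```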

Are they able to handle your case? @lox @travis-south

wbingli avatar Mar 28 '19 07:03 wbingli

Thanks @wbingli, is there an option for ELB connection draining for ALBs? I can't seem to find it.

As for the stopTimeout, I'll try this and will give feedback.

Thanks.

travis-south avatar Mar 29 '19 00:03 travis-south

Yeah, stopTimeout looks interesting for my use case too! I was in the process of moving away from Services to ad-hoc Tasks, but that might work.

lox avatar Mar 29 '19 03:03 lox

I don't think stopTimeout is available in CloudFormation yet, or am I missing something? 😃

travis-south avatar Mar 29 '19 03:03 travis-south

I certainly hadn't heard of it before!

lox avatar Mar 29 '19 03:03 lox

@lox @travis-south The documentation says startTimeout and stopTimeout are only available for tasks using Fargate in us-east-2. That's pretty narrow availability! 😄

This parameter is available for tasks using the Fargate launch type in the Ohio (us-east-2) region only and the task or service requires platform version 1.3.0 or later.

whereisaaron avatar Mar 29 '19 04:03 whereisaaron

I see, well, I think I'll resort to ECS_CONTAINER_STOP_TIMEOUT for now to test it.

travis-south avatar Mar 29 '19 05:03 travis-south

@travis-south I think this is the documentation for configuring ELB connection draining: ELB Deregistration Delay. There is no need for any configuration on the ECS service side; it will always respect ELB target draining and stop the task once target draining has completed.
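
For reference, the deregistration delay is a target group attribute rather than an ECS setting; a quick boto3 sketch (the target group ARN is a placeholder):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Give in-flight requests up to 30 seconds to complete after the target
# is deregistered, before ECS stops the task.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/my-tg/abc123",  # placeholder
    Attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "30"}])
```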

The stopTimeout feature is pretty new; it was launched on Mar 7.

As for availability, it should be available in all regions when using the EC2 launch type (agent version 1.26.0+ required). The documentation is misleading when it says "This parameter is available for tasks using the Fargate launch type in the Ohio (us-east-2) region only"; it actually means "For tasks using the Fargate launch type, it is only available in the Ohio (us-east-2) region and requires platform version 1.3.0 or later".

wbingli avatar Mar 29 '19 17:03 wbingli

@wbingli thanks for the explanation. I'll try this. At the moment, my deregistration delay is 10 seconds; I'll try increasing it and see what happens.

travis-south avatar Apr 01 '19 05:04 travis-south

Hi everyone, the stopTimeout parameter is available for both ECS and Fargate task definitions. It controls how long the delay is between SIGTERM and SIGKILL. Additionally, if you combine this with the container ordering feature (also available on both ECS and Fargate), you can control the order of termination of your containers, and the time each container is allowed to take to shut down.
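
As a rough illustration of combining the two (not an official example): a task definition where a proxy sidecar starts before the app and, because of the dependency, is stopped only after the app exits; names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# The app depends on the proxy sidecar, so the proxy starts first and,
# on shutdown, is stopped only after the app has exited. Each container
# gets its own stopTimeout budget.
ecs.register_task_definition(
    family="web-with-proxy",  # placeholder
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "proxy",
            "image": "my-org/proxy:latest",  # placeholder
            "memory": 128,
            "essential": True,
            "stopTimeout": 30,
        },
        {
            "name": "app",
            "image": "my-org/app:latest",  # placeholder
            "memory": 512,
            "essential": True,
            "stopTimeout": 120,  # the app gets more time to drain
            "dependsOn": [{"containerName": "proxy", "condition": "START"}],
        },
    ])
```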

We are in the process of updating ECS/Fargate and CloudFormation docs to reflect the fact that these features are available in all regions where those services are available.

coultn avatar Apr 16 '19 15:04 coultn

How would one disable SIGKILL entirely @coultn? Sometimes tasks might take 30+ minutes to finish.

lox avatar Apr 16 '19 22:04 lox

You can't disable SIGKILL entirely, but you can set the value to a very large number (on ECS; there is a 2 minute limit on Fargate).

coultn avatar Apr 16 '19 22:04 coultn

I tried increasing my deregistration delay to 100 seconds and it made things worse for my case. I receive a lot of 5xx errors during deployments.

travis-south avatar Apr 17 '19 09:04 travis-south

Update: the stop timeout parameter is now documented in CloudFormation (see the release history here: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/ReleaseHistory.html). However, the original issue is about controlling which tasks get selected for termination when a service is scaling in due to a scaling policy action. Per-container stop timeouts can help with that but won't provide a complete solution.

coultn avatar Apr 19 '19 22:04 coultn

This basically brings things up to parity with lifecycle hooks on EC2, so I'd say this pretty much addresses my original issue. Happy to close this out, thanks for your help @coultn.

lox avatar Apr 19 '19 22:04 lox

The stop timeout does provide an ECS equivalent of EC2's termination lifecycle hook. However, ECS is still missing an equivalent of EC2's instance protection, which would allow solving exactly the problem in this issue's title.

Using EC2 instance protection, you can mark some instances in an Auto Scaling group as protected from scale-in. When scaling in, EC2 will only consider instances without scale-in protection enabled for termination. By manipulating the instance protection flag, an application can control exactly which EC2 instances are terminated during scale-in. If ECS added an equivalent "task protection" flag for ECS tasks, problems like the one @lox described would have a straightforward solution: you'd simply set the protection flag on for tasks that are busy, and turn it off when a task is idle. When an ECS service was told to scale in, it would only be allowed to kill tasks with protection turned off.
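
For comparison, this is all it takes on the EC2 side today; a busy worker toggles protection on its own instance (instance id and ASG name are placeholders):

```python
import boto3

autoscaling = boto3.client("autoscaling")

def set_protection(instance_id, protected, asg_name="my-worker-asg"):
    """Toggle scale-in protection for an instance while it is busy."""
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=protected)

# e.g. set_protection("i-0123456789abcdef0", True) before starting a job,
# and set_protection("i-0123456789abcdef0", False) once the worker is idle again.
```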

I've been wrestling with a similar problem recently, and it would be very helpful if AWS would add a "task protection" feature.

ajenkins-cargometrics avatar Jul 30 '19 13:07 ajenkins-cargometrics

Is there a maximum value for stopTimeout?

MaerF0x0 avatar Nov 26 '19 01:11 MaerF0x0

In my case I preferred stopTimeout to be 0 so the container would be killed immediately, but apparently the minimum value is 2 seconds.

What is the reason for not allowing values below 2 seconds? Where can I find documentation on the limits?

shmulikd9 avatar May 11 '20 18:05 shmulikd9

This would be a great addition, since ECS-EC2 tasks are usually run for processes that need to be always up and running. In scenarios where the process cannot be stopped for an hour or more because jobs are still running when SIGTERM is sent, this can mean incomplete work. It would have been wonderful to see a managed solution to this, instead of us having to build an entire architecture around it and maintain the lifecycle of the process ourselves.

kaushikthedeveloper avatar Sep 28 '20 19:09 kaushikthedeveloper

@ajenkins-cargometrics I 100% agree with your suggestion; ECS tasks running jobs should be able to enable "task protection". Can I ask how you would use "instance protection" with EC2, given that in ECS you cannot know where a task will be placed? Or are you enabling "instance protection" from inside the Docker container with the AWS CLI?

Zogoo avatar Oct 05 '20 05:10 Zogoo

Is there any way to work around this? We have a problem scaling down Celery workers, which run on ECS Fargate. When AWS decides to shut down a container, that container can still be running a long-running task, which is then lost. Without this feature, ECS Fargate seems quite useless for workers that run long-running jobs.
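
Until something like task protection exists, about all we can do on Fargate is catch SIGTERM and finish the current job within stopTimeout. A generic (non-Celery) sketch, with hypothetical poll_queue/process helpers standing in for the real queue logic:

```python
import signal
import time

shutting_down = False

def on_sigterm(signum, frame):
    # ECS sends SIGTERM on scale-in; stop taking new jobs but let the
    # current one finish (it must complete within the task's stopTimeout).
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, on_sigterm)

def poll_queue():
    """Hypothetical: fetch the next job from the queue, or None if empty."""
    return None

def process(job):
    """Hypothetical: run the job to completion."""
    pass

while not shutting_down:
    job = poll_queue()
    if job is not None:
        process(job)
    else:
        time.sleep(5)
```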

Bi0max avatar Feb 08 '21 14:02 Bi0max

We are looking into this issue and had a couple of questions. How do you track which tasks are still processing jobs?

There are potentially two ways to address the ECS service scale-in task issue. Option 1: we would have a custom termination policy hook where you can plug in a Lambda function. As part of the service definition, you would add a Lambda ARN to be called during scale-in. ECS would wait for the Lambda to finish (up to a certain time) and return the ordered list of tasks to be evicted. This way you can always control which tasks are removed.

Option 2 is to have a task eviction priority value that can be set on each task. ECS would always remove lower-priority tasks before killing higher-priority ones. This way, during scale-in, your core tasks will always keep running.

We could add default policies to both of these options, such as a default termination policy of oldestTask or newestTask like an ASG. Would any combination of these options work as a solution?
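
To make Option 1 concrete, a purely hypothetical handler sketch; the event and response shapes (candidateTaskArns in, tasksToTerminate out) are my own assumption, since no such hook exists yet:

```python
# Hypothetical Lambda for a custom ECS termination policy (Option 1).

def is_busy(task_arn):
    """Hypothetical: consult our own job-tracking store for this task."""
    return False

def handler(event, context):
    # Assumed event shape: ECS passes candidate task ARNs and expects back
    # the tasks to terminate, in order of preference.
    candidates = event["candidateTaskArns"]
    idle = [t for t in candidates if not is_busy(t)]
    busy = [t for t in candidates if is_busy(t)]
    return {"tasksToTerminate": idle + busy}
```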

dkallakuri avatar Feb 28 '21 02:02 dkallakuri

@dkallakuri, I did not fully understand what you meant by Option 2. Would it be possible to set the eviction priority from boto3, for example? And would it be possible to set the priority in such a way that a container is temporarily not allowed to be scaled in at all?

Currently I am using the following logic for my Celery workers in containers: I run a separate autoscaler script which checks all the Celery workers. If it decides to scale down (based on task queue length), it first unsubscribes a Celery worker from all queues. Then, once the worker has no jobs running, the autoscaler script simultaneously kills the worker process (which causes the ECS task to stop) and updates the corresponding ECS service (reducing the "desired count" of tasks).
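
The last step of that script looks roughly like the sketch below (placeholder cluster/service names); note the race between the two calls, which is part of why it feels fragile:

```python
import boto3

ecs = boto3.client("ecs")

def scale_in_one(task_arn, cluster="celery-cluster", service="celery-workers"):
    """Stop a specific drained worker task and shrink the service by one."""
    current = ecs.describe_services(cluster=cluster, services=[service])
    desired = current["services"][0]["desiredCount"]

    # Reduce the desired count and stop the drained task "simultaneously".
    # There is a race here: ECS may pick a different task to satisfy the
    # lower desired count before stop_task lands, which is exactly the gap
    # a task-protection / eviction-priority feature would close.
    ecs.update_service(cluster=cluster, service=service, desiredCount=desired - 1)
    ecs.stop_task(cluster=cluster, task=task_arn, reason="idle worker scale-in")
```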

This solution seems to work, but it feels hacky and is most probably not fail-proof. It would be good if, instead of the last step (killing the worker in the ECS task and updating the ECS service), it were possible to just set a certain eviction priority. But it should also be possible to set the eviction priority so that some tasks are never killed (even if down-scaling has been required for a long time).

I hope my input helps.

Bi0max avatar Mar 01 '21 19:03 Bi0max

Thank you @Bi0max for the detail. Since ECS does not know the status of the jobs within tasks, I am suggesting the following ability: users can set the termination policy to oldest task, newest task, or custom. In custom mode, you would get a Lambda call with the list of tasks that are candidates for scaling in, and your function would return the specific tasks to be removed. ECS would then remove those tasks. Allowing a task to be set to never be killed could become cumbersome and open up corner cases in the scheduler.

Does what I described above address your use case?

dkallakuri avatar Mar 02 '21 05:03 dkallakuri