
[ECS/Fargate] [request]: Allow stopTimeout to be configured for on-demand tasks

Open ghost opened this issue 5 years ago • 16 comments

Based on this comment: https://github.com/spring-projects/spring-boot/issues/4657#issuecomment-161354811, if we were to implement graceful termination on the application side, it would really help if stopTimeout were configurable beyond its current limit, at least for on-demand Fargate tasks.

Currently the maximum value is 120s, which is not sufficient for all use cases.
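For reference, stopTimeout is set per container in the task definition. A minimal sketch of where the value lives and the current Fargate ceiling (the helper name and validation are illustrative, not part of any AWS SDK):

```python
# Illustrative sketch: stopTimeout in an ECS container definition.
# FARGATE_MAX_STOP_TIMEOUT reflects the 120s cap this issue asks to raise.
FARGATE_MAX_STOP_TIMEOUT = 120  # seconds between SIGTERM and SIGKILL

def build_container_def(name, image, stop_timeout):
    """Return a container definition fragment, rejecting values Fargate refuses."""
    if not 0 <= stop_timeout <= FARGATE_MAX_STOP_TIMEOUT:
        raise ValueError(
            f"stopTimeout must be 0-{FARGATE_MAX_STOP_TIMEOUT}s on Fargate, "
            f"got {stop_timeout}"
        )
    return {"name": name, "image": image, "stopTimeout": stop_timeout}
```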

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request What do you want us to build?

Which service(s) is this request for? This could be Fargate, ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? What outcome are you trying to achieve, ultimately, and why is it hard/impossible to do right now? What is the impact of not having this problem solved? The more details you can provide, the better we'll be able to understand and solve the problem.

Are you currently working around this issue? How are you currently solving this problem?

Additional context Anything else we should know?

Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

ghost avatar Aug 07 '20 16:08 ghost

While implementing graceful termination on the application side, we are facing the same issue.
We have an ECS service with an ALB, Fargate tasks, and autoscaling that increases and decreases the desiredCount.

When an ECS task receives SIGTERM (docker stop), it should get a chance to complete its ongoing work before being forcibly killed.
Unfortunately, for our use case the maximum stopTimeout value of 120s is not enough.

matteomazza91 avatar Mar 24 '21 10:03 matteomazza91

I run long running tasks on ECS and would like to run them in fargate. I would like for them to have a chance to finish which would mean a multi-hour stop timeout.

bryanculbertson avatar Apr 05 '21 21:04 bryanculbertson

The 2-minute limit is too short for many use cases. With this limit in place, AWS Fargate auto scaling is rendered useless for any system that cares about graceful shutdown.

vineetraja avatar Apr 09 '21 08:04 vineetraja

Agreed with the other posters. We have long-running Fargate tasks that listen to a queue for work, and each job has a potentially unbounded processing time. Yes, the message becomes visible on the queue again if the task dies (once the visibility timeout expires), but I'd prefer that the task be allowed to finish and gracefully shut down.

With a 2-minute limit, we've been forced to abandon AWS's default auto scaling and create a Lambda that checks for scaling every minute and sends an HTTP request to containers telling them to shut down, instead of relying on SIGTERM.
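A rough sketch of that workaround, under stated assumptions: the /admin/shutdown endpoint, the event shape, and the pick_tasks_to_drain policy are all hypothetical, not anything AWS provides.

```python
# Hedged sketch of the Lambda-based drain workaround described above.
import urllib.request

def pick_tasks_to_drain(task_ips, desired_count):
    """Choose which running tasks to retire once desired_count drops.

    Naive policy: keep the first desired_count tasks, drain the rest.
    """
    return task_ips[desired_count:]

def request_drain(ip, port=8080):
    # Ask the application (not ECS) to stop taking work and exit when done.
    req = urllib.request.Request(
        f"http://{ip}:{port}/admin/shutdown", method="POST"
    )
    urllib.request.urlopen(req, timeout=5)

def handler(event, context):
    # task_ips would come from ListTasks/DescribeTasks in a real setup.
    task_ips = event["task_ips"]
    desired = event["desired_count"]
    for ip in pick_tasks_to_drain(task_ips, desired):
        request_drain(ip)
```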

RicePatrick avatar May 08 '21 01:05 RicePatrick

Same problem here. This renders Fargate useless for our application. It seems we need to use EC2 instances, where this limit does not exist?

We are paying for running containers even while they are shutting down, so why can't we set the stopTimeout to whatever value we like?

marc-guenther avatar Aug 06 '21 19:08 marc-guenther

Encountering a similar issue. We have tasks that we'd like to drain and shut down more slowly than in two minutes. Currently that means we're migrating away from ECS services and autoscaling, and will need to manage the tasks manually during both deployments and scaling, using RunTask and some scripts to hold it all together.

The inability to extend StopTimeout to 1-2 hours is the only reason the current setup doesn't work for us.

GytisZ avatar Aug 19 '21 10:08 GytisZ

Bumping this! - Such a needed feature.

keirw2022 avatar Oct 12 '21 09:10 keirw2022

StopTimeout should be configurable by users to whatever value they need.

satya-500 avatar Nov 08 '21 08:11 satya-500

StopTimeout can be at most 2 minutes after SIGTERM. For more details, go through this link. But this solution will not work for stateful operations, as it depends on our business logic.

May I know when we can expect a full-fledged solution from AWS?

maddipati-srinivas avatar Nov 23 '21 11:11 maddipati-srinivas

We moved our (long-running) batch processing application from ECS on Fargate to ECS on EC2 so that we could manage the termination behavior and extend it as long as necessary to let our batch jobs complete and drain safely without loss of work. 2 minutes is woefully insufficient. However, this has led to significantly increased Datadog monitoring costs (from ~$1.40/task to $56/task), which cannot be borne in our budget. We'd be happy to keep the tasks on Fargate if the StopTimeout could be extended as long as necessary.

mdomsch-seczetta avatar Jul 19 '22 20:07 mdomsch-seczetta

Yes, we have the exact same problem. I am unable to use ECS for one of our major applications because I need to allow a Fargate task much more than 2 minutes to shut down.

craigify avatar Jan 26 '23 19:01 craigify

https://github.com/aws/containers-roadmap/issues/256#issuecomment-1549434318 notes that ECS Task Scale-in Protection can now be set. However, that does not solve the problem. Protection prevents SIGTERM from reaching a running task, so the 2-minute SIGKILL timer never starts. But it also removes the very signal (SIGTERM) that tells the task to stop picking up new jobs.

Many task servers, such as Sidekiq, can work on multiple jobs simultaneously. While one job is running (and scale-in protection is therefore set), the task could pick up another job waiting in the queue, when we only want it to finish the first job and start no new ones.

Now that ECS Task Scale-in Protection lets us keep a task alive indefinitely, we should similarly be allowed to prevent ECS from sending SIGKILL after 2 minutes.
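For context, the scale-in protection mentioned above can be toggled from inside the task via the ECS agent's task-protection endpoint. A minimal sketch, assuming the documented PUT $ECS_AGENT_URI/task-protection/v1/state interface (the helper names are ours, not AWS's):

```python
import json
import os
import urllib.request

def protection_payload(enabled, expires_in_minutes=60):
    """Body for the agent's task-protection endpoint."""
    return {"ProtectionEnabled": enabled, "ExpiresInMinutes": expires_in_minutes}

def set_task_protection(enabled, expires_in_minutes=60):
    # ECS_AGENT_URI is injected into the container by the ECS agent.
    base = os.environ["ECS_AGENT_URI"]
    req = urllib.request.Request(
        f"{base}/task-protection/v1/state",
        data=json.dumps(protection_payload(enabled, expires_in_minutes)).encode(),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

As described above, this keeps the scheduler from stopping the task, but the task still receives no SIGTERM it could use as a "stop taking new jobs" hint.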

matt-domsch-sp avatar May 16 '23 12:05 matt-domsch-sp