Support task execution timeout (maximum lifetime for a container) in ECS Fargate
Summary
ECS does not currently support a task execution timeout, i.e. a way to automatically stop a task once it has been running for longer than a certain period of time, similar to job timeouts in AWS Batch. The task definition has no parameter to enforce a task/container execution timeout that would automatically stop the container after the set time.
Use-case example from a customer: I have an NLP model training job I want to run in a Fargate container, triggered by a Lambda function. At some point, a bug might be introduced in the training code that causes it to run indefinitely. I don't want those tasks to accidentally pile up and have 50 tasks running for a couple of weeks before we notice; that could have cost implications. Is there a native way to kill a container if it hasn't exited on its own before a certain time?
Can this be considered as a feature request?
Thanks @nitheesha-amzn for submitting this for me! As we discussed in the ticket, a more native approach would be to have AWS Batch support the Fargate launch type. This seems to be kind of a force-fit edge case for ECS.
Moving this over to the containers roadmap as an ECS feature request.
I can see another use case here, as mentioned in https://github.com/aws/containers-roadmap/issues/232
- Applies to both the Fargate and EC2 launch types for ECS, not just Fargate.
- When scheduling tasks using cron-style syntax with CloudWatch Events/EventBridge, you would want to ensure tasks don't run indefinitely. If they did, and you have them set to spawn regularly, you would eventually exhaust cluster resources/service limits, effectively DoS'ing your AWS account.
A problem I'm seeing is a task that is expected to be relatively short-lived (a few hours at most, typically minutes) getting 'stuck' due to some bug and still running days later.
It would be great to have a backstop that kills any job after X hours. With hundreds of tasks in the console, it's hard to find the problem ones.
Would like to stop a bastion host after a period of time.
@adnxn any updates re where this sits on the roadmap? :)
+1
/ping @coultn
Bump! 🥓
What do you think about adding an "essential" container to the task with a `sleep XX` command? When the sleep ends, ECS will stop the task.
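A rough sketch of that idea with boto3, in case it helps: the family name, image URIs, and role ARN below are placeholders, and busybox is just an example of any image that ships a `sleep` binary. Because both containers are marked essential, ECS stops the whole task as soon as the sleep container exits.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="my-batch-job",  # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "job",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-job:latest",  # placeholder
            "essential": True,
        },
        {
            # Watchdog: when this container's sleep exits, ECS stops the
            # whole task because the container is marked essential.
            "name": "max-lifetime",
            "image": "public.ecr.aws/docker/library/busybox:latest",
            "command": ["sleep", "3600"],  # maximum lifetime: one hour
            "essential": True,
        },
    ],
)
```

One caveat: the stopped task won't carry any explicit "timed out" reason; it will just report that an essential container exited.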
Interested to hear any experiences of using AWS Batch to achieve this, despite the objections. Also, see https://github.com/aws/containers-roadmap/issues/232.
I wonder if this can be done through an AWS Config rule. An EventBridge cron rule would also do the same, I guess: run a Lambda function every hour to stop containers started before a certain time (i.e. an hour or more ago).
I have the same issue: I want to stop a container after an hour, but I'm not sure how to do it. I need to do this as part of several different stacks, so cluster and task IDs will be different. It would be best if it were part of the task definition; otherwise I need to create a Config rule to target all the different clusters/tasks.
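In case it helps, a minimal sketch of that scheduled-Lambda idea (the cluster name and the one-hour threshold are assumptions; the handler is meant to be invoked by an hourly EventBridge rule and needs ecs:ListTasks, ecs:DescribeTasks, and ecs:StopTask permissions):

```python
import boto3
from datetime import datetime, timedelta, timezone

ecs = boto3.client("ecs")

CLUSTER = "my-cluster"        # placeholder cluster name
MAX_AGE = timedelta(hours=1)  # stop anything running longer than this


def handler(event, context):
    """Stop RUNNING tasks in the cluster that started more than MAX_AGE ago."""
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    for page in ecs.get_paginator("list_tasks").paginate(
        cluster=CLUSTER, desiredStatus="RUNNING"
    ):
        if not page["taskArns"]:
            continue
        for task in ecs.describe_tasks(cluster=CLUSTER, tasks=page["taskArns"])["tasks"]:
            started = task.get("startedAt")  # timezone-aware datetime
            if started and started < cutoff:
                ecs.stop_task(
                    cluster=CLUSTER,
                    task=task["taskArn"],
                    reason="Exceeded maximum allowed runtime",
                )
```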
@sky4git An AWS Step Functions state machine is one solution for your use case. It can 'monitor' and take action based on a time window. You can also create CloudWatch alarms to monitor failed executions and timeouts.
Here is another use case which would benefit from this requested feature: we run end-to-end tests in an ECS task on Fargate. If a test gets stuck because of a bug, the task could potentially run forever. I haven't found any way to set a CloudWatch alarm on task duration.
I'd love this feature.
Some of our tasks leak memory very slowly. It'd be great to be able to set a maximum task life of ~3 months, to keep the memory leakage small. In general, it seems like a modern best practice to reap your processes fairly early, to not allow very long-lived processes in your systems. It would be great if Fargate ECS could assist with this. We would also love it if regular ECS supported this.
@rektide I think that you could use a Step Functions state machine to set a max time and shut down the ECS task.
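For reference, a sketch of that Step Functions approach using the `ecs:runTask.sync` integration: `TimeoutSeconds` on the state aborts the execution if the task runs too long, and with the .sync pattern Step Functions then attempts to stop the underlying task. The cluster, task definition, subnet, and role ARNs below are placeholders.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# ASL definition: run the Fargate task synchronously and time it out
# if it hasn't finished within an hour.
definition = {
    "StartAt": "RunJob",
    "States": {
        "RunJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "TimeoutSeconds": 3600,
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster",
                "TaskDefinition": "my-batch-job",
                "NetworkConfiguration": {
                    "AwsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="run-task-with-timeout",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEcsRole",  # placeholder
)
```

For scheduled jobs, the state machine (rather than the task itself) can then be the target of an EventBridge schedule.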
+1
Until this is natively implemented in ECS Scheduled Tasks, here are some options you have to implement timeouts:
- Wrap the command of your job container with `timeout` (assuming it's available in the container), e.g. `timeout X mycommand arg1 arg2; STATUS=$?; if [ $STATUS -eq 124 ]; then echo 'Job Timed Out!'; fi; exit $STATUS`.
- Add an essential container to the task definition with the command `sleep X`. When it times out, the whole task exits.
- Use external entities (such as Step Functions) to monitor and stop tasks that exceed a max lifetime.
- Just add a CloudWatch alarm that notifies you when some tasks have run for too long, and stop them manually.
- Use Kubernetes instead of ECS. Seriously, no native timeouts on scheduled tasks?
This would be a great feature!
@TarekAS which metrics did you use to set the CloudWatch alarm?
I don't understand how it's possible that such a basic feature is not available.
cc @ofiliz
This is a way to introduce a timeout for ECS tasks. Feedback welcome.
https://it20.info/2023/03/configuring-a-timeout-for-amazon-ecs-tasks/
@mreferre thanks for sharing! Though home-grown workarounds are always possible and it's nice to see a cost effective one described in your blog, we, and I'm sure many others, will wait for ECS itself to support such timeouts before migrating our applicable workloads over to ECS. Again: thanks for sharing as I'm also confident it will help some others 🚀 !
Nice job. The article could be enhanced by pointing the developer to an article/tutorial teaching how the executable can catch the event/signal for a graceful termination.
Thanks!
Thanks Tommy. Do you mean something like this?
Yes sir, purr…fect!
When running ECS services with many Fargate tasks per service, we want to be sure that new tasks are able to start successfully and stay healthy for a while, before terminating older Fargate tasks. So, just having tasks killed off after a certain time without checking that new tasks can start will cause downtime.
I think maybe tasks can be freshened up by using scheduled auto scaling events. So, scale up and wait a bit for the new tasks to be stable, and then scale down. Hopefully ECS will stop the older tasks first. Result: a new set of fresh tasks.
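A rough sketch of that scheduled-scaling idea with Application Auto Scaling, for anyone who wants to experiment: the service name, counts, and times are made up, the service must already be registered as a scalable target for `ecs:service:DesiredCount`, and ECS does not guarantee that the oldest tasks are the ones stopped on scale-in.

```python
import boto3

aas = boto3.client("application-autoscaling")

RESOURCE_ID = "service/my-cluster/my-service"  # placeholder cluster/service

# Scale out to 8 tasks at 03:00 UTC every day...
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="refresh-scale-out",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 3 * * ? *)",
    ScalableTargetAction={"MinCapacity": 8, "MaxCapacity": 8},
)

# ...then back down to 4 tasks at 03:30 UTC, once the new tasks have had
# time to start and become healthy.
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="refresh-scale-in",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(30 3 * * ? *)",
    ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 4},
)
```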
@larstobi, that's (more or less) how ECS services work natively. When you create a service with `n` tasks in it, a re-deployment will make sure (with a certain amount of knobs/configuration) that your service never goes down. Trying to orchestrate this with standalone RunTask API calls is possible but not easy (especially when there is a configuration that does this for you out of the box).
The timeout problem is more relevant for batch-type workloads, where you launch tasks that you know should take a certain amount of time to complete, and you want to make sure they complete rather than hang indefinitely.