Support task execution timeout (maximum lifetime for a container) in ECS Fargate
Summary
ECS does not currently support a task execution timeout, i.e. a way to automatically stop a task once it has been running for longer than a certain period of time, similar to job timeouts in AWS Batch. The task definition has no parameter to enforce a task/container execution timeout that would automatically stop the container after the set time.
Use-case example from a customer: I have an NLP model training job I want to run in a Fargate container, triggered by a Lambda function. At some point, a bug might be introduced in the training code that causes it to run indefinitely. I don't want those tasks to accidentally pile up and have 50 tasks running for a couple of weeks before we notice; that could have cost implications. Is there a native way to kill a container if it hasn't exited on its own before a certain time?
Can this be considered as a feature request?
Thanks @nitheesha-amzn for submitting this for me! As we discussed in the ticket, a more native approach would be to have AWS Batch support the Fargate launch type. This seems to be kind of a force-fit edge case for ECS.
Moving this over to the containers roadmap as an ECS feature request.
I can see another use case here, as mentioned in https://github.com/aws/containers-roadmap/issues/232
- Applies to both the Fargate and EC2 launch types for ECS, not just Fargate.
- When scheduling tasks using cron-style syntax with CloudWatch Events/EventBridge, you would want to ensure tasks don't run indefinitely. If they did, and you have them set to spawn regularly, you would eventually exhaust cluster resources/service limits, effectively DoS'ing your AWS account.
A problem I'm seeing is a task that is expected to be relatively short-lived (a few hours at most, typically minutes) getting 'stuck' due to some bug and still running days later.
It would be great to have a backstop that kills any job after X hours. With hundreds of tasks in the console, it's hard to find the problem ones.
Would like to stop a bastion host after a period of time.
@adnxn any updates re where this sits on the roadmap? :)
+1
/ping @coultn
Bump! 🥓
What do you think about adding an "essential" container to the task with a `sleep XX` command? When the sleep ends, ECS will stop the task.
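A rough sketch of that idea with boto3, in case it helps: the family name, image URIs, and role ARN below are placeholders, and busybox is just an example of any image that ships a `sleep` binary. Because both containers are marked essential, ECS stops the whole task as soon as the sleep container exits.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="my-batch-job",  # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "job",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-job:latest",  # placeholder
            "essential": True,
        },
        {
            # Watchdog: when this container's sleep exits, ECS stops the
            # whole task because the container is marked essential.
            "name": "max-lifetime",
            "image": "public.ecr.aws/docker/library/busybox:latest",
            "command": ["sleep", "3600"],  # maximum lifetime: one hour
            "essential": True,
        },
    ],
)
```

One caveat: the stopped task won't carry any explicit "timed out" reason; it will just report that an essential container exited.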
Interested to hear any experiences of using AWS Batch to achieve this, despite the objections. Also, see https://github.com/aws/containers-roadmap/issues/232.
I wonder if this can be done through an AWS Config rule. An EventBridge cron rule would also do the same, I guess: run a Lambda function every hour to stop containers started before a certain time (i.e. an hour or more ago).
I have the same issue: I want to stop a container after an hour, but I'm not sure how to do it. I need to do this as part of several different stacks, so cluster and task IDs will be different. It would be best if it were part of the task definition; otherwise I need to create a Config rule to target all the different clusters/tasks.
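In case it helps, a minimal sketch of that scheduled-Lambda idea (the cluster name and the one-hour threshold are assumptions; the handler is meant to be invoked by an hourly EventBridge rule and needs ecs:ListTasks, ecs:DescribeTasks, and ecs:StopTask permissions):

```python
import boto3
from datetime import datetime, timedelta, timezone

ecs = boto3.client("ecs")

CLUSTER = "my-cluster"        # placeholder cluster name
MAX_AGE = timedelta(hours=1)  # stop anything running longer than this


def handler(event, context):
    """Stop RUNNING tasks in the cluster that started more than MAX_AGE ago."""
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    for page in ecs.get_paginator("list_tasks").paginate(
        cluster=CLUSTER, desiredStatus="RUNNING"
    ):
        if not page["taskArns"]:
            continue
        for task in ecs.describe_tasks(cluster=CLUSTER, tasks=page["taskArns"])["tasks"]:
            started = task.get("startedAt")  # timezone-aware datetime
            if started and started < cutoff:
                ecs.stop_task(
                    cluster=CLUSTER,
                    task=task["taskArn"],
                    reason="Exceeded maximum allowed runtime",
                )
```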
@sky4git An AWS Step Functions state machine is one solution for your use case. It can 'monitor' and take action based on a time window. You can also create CloudWatch alarms to monitor failed executions and timeouts.
Here is another use case which would benefit from this requested feature: we run end-to-end tests in an ECS task on Fargate. If a test gets stuck because of a bug, the task could potentially run forever. I haven't found any way to set a CloudWatch alarm on task duration.
I'd love this feature.
Some of our tasks leak memory very slowly. It'd be great to be able to set a maximum task life of ~3 months, to keep the memory leakage small. In general, it seems like a modern best practice to reap your processes fairly early, to not allow very long-lived processes in your systems. It would be great if Fargate ECS could assist with this. We would also love it if regular ECS supported this.
@rektide I think that you could use a Step Functions state machine to set a max time and shut down the ECS task.
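For reference, a sketch of that Step Functions approach using the `ecs:runTask.sync` integration: `TimeoutSeconds` on the state aborts the execution if the task runs too long, and with the .sync pattern Step Functions then attempts to stop the underlying task. The cluster, task definition, subnet, and role ARNs below are placeholders.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# ASL definition: run the Fargate task synchronously and time it out
# if it hasn't finished within an hour.
definition = {
    "StartAt": "RunJob",
    "States": {
        "RunJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "TimeoutSeconds": 3600,
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster",
                "TaskDefinition": "my-batch-job",
                "NetworkConfiguration": {
                    "AwsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="run-task-with-timeout",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEcsRole",  # placeholder
)
```

For scheduled jobs, the state machine (rather than the task itself) can then be the target of an EventBridge schedule.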
+1
Until this is natively implemented in ECS Scheduled Tasks, here are some options you have to implement timeouts:
- Wrap the command of your job container with `timeout` (assuming it's available in the container), e.g. `timeout X mycommand arg1 arg2; STATUS=$?; if [ $STATUS -eq 124 ]; then echo 'Job Timed Out!'; fi; exit $STATUS`.
- Add an essential container to the task definition with the command `sleep X`. When it times out, the whole task exits.
- Use external entities (such as Step Functions) to monitor and stop tasks that exceed a max lifetime.
- Just add a CloudWatch alarm that notifies you when some tasks have run for too long, and stop them manually.
- Use Kubernetes instead of ECS. Seriously, no native timeouts on scheduled tasks?
This would be a great feature!
@TarekAS which metrics did you use to set the CloudWatch alarm?
I don't understand how it's possible that such a basic feature is not available.
cc @ofiliz
This is a way to introduce a timeout for ECS tasks. Feedback welcome.
https://it20.info/2023/03/configuring-a-timeout-for-amazon-ecs-tasks/
@mreferre thanks for sharing! Though home-grown workarounds are always possible and it's nice to see a cost effective one described in your blog, we, and I'm sure many others, will wait for ECS itself to support such timeouts before migrating our applicable workloads over to ECS. Again: thanks for sharing as I'm also confident it will help some others 🚀 !
Nice job. The article could be enhanced by pointing the developer to an article/tutorial teaching how the executable can catch the event/signal for a graceful termination.
Thanks!
Thanks Tommy. Do you mean something like this?
Yes sir, purr…fect!
When running ECS services with many Fargate tasks per service, we want to be sure that new tasks are able to start successfully and stay healthy for a while, before terminating older Fargate tasks. So, just having tasks killed off after a certain time without checking that new tasks can start will cause downtime.
I think maybe tasks can be freshened up by using scheduled auto scaling events. So, scale up and wait a bit for the new tasks to be stable, and then scale down. Hopefully ECS will stop the older tasks first. Result: a new set of fresh tasks.
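A rough sketch of that scheduled-scaling idea with Application Auto Scaling, for anyone who wants to experiment: the service name, counts, and times are made up, the service must already be registered as a scalable target for `ecs:service:DesiredCount`, and ECS does not guarantee that the oldest tasks are the ones stopped on scale-in.

```python
import boto3

aas = boto3.client("application-autoscaling")

RESOURCE_ID = "service/my-cluster/my-service"  # placeholder cluster/service

# Scale out to 8 tasks at 03:00 UTC every day...
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="refresh-scale-out",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 3 * * ? *)",
    ScalableTargetAction={"MinCapacity": 8, "MaxCapacity": 8},
)

# ...then back down to 4 tasks at 03:30 UTC, once the new tasks have had
# time to start and become healthy.
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="refresh-scale-in",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(30 3 * * ? *)",
    ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 4},
)
```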
@larstobi, that's (more or less) how ECS services work natively. When you create a service with `n` tasks in it, a re-deployment will make sure (with a certain amount of knobs/configuration) that your service never goes down. Trying to orchestrate this with standalone RunTask API calls is possible but not easy (especially when there is a configuration that does this for you out of the box).
The timeout problem is more relevant for batch-type workloads, where you launch tasks that you know should take a certain amount of time to complete, and you want to make sure they complete rather than hang indefinitely.