
[ECS] [request]: Daemon tasks need reserved memory/cpu space in order to schedule

Open · brentryan opened this issue 5 years ago · 13 comments

Tell us about your request. What do you want us to build? Currently, when you use DAEMON tasks you can get into situations where the task cannot be scheduled because there isn't enough CPU/memory available on the instance. This is critical when you want to run exactly one DAEMON task per host for things like log aggregation, the Datadog agent, etc.

I think we need something like DAEMON_MEMORY_RESERVATION_MB/DAEMON_CPU_RESERVATION that we can populate to reserve this space so that ECS can still schedule these tasks.

Which service(s) is this request for? ECS

Are you currently working around this issue? How are you currently solving this problem? The only workaround I'm aware of is to ensure your instances have plenty of memory/CPU, which forces you to over-provision your cluster and costs more.
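A narrower way to carve out headroom is the ECS agent's ECS_RESERVED_MEMORY option, which withholds memory from the task pool generically rather than reserving it specifically for the daemon. A minimal sketch of baking it into instance user data with boto3, assuming a hypothetical launch template and cluster name:

```python
# Sketch: withhold some memory from task placement via ECS_RESERVED_MEMORY.
# This is a blunt workaround, not the per-daemon reservation requested here.
import base64
import boto3

ec2 = boto3.client("ec2")

user_data = """#!/bin/bash
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=my-cluster
ECS_RESERVED_MEMORY=512
EOF
"""

# Hypothetical launch template; new instances pick up the reserved memory.
ec2.create_launch_template_version(
    LaunchTemplateName="my-ecs-launch-template",
    SourceVersion="$Latest",
    LaunchTemplateData={"UserData": base64.b64encode(user_data.encode()).decode()},
)
```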

brentryan avatar Sep 29 '19 21:09 brentryan

We are facing this issue as well. When scaling up, a bunch of tasks get scheduled before the daemons, filling up the instance and leaving not enough CPU or memory for the daemon.

We use a daemon for log parsing / forwarding, so this is quite a big issue for us.

zBart avatar Jul 23 '20 15:07 zBart

We are currently working on daemon scheduler enhancements that will resolve the issue as defined above. All customers will get the enhancements out of the box:

  1. ECS will ensure that daemon tasks are the first tasks placed on new ECS container instances, so that monitoring and security agents are launched before the application containers on the instance.
  2. ECS will also reserve the CPU, memory, and ENI resources defined for the daemon task on the instance. This ensures that, in case of a daemon launch failure or during daemon service updates, another task launch does not 'steal' the daemon task's resources and prevent it from running successfully.

Please feel free to provide feedback on the GitHub issue here. Hope this helps!

pavneeta avatar Nov 02 '20 06:11 pavneeta

The above sounds great, @pavneeta!

CpuID avatar Nov 02 '20 07:11 CpuID

@pavneeta Is there any way to add a feature like this:

  1. ECS will move replica tasks from the instance to another one when a new daemon task is deployed and there are not enough resources to run it. Moving them would free up the resources the daemon task needs to run correctly.

What do you think about it?

kapralVV avatar Nov 02 '20 08:11 kapralVV

@kapralVV I suspect this starts to dive into the territory of https://github.com/aws/containers-roadmap/issues/105 ?

CpuID avatar Nov 02 '20 08:11 CpuID

+1 to thinking about what happens when adding a new daemon task to an existing cluster; that seems to be the main case not handled in @pavneeta's update above. This sounds great though!

Edit: If moving tasks off the instance to make room is difficult, I wonder if ECS could mark an instance as unhealthy when there aren't enough resources to run the daemon, drain it, launch a new instance with the new resource reservations, and reschedule there.
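Until something like that exists, the remediation can be approximated by hand. A rough sketch with boto3, assuming an ASG-backed cluster and placeholder identifiers:

```python
# Sketch: drain a container instance that can't fit the daemon, then let the
# Auto Scaling group replace it. All identifiers below are placeholders.
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

cluster = "my-cluster"
container_instance_arn = "arn:aws:ecs:region:account:container-instance/example"
ec2_instance_id = "i-0123456789abcdef0"

# 1. Stop new placements and let services migrate replica tasks away.
ecs.update_container_instances_state(
    cluster=cluster,
    containerInstances=[container_instance_arn],
    status="DRAINING",
)

# 2. Once drained (poll the running task count in practice), terminate the
#    instance so the ASG launches a replacement where the daemon lands first.
autoscaling.terminate_instance_in_auto_scaling_group(
    InstanceId=ec2_instance_id,
    ShouldDecrementDesiredCapacity=False,
)
```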

mwarkentin avatar Nov 02 '20 18:11 mwarkentin

> We are currently working on daemon scheduler enhancements that will resolve the issue as defined above. All customers will get the enhancements out of the box:
>
> 1. ECS will ensure that daemon tasks are the first tasks placed on new ECS container instances, so that monitoring and security agents are launched before the application containers on the instance.
>
> 2. ECS will also reserve the CPU, memory, and ENI resources defined for the daemon task on the instance. This ensures that, in case of a daemon launch failure or during daemon service updates, another task launch does not 'steal' the daemon task's resources and prevent it from running successfully.
>
> Please feel free to provide feedback on the GitHub issue here. Hope this helps!

Is there any update on this feature set (is it still in the pipeline)? I have very little constructive feedback to add beyond "LGTM" for the two points provided; they would completely alleviate our existing issues with daemon services not being present on hosts during aggressive scaling.

There are other suggestions here that take this further into the realm of rebalancing, which I would certainly support and appreciate. But just addressing these first two points in a "dumb" manner at instance initialization (i.e. without worrying about the "adding a new daemon" case above) would be an extremely useful step that delivers immediate value, short of the rebalancing/#105 class of feature requests, which are presumably harder to deliver.

tobypinder avatar May 04 '21 10:05 tobypinder

+1

totojack avatar Apr 06 '22 09:04 totojack

There is another bug (which is hopefully more limited): changing the resource allocation for a daemon service.

If a daemon service is updated to need more memory/CPU, it hits a failure state when the container instance does not have the required capacity left.

As tested, we can monitor and see the old version running, but when the deployment reaches that instance it stops the old version of the daemon and then fails to start the new version.

Possible Solutions:

  • Don't stop the existing version if the new version cannot start
  • Allow the daemon to exceed the capacity allowance to come online
  • (Already mentioned) Mark the container instance as unhealthy and drain its tasks
  • (Already mentioned) Move non-daemon tasks off the container instance (#105)

Curious on further thoughts here.
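In the meantime, one way to catch this failure mode is to watch the daemon service's event stream around a deployment and alert on placement failures. A rough sketch with boto3; the cluster and service names are hypothetical, and the event message wording is not a stable contract, so match it loosely:

```python
# Sketch: scan recent service events for daemon placement failures after a
# deployment. Event message wording varies, so the substring checks are loose.
import boto3

ecs = boto3.client("ecs")

def daemon_placement_failures(cluster="my-cluster", service="my-daemon-service"):
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    if svc["schedulingStrategy"] != "DAEMON":
        return []
    return [
        e["message"]
        for e in svc.get("events", [])
        if "unable to place" in e["message"] or "insufficient" in e["message"]
    ]

if __name__ == "__main__":
    for msg in daemon_placement_failures():
        print(msg)
```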

billalley avatar Apr 08 '22 20:04 billalley

We're having exactly the issue @billalley is describing, where it's impossible to safely change the resource reservations for a daemon service. We ended up working around it with a Lambda subscribed to ECS service task placement failure events, filtering out everything except daemon services, and then draining any container instance where a daemon task failed to be placed. It would be far preferable if the scheduler moved some replica tasks off the container instance to make room for the daemon task.
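For reference, a heavily simplified sketch of that Lambda with boto3; the event detail fields used here (eventName, clusterArn, containerInstanceArns) are assumptions rather than a documented contract, so verify them against the events your account actually receives:

```python
# Sketch of a Lambda that drains instances when a daemon task fails to place.
# The event shape below is assumed; inspect a real "ECS Service Action" event
# before relying on these fields.
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    detail = event.get("detail", {})
    if detail.get("eventName") != "SERVICE_TASK_PLACEMENT_FAILURE":
        return

    cluster = detail["clusterArn"]
    service_arn = (event.get("resources") or [None])[0]
    if not service_arn:
        return

    # Ignore placement failures for replica services; only act for daemons.
    svc = ecs.describe_services(cluster=cluster, services=[service_arn])["services"][0]
    if svc["schedulingStrategy"] != "DAEMON":
        return

    # Drain the affected instances so the ASG replaces them and the daemon
    # is placed first on the fresh capacity. containerInstanceArns is an
    # assumed field; a real implementation may need to look the instances up.
    instance_arns = detail.get("containerInstanceArns", [])
    if instance_arns:
        ecs.update_container_instances_state(
            cluster=cluster,
            containerInstances=instance_arns,
            status="DRAINING",
        )
```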

Nevon avatar Jul 28 '22 13:07 Nevon

This is still really painful for us. Please help.

elijahchancey avatar Dec 22 '22 13:12 elijahchancey

Any update on this?

GreasyAvocado avatar Jul 27 '23 03:07 GreasyAvocado

This is driving me insane.

The last update was 'We are currently working on daemon scheduler enhancements that will resolve the issue', over 3 years ago. Any news?

The docs literally say, 'When you launch a daemon service on a cluster with other replica services, Amazon ECS prioritizes the daemon task. This means that the daemon task is the first task to launch on the instances and the last task to stop. This strategy ensures that resources aren't used by pending replica tasks and are available for the daemon tasks.', which is clearly incorrect.

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html#service_scheduler_daemon

stewartcampbell avatar Feb 05 '24 15:02 stewartcampbell