[ECS] [request]: Prioritize Daemon task scheduling above Replica tasks

Open ericdahl opened this issue 4 years ago • 24 comments

Tell us about your request

When our Replica ECS Services scale up and we launch new ECS Container Instances (from an ASG, scaling on *Reservation metrics), sometimes the Replica tasks are launched on the instance fast enough that our Daemon Services do not have a chance to launch on these new hosts. If these replicas use enough CPU/memory, there may not be room for the Daemon services to run.

For example, we have a few daemon services to collect host-level metrics and forward log files. We want these to run on every host. Periodically we see that hosts have been saturated with Replica tasks and there's no room for the Daemon tasks. This means we lack monitoring and visibility into these hosts.

Which service(s) is this request for? This could be ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Every host should have Daemon tasks provisioned on it, reliably.

Are you currently working around this issue?

We manually go into the ECS console, review all of the (possibly hundreds of) hosts to identify which ones are missing Daemon services, then run "Stop Task" for one of the replica tasks on each such host so that the Daemon tasks have room to launch.
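The manual review described above could be scripted. The following is a minimal sketch, not the author's tooling: it assumes task records shaped like entries of the ECS DescribeTasks response (only `taskArn`, `containerInstanceArn`, `group` and `lastStatus` are read), and the ARNs and the `service:log-forwarder` group name are hypothetical.

```python
from collections import defaultdict


def find_instances_missing_daemon(instance_arns, tasks, daemon_group):
    """Return (missing, stop_candidates).

    missing: container instances with no RUNNING task in daemon_group.
    stop_candidates: for each missing instance, the RUNNING non-daemon
    tasks on it, one of which could be stopped to free up room.
    """
    has_daemon = set()
    replicas = defaultdict(list)
    for task in tasks:
        if task.get("lastStatus") != "RUNNING":
            continue  # ignore pending or stopped tasks
        instance = task["containerInstanceArn"]
        if task.get("group") == daemon_group:
            has_daemon.add(instance)
        else:
            replicas[instance].append(task["taskArn"])
    missing = [arn for arn in instance_arns if arn not in has_daemon]
    return missing, {arn: replicas.get(arn, []) for arn in missing}


# Hypothetical data; real records would come from the ECS API.
instances = ["arn:instance/a", "arn:instance/b"]
tasks = [
    {"taskArn": "arn:task/1", "containerInstanceArn": "arn:instance/a",
     "group": "service:log-forwarder", "lastStatus": "RUNNING"},
    {"taskArn": "arn:task/2", "containerInstanceArn": "arn:instance/a",
     "group": "service:web", "lastStatus": "RUNNING"},
    {"taskArn": "arn:task/3", "containerInstanceArn": "arn:instance/b",
     "group": "service:web", "lastStatus": "RUNNING"},
]
missing, stop_candidates = find_instances_missing_daemon(
    instances, tasks, "service:log-forwarder"
)
print(missing)
print(stop_candidates)
```

In a real cluster the inputs would presumably be gathered with the ListContainerInstances, ListTasks and DescribeTasks API actions, and one of the returned candidates would then be passed to StopTask. Which replica task is safe to stop is left to the operator.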

ericdahl avatar Aug 01 '19 21:08 ericdahl

Just got bitten again by this today (it has happened more than once). Had to go find some tasks to kill off to make room. Then the issue is that the Daemon service wouldn't "retry" quickly enough to fill the void on that host, and other tasks would get binpacked in there...

CpuID avatar Aug 19 '19 21:08 CpuID

@ericdahl @CpuID Thank you so much for your valuable feedback. We are currently in the middle of scoping out the solution to this known problem.

pavneeta avatar Aug 22 '19 23:08 pavneeta

Just ran into this again on a deploy of a daemon service (a log ingest process - filebeat), had to stop a bunch of other tasks to make room...

CpuID avatar Sep 14 '19 04:09 CpuID

Is there a projected timeline for this? Or is this:

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/start_task_at_launch.html

a viable alternative?

efenderbosch avatar Jan 13 '20 16:01 efenderbosch

The ECS agent is run as a docker container and makes it onto every EC2 instance. A workaround therefore would be to start your platform-critical daemon processes without an ECS task definition and instead start them from your user-data.txt.

You'd need a wrapper to ensure the container is always running and restarted if a rudimentary health check fails among other things (like a way to insert secrets to the docker containers).
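As a rough illustration of that idea, here is a minimal sketch that renders such a user-data script. It is not a tested bootstrap: it assumes Docker is already running when the script executes, and the cluster name, image and container name are hypothetical. `ECS_RESERVED_MEMORY` is the agent setting that reserves memory (in MiB) for processes not managed by ECS. Docker's restart policy only reacts to the container exiting, so restarting on a failed health check and injecting secrets would still need the wrapper mentioned above.

```python
import shlex


def render_user_data(cluster, image, name, memory_mib):
    """Render a bash user-data script that starts one daemon container
    outside of ECS and reserves its memory in the ECS agent config."""
    lines = [
        "#!/bin/bash",
        # Join the cluster and tell the agent not to hand this memory to tasks.
        f"echo ECS_CLUSTER={shlex.quote(cluster)} >> /etc/ecs/ecs.config",
        f"echo ECS_RESERVED_MEMORY={memory_mib} >> /etc/ecs/ecs.config",
        # Restart the container whenever it exits, unless it was stopped by hand.
        "docker run --detach --restart unless-stopped "
        f"--name {shlex.quote(name)} --memory {memory_mib}m {shlex.quote(image)}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values for illustration only.
user_data = render_user_data(
    cluster="my-cluster",
    image="example/log-forwarder:1.0",
    name="log-forwarder",
    memory_mib=256,
)
print(user_data)
```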

I sure do wish the ecs-agent team would prioritise this as it seems pretty obvious that Daemons are important processes.

edify42 avatar Jan 27 '20 23:01 edify42

The ECS agent is run as a docker container and makes it onto every EC2 instance. A workaround therefore would be to start your platform-critical daemon processes without an ECS task definition and instead start them from your user-data.txt.

Yeah, it's possible to start things with priority outside of the ECS ecosystem, but you then need to reserve resources in the ECS agent, and deploys of new versions of daemon services are still a PITA (you need to replace the EC2 instances instead of just doing a daemon service deploy). A step backwards overall ;) (probably what we all did before daemon services were a thing)

I sure do wish the ecs-agent team would prioritise this as it seems pretty obvious that Daemons are important processes.

+100 - the ideal goal here

CpuID avatar Jan 27 '20 23:01 CpuID

@pavneeta any updates on where this is at?

mwarkentin avatar Aug 06 '20 20:08 mwarkentin

+1 for this one

davinod avatar Sep 17 '20 00:09 davinod

+1 We had to wrestle with this on a dense ECS cluster today, would be great to see this baked in!

ctcherry avatar Sep 17 '20 00:09 ctcherry

Hi, guys! Any updates on this topic? It is still relevant. Thanks!

akrymets avatar May 24 '21 14:05 akrymets

Hi @akrymets, thanks for the comment!

Here is the latest update announcement we published in May regarding ECS daemon scheduling improvements!

https://aws.amazon.com/blogs/containers/improving-daemon-services-in-amazon-ecs/

toricls avatar May 24 '21 15:05 toricls

Any updates on this? It still occurs quite regularly.

alexpcoleman avatar Jan 21 '22 17:01 alexpcoleman

Hey @toricls

I was wondering if there are any updates on this. Whilst the improvements you mentioned above are great to hear, we are looking for complete reliability in the placement of Daemon tasks.

To rely on Daemon tasks to, for example, deploy logging agents alongside application tasks across hundreds, potentially thousands, of instances, we would need a very high level of confidence that the Daemon tasks will always be placed.

Thanks!

connorcartwright avatar Mar 04 '22 12:03 connorcartwright

Hey @connorcartwright, thanks for your interest and feedback on this!

We know how important this is for customers who want to use Daemon tasks to achieve reliable operations, and we have been having ongoing conversations on this topic in the team, including last week.

Unfortunately there's nothing concrete I can share about progress at this moment, but we will update you as soon as we have meaningful progress on this!

toricls avatar Mar 07 '22 01:03 toricls

I'm curious as to whether the following could be a workaround for this:

  • Add an attribute to every ECS instance when it is started.
  • Configure all services with placement constraints that do not allow them to run on instances with said attribute.
  • Allow DAEMON tasks to run on the instances with said attribute. The DAEMON task would then remove the attribute from the instance once it is ready and initialized, thereby untainting it so that services can run.

The con here is that this requires your task to access the AWS API, and I'm not sure how it would affect scaling / capacity providers if there is a scale-out event.
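The three steps above could be sketched as follows. This is an untested sketch under assumptions: the attribute name `daemons-pending` is hypothetical, the `not_exists` operator is taken from the ECS cluster query language, and the functions only build the config line and request parameters rather than calling AWS. How the daemon discovers its own container instance ARN is left open.

```python
import json

TAINT_ATTRIBUTE = "daemons-pending"  # hypothetical attribute name


def agent_config_line(attribute=TAINT_ATTRIBUTE):
    """Step 1: line for /etc/ecs/ecs.config so the instance registers tainted."""
    value = json.dumps({attribute: "true"}, separators=(",", ":"))
    return "ECS_INSTANCE_ATTRIBUTES=" + value


def replica_placement_constraint(attribute=TAINT_ATTRIBUTE):
    """Step 2: constraint for every replica service (placementConstraints entry)."""
    return {"type": "memberOf", "expression": f"attribute:{attribute} not_exists"}


def untaint_request(cluster, container_instance_arn, attribute=TAINT_ATTRIBUTE):
    """Step 3: keyword arguments for the ECS DeleteAttributes call that the
    daemon task would make once it is ready."""
    return {
        "cluster": cluster,
        "attributes": [
            {
                "name": attribute,
                "targetType": "container-instance",
                "targetId": container_instance_arn,
            }
        ],
    }


print(agent_config_line())
print(replica_placement_constraint())
print(untaint_request("my-cluster", "arn:instance/a"))
```

With boto3, the last dictionary would presumably be passed as `ecs_client.delete_attributes(**untaint_request(...))`, which is the API access the comment above mentions as a drawback.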

ianvernon avatar Apr 26 '22 05:04 ianvernon

Hello team, do we have a release timeline for this feature request?

sdpoueme avatar Jun 25 '22 14:06 sdpoueme

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html#service_scheduler_daemon

states:

Amazon ECS reserves container instance compute resources including CPU, memory, and network interfaces for the daemon tasks. When you launch a daemon service on a cluster with other replica services, Amazon ECS prioritizes the daemon task. This means that the daemon task is the first task to launch on the instances and the last task to stop. This strategy ensures that resources aren't used by pending replica tasks and are available for the daemon tasks.

Have there been any recent changes that address this issue?

sopeters avatar Aug 26 '22 16:08 sopeters

Just encountered this issue myself, so clearly still an issue. Is there an update regarding this issue? The documentation does not match the behavior, so this no longer seems like a request and should now be considered a bug.

turacma avatar May 05 '23 20:05 turacma

As someone who regularly encounters this issue, I am looking forward to moving workloads to EKS, where I can use priority classes. The scheduling calculation needed to ensure daemonset tasks are scheduled first, as in the literal request, does seem arduous, but I'm OK with evicting lower-priority replica tasks whenever the scheduler gets around to placing daemonset tasks.

immanetize avatar Nov 17 '23 23:11 immanetize

We still encounter this regularly, especially during large scale-up events.

mcfadden avatar Dec 05 '23 13:12 mcfadden

We encounter this during scale up events (especially when there are also daemon service updates to deploy in the same change set). Would appreciate the ECS team responding if possible.

pwrmiller avatar Dec 13 '23 10:12 pwrmiller

This has been open for over 4 years. Really, how is this still an issue? Is ECS a deprecated product?

balexx avatar Jan 04 '24 16:01 balexx

This is not only a memory issue. Some containers NEED the daemons to be present and running, for instance logging daemons. The daemons must be brought up before regular tasks.

mlanett avatar Apr 18 '24 17:04 mlanett

I am facing the same issue. We got around it using the workaround below, but would love for AWS to resolve this issue to reduce the amount of ECS configuration (i.e. complexity) needed.

Workaround: for every application ECS service, we added an ECS Service placement constraint of "memberOf (task:group == service:deamonServiceName)" so that the Daemon service must start a task on the ECS instance before any application tasks are assigned to it.
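Expressed as service configuration, that constraint might look like the sketch below. It only builds the `placementConstraints` entry; the daemon service name `log-forwarder` is hypothetical, and the behaviour during scale-out has not been verified here.

```python
def daemon_affinity_constraint(daemon_service_name):
    """Placement constraint that only allows instances already running a
    task from the given daemon service (task group "service:<name>")."""
    return {
        "type": "memberOf",
        "expression": f"task:group == service:{daemon_service_name}",
    }


# Hypothetical daemon service name; one entry like this per application service.
placement_constraints = [daemon_affinity_constraint("log-forwarder")]
print(placement_constraints)
```

The resulting list would be supplied as the `placementConstraints` parameter when creating or updating each application service.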

emorneau avatar Apr 30 '24 11:04 emorneau