
[ECS] [request]: Add Readiness Checks

Open dastbe opened this issue 2 years ago • 9 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

Today in ECS, any health check failure results in the immediate termination of the task. While this is theoretically desirable, in practice it can exacerbate outages rather than help. For example, a momentary blip in health across the fleet can lead to a minutes-long rotation of all tasks which, depending on what path ECS takes, can be disruptive to customers. Additionally, task replacement is a comparatively expensive process, requiring various provisioning systems to be online.

Contemporaries like Kubernetes have opted to make a distinction between "liveness" (whether a task should be kept running or replaced) and "readiness" (whether a task should receive traffic). Having the ability to encode readiness checks as distinct from what exists today would help service owners configure their services to isolate tasks during periods of temporary instability without a full-blown replacement.
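
For reference, what exists today is a single per-container health check whose repeated failure marks the task UNHEALTHY and gets it replaced; there is no separate "stop routing traffic but keep the task running" state. A minimal CDK (TypeScript) sketch of that existing configuration, with the image, endpoint, and timings as placeholder assumptions:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'ReadinessExampleStack');

const taskDef = new ecs.FargateTaskDefinition(stack, 'TaskDef');

// ECS currently exposes one health check per container. Once it fails
// `retries` times in a row the task is marked UNHEALTHY and the service
// replaces it; there is no readiness-style "quarantine" option.
taskDef.addContainer('app', {
  image: ecs.ContainerImage.fromRegistry('public.ecr.aws/nginx/nginx:latest'), // placeholder image
  healthCheck: {
    command: ['CMD-SHELL', 'curl -fs http://localhost/health || exit 1'], // assumed /health endpoint
    interval: cdk.Duration.seconds(10),
    retries: 3,
    startPeriod: cdk.Duration.seconds(30), // failures during this window don't count
  },
});
```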

Caveat: one interesting aspect of this request is that some systems ECS integrates with have their own readiness checking built in, e.g. ELB. Any such readiness feature should also allow changing how ELB's health signals are treated by ECS.

Which service(s) is this request for? ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

See above

Are you currently working around this issue?

Not particularly. This is a fundamental behavior of ECS and not one you can easily work around.

Additional context Anything else we should know?

Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

dastbe avatar Mar 04 '22 07:03 dastbe

I imagine most people who read this are well aware of the reasons why this is desirable, but another example: restarting an ECS task can have other side effects, such as an ECR image download and ENI allocation. If liveness is fine but readiness is not, restarting ECS tasks adds a lot of noise, and sometimes real cost, to a system without any benefit. The task is live, and restarting it doesn't change that.

If you're using ECS, try to make sure your ECS health checks are as close to a pure liveness check as possible -- that they do not rely on external resources. That doesn't mean that your service is "ready", but it's better than having ECS assume a task needs to be replaced because a dependency is down.
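
To illustrate that advice, here is a minimal liveness-only endpoint in TypeScript (Node's built-in http module; the path and port are assumptions). It confirms only that the process is up and able to answer, and deliberately avoids calling databases or downstream services, so a dependency outage can't cause ECS to recycle the task:

```typescript
import { createServer } from 'node:http';

// Liveness-only check: returns 200 as long as the process is running and the
// event loop can serve the request. No external dependencies are touched.
const server = createServer((req, res) => {
  if (req.url === '/healthz') {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('ok');
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(8080, () => console.log('liveness endpoint listening on :8080'));
```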

geoffreywiseman avatar Jan 09 '23 23:01 geoffreywiseman

https://github.com/aws/containers-roadmap/issues/1270 <- potentially related: "Target Group starts making healthchecks from the moment target is registered. Hence even if service has HealthCheckGracePeriod configured, it is possible that target is marked Unhealthy at Target group and UnhealthyHost Count metric is updated."

"As a workaround, i can change the parameters of my health check to be either less frequent or require more consecutive failed checks before calling a target unhealthy, but that means that the health check will be that much slower at detecting any actual issues. Seems bizarre to me that the ECS health check has a grace period concept built in but the target group health checks don't."

fierlion avatar Feb 22 '23 22:02 fierlion

My team and I have been looking forward to this feature for a long time. I totally agree with the other commenters, and I think the least the ECS team could give us is a toggle that allows ignoring ELB health checks and taking no action on them. This would bring so much relief for teams that develop API services, because ELB by design does not deregister and kill unhealthy targets; it just stops routing traffic to them.

Speaking separately about the current ECS flow for mitigating unhealthy targets: it is strange to me that it does not trigger a rolling update, but instead just stops the unhealthy task and only then deals with Running count != Desired count.

nhlushak avatar Sep 05 '23 14:09 nhlushak

Time to move to lambdas.

allenbrubaker avatar Sep 19 '23 19:09 allenbrubaker

We have shipped an improvement to the ECS scheduler that prioritizes starting new healthy tasks before killing tasks that were marked unhealthy. You can read more in this What's New post: https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-ecs-applications-resiliency-unpredictable-load-spikes/ Blog post with a deep dive: https://aws.amazon.com/blogs/containers/a-deep-dive-into-amazon-ecs-task-health-and-task-replacement/

genbit avatar Nov 03 '23 17:11 genbit

@genbit The ECS improvements you linked are definitely great, but the lack of proper "readiness" still makes the warm-up phase quite cumbersome. I skimmed through the documentation but couldn't find anything related.

Our application requires 30 to 60 s to fully warm up (fetch content, process it, warm the local cache). We don't want any request to land on an instance that isn't fully warmed up, while at the same time our health check indicates whether the instance is in a correct state. To achieve this we resorted to adding an artificial delay via healthCheckGracePeriod (a CDK property of ApplicationLoadBalancedFargateService) as well as tweaking the health check's healthyThresholdCount and unhealthyThresholdCount properties (roughly as sketched below).

This, however, doesn't prevent requests from reaching instances that are still warming up, causing requests to wait (cache locking) and perhaps time out when the client expects a faster response.
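
A minimal CDK (TypeScript) sketch of that workaround; the image, health check path, and durations are placeholder assumptions, not our actual values:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'WarmupWorkaroundStack');

const service = new ecsPatterns.ApplicationLoadBalancedFargateService(stack, 'Service', {
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry('public.ecr.aws/nginx/nginx:latest'), // placeholder image
    containerPort: 80,
  },
  // Artificial delay so ELB health check failures during the 30-60 s warm-up
  // do not cause ECS to kill the task before it finishes loading its cache.
  healthCheckGracePeriod: cdk.Duration.seconds(90),
});

// Threshold tuning: tolerate several failed checks while warming up, but flip
// back to healthy after a couple of successful ones.
service.targetGroup.configureHealthCheck({
  path: '/health',
  healthyThresholdCount: 2,
  unhealthyThresholdCount: 5,
  interval: cdk.Duration.seconds(15),
});
```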

It would be beneficial to see a differentiation of readiness and liveness. (However, I have to admit we aren't considering taking on the overhead of K8s.)

wdolek avatar Jan 07 '24 21:01 wdolek

+1 to this request.

Killing tasks that cannot handle more traffic because they are busy with other (possibly slow) requests, rather than because they are dead, is far from ideal and only exacerbates the problem.

JustinReshop avatar Feb 07 '24 03:02 JustinReshop

+1 to this request

fahd-sainsburys avatar Jul 05 '24 13:07 fahd-sainsburys

+1 to this request

velhoi avatar Jul 17 '24 15:07 velhoi