
[ECS] : Registration Delay for addition in Target Group

Open · tarun-wadhwa-mmt opened this issue 4 years ago · 9 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request. What do you want us to build? Adding support for specifying a delay before registering tasks in the target group (registration_delay).

Which service(s) is this request for? This could be Fargate, ECS, EKS, ECR

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? The target group starts running health checks the moment a target is registered. So even if the service has HealthCheckGracePeriod configured, the target can be marked unhealthy at the target group and the UnhealthyHostCount metric updated. Alerting based on unhealthy hosts then fires and creates false alerts in our systems. It also hurts when a user is back-tracking an issue and has to correlate unhealthy-host events with container restarts or deployments.

Adding a registration delay would help reduce this unnecessary noise in the system.
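For context, the closest existing knob is the service-level grace period, which only keeps ECS from acting on failed load balancer health checks; as described above, the target group itself still probes the target and updates UnhealthyHostCount. A minimal boto3 sketch, with cluster and service names as placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# healthCheckGracePeriodSeconds makes ECS ignore failed load balancer health
# checks for newly started tasks during this window, but the target group
# still runs its own checks and reports the target as unhealthy in its metrics.
ecs.update_service(
    cluster="my-cluster",    # placeholder
    service="my-service",    # placeholder
    healthCheckGracePeriodSeconds=300,
)
```

A registration delay (or registering only once the container is healthy) would instead keep a still-initializing target out of the target group entirely.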

tarun-wadhwa-mmt · Feb 12 '21 08:02

This is information I couldn't find in the AWS docs, and even AWS support was pretty unhelpful.

When I asked about my unhealthy containers, I kinda got a very polite way of saying 'well, it doesn't seem to hurt, so feel free to ignore it'.

I'd ask to at least have this documented. If it is, please give me the link, as I couldn't find it on any of the obvious pages.

cintiadrrezdy · Sep 29 '21 23:09

The problem in my case is that failed target group health checks are set up to stop the service's tasks and boot up replacements, under the assumption that something unrecoverable has happened in the task. Now that I have a task that takes longer between startup and being ready to receive traffic, the service gets caught in a loop: it declares the still-initializing tasks unhealthy, marks them for shutdown, and then boots up new tasks that it will declare unhealthy in turn, ad infinitum.

As a workaround, I can change my health check parameters to either check less frequently or require more consecutive failed checks before calling a target unhealthy, but that means the health check will be that much slower at detecting actual issues. It seems bizarre to me that the ECS health check has a grace period concept built in but the target group health checks don't.
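A sketch of that workaround, assuming the standard Elastic Load Balancing v2 API (the target group ARN and values are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Stretch the target group health check so a slow-starting task is not marked
# unhealthy before it is ready. The trade-off is that a genuine failure can now
# take up to interval * threshold seconds to detect.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:region:account:targetgroup/example/abc123",  # placeholder
    HealthCheckIntervalSeconds=60,
    UnhealthyThresholdCount=5,
)
```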

kulinsj · May 09 '22 21:05

Better than a delay would be to register with the target group only after ECS's health check considers the container healthy, when one is defined in the task definition.

So if a health check is defined on the task:

  • wait for the healthy status on ECS before registering it with the target group; otherwise
  • keep the same behavior as today (maybe a delay parameter fits here)

My problem today:

My container takes about 250 s to be up and running; to make it work today I had to increase the health check "interval" to a high number.

If I set the interval to 100 s with a threshold of 3, the timeline is:

  • try 1: 0 s - fail
  • try 2: 100 s - fail
  • try 3: 200 s - fail
  • result: Unhealthy

If I set the interval to 150 s with a threshold of 3, the timeline is:

  • try 1: 0 s - fail
  • try 2: 150 s - fail
  • try 3: 300 s - success
  • try 4: 450 s - success
  • try 5: 600 s - success
  • result: Healthy

Problem: it takes 10 minutes to have a container working on the ALB.

vendrusculo · Apr 19 '23 12:04
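For reference, the container health check that this suggestion would key off is already part of the task definition, including a startPeriod during which failed checks are not counted. A minimal boto3 sketch for a container that needs roughly 250 s to start (family, container name, image, and values are illustrative):

```python
import boto3

ecs = boto3.client("ecs")

# ECS evaluates this container-level health check itself and only reports the
# task as HEALTHY once it passes; startPeriod (max 300 s) gives a slow-starting
# container time before failed checks count against retries.
ecs.register_task_definition(
    family="slow-starting-app",          # placeholder
    containerDefinitions=[
        {
            "name": "app",                          # placeholder
            "image": "my-registry/app:latest",      # placeholder
            "memory": 512,
            "healthCheck": {
                "command": ["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
                "interval": 30,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 300,
            },
        }
    ],
)
```

Gating target group registration on this status, as suggested above, would avoid the ALB ever probing a target that is still initializing.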

How are you able to track this? I also need to find out which checks failed and which passed for each instance in my target group.

Parth909 · Sep 12 '23 14:09

Our application is Java Spring, and we were able to set up a filter on the /health endpoint that logs all requests even before the application is ready; that's how we observed this behavior.

vendrusculo · Sep 12 '23 18:09

We are forced to buy twice the capacity we really need (at double the cost, of course) just to be able to pass the startup checks and still have ECS and the ALB detect failures within a reasonable time.

code-and-such · May 13 '24 17:05

Any updates on this from the AWS team? It's been over 4 years and this is still an annoying limitation to work around, especially for ECS tasks that take more than a minute or so to spin up and start serving requests...

bjiusc · Apr 15 '25 22:04