`timeout` option ignored when containers start but container health checks fail
Hey :wave:
Summary
When updating a running service (one behind an ELB, in my case), if a container health check command in the Task fails after the new Tasks have started, the process keeps retrying for a long time (unknown how long) despite the --timeout flag being set.
Description
ecs-cli version 1.16.0 (5c02c34) - Linux
Seen while executing an ecs-cli compose service up command, but may present itself elsewhere, too.
Conditions are:
- previously deployed ECS Service is running
- an update to the service introduces a bad health check command in the Task Definition, which causes container health checks to fail, so the new Tasks are stopped again by ECS:
INFO[3219] (service <snip>) (task <snip>) failed container health checks. timestamp="2019-11-22 13:05:12 +0000 UTC"
- the deployment then continues in a loop: a new task is added, the task fails container health checks, the task is removed, repeat
We have --timeout set to 8. However, going by the usage of --timeout:
--timeout value Specifies the timeout value in minutes (decimals supported) to wait for the running task count to change. If the running task count has not changed for the specified period of time, then the CLI times out and returns an error. Setting the timeout to 0 will cause the command to return without checking for success. (default: 5)
...it seems to me that this situation does not apply, because new tasks are successfully started (runningCount=2 in the log) but then fail their container health checks. So maybe --timeout doesn't consider this case a "timeout-able" offence, because runningCount keeps changing.
The command:
ecs-cli compose \
--cluster-config <config name> \
--project-name <project name> \
--file <path to docker-compose.yml> \
--ecs-params <path to ecs-params.yml> \
service \
--region <aws region> \
--target-group-arn <tg arn> \
--container-name <tg container name> \
--container-port <tg container port> \
up \
--timeout 8
Config files
Valid docker-compose.yml and ecs-params.yml that previously deployed successfully, plus the addition of a container health check that results in an UNHEALTHY health state.
Expected Behavior
When a new Task in a Service is started, but immediately fails container-level health checks, the ecs-cli [...] up command should exit (non-zero) after --timeout is exceeded.
The behaviour should be the same as in a case where some problem prevents the Service runningCount from changing, e.g. if the specified docker image does not exist.
Observed Behavior
INFO[0000] Using ECS task definition TaskDefinition="[ ✂️ ]:18"
INFO[0000] Updated the ECS service with a new task definition. Old containers will be stopped automatically, and replaced with new ones desiredCount=1 force-deployment=false service=[ ✂️ ]
INFO[0000] Service status desiredCount=1 runningCount=1 serviceName=[ ✂️ ]
INFO[0015] (service [ ✂️ ]) has started 1 tasks: (task [ ✂️ ]). timestamp="2019-11-22 12:11:49 +0000 UTC"
INFO[0060] Service status desiredCount=1 runningCount=2 serviceName=[ ✂️ ]
INFO[0060] (service [ ✂️ ]) registered 1 targets in (target-group [ ✂️ ]) timestamp="2019-11-22 12:12:40 +0000 UTC"
INFO[0105] Service status desiredCount=1 runningCount=1 serviceName=[ ✂️ ]
INFO[0105] (service [ ✂️ ]) (task [ ✂️ ]) failed container health checks. timestamp="2019-11-22 12:13:20 +0000 UTC"
INFO[0105] (service [ ✂️ ]) has stopped 1 running tasks: (task [ ✂️ ]). timestamp="2019-11-22 12:13:20 +0000 UTC"
INFO[0166] (service [ ✂️ ]) has started 1 tasks: (task [ ✂️ ]). timestamp="2019-11-22 12:14:20 +0000 UTC"
INFO[0181] (service [ ✂️ ]) deregistered 1 targets in (target-group [ ✂️ ]) timestamp="2019-11-22 12:14:29 +0000 UTC"
INFO[0211] Service status desiredCount=1 runningCount=2 serviceName=[ ✂️ ]
INFO[0211] (service [ ✂️ ]) registered 1 targets in (target-group [ ✂️ ]) timestamp="2019-11-22 12:15:08 +0000 UTC"
INFO[0256] Service status desiredCount=1 runningCount=1 serviceName=[ ✂️ ]
INFO[0256] (service [ ✂️ ]) (task [ ✂️ ]) failed container health checks. timestamp="2019-11-22 12:15:48 +0000 UTC"
INFO[0256] (service [ ✂️ ]) has stopped 1 running tasks: (task [ ✂️ ]). timestamp="2019-11-22 12:15:48 +0000 UTC"
INFO[0286] (service [ ✂️ ]) deregistered 1 targets in (target-group [ ✂️ ]) timestamp="2019-11-22 12:16:27 +0000 UTC"
[... repeats like this for 1 hour, until the IAM session expired ...]
INFO[3430] (service [ ✂️ ]) has started 1 tasks: (task [ ✂️ ]). timestamp="2019-11-22 13:08:41 +0000 UTC"
INFO[3490] Service status desiredCount=1 runningCount=2 serviceName=[ ✂️ ]
INFO[3490] (service [ ✂️ ]) registered 1 targets in (target-group [ ✂️ ]) timestamp="2019-11-22 13:09:40 +0000 UTC"
INFO[3520] Service status desiredCount=1 runningCount=1 serviceName=[ ✂️ ]
INFO[3520] (service [ ✂️ ]) (task [ ✂️ ]) failed container health checks. timestamp="2019-11-22 13:10:10 +0000 UTC"
INFO[3520] (service [ ✂️ ]) has stopped 1 running tasks: (task [ ✂️ ]). timestamp="2019-11-22 13:10:10 +0000 UTC"
INFO[3565] (service [ ✂️ ]) deregistered 1 targets in (target-group [ ✂️ ]) timestamp="2019-11-22 13:11:00 +0000 UTC"
ERRO[3580] Error describing service error="ExpiredTokenException: The security token included in the request is expired\n\tstatus code: 400, request id: [ ✂️ ]" service=[ ✂️ ]
FATA[3580] ExpiredTokenException: The security token included in the request is expired
it seems to me that this situation does not apply, because new tasks are successfully started (runningCount=2 in the log) but then fail their container health checks. So maybe --timeout doesn't consider this case a "timeout-able" offence, because runningCount keeps changing.
This is correct. In your case, the expected number of tasks do successfully launch. The goal of the timeout was to wait until all desired tasks successfully launch. Depending on your container health check settings, it could take some time for them to become unhealthy. You'd need an additional timeout after the deployment succeeded to make sure it stayed healthy.
I'm changing this to a feature request, because this ask was not the intention behind the timeout flag. I think instead you could have a new flag like --wait-for-healthy.
Side note, you can see the health of tasks in your service with ecs-cli compose service ps.
not the intention behind the timeout flag
Yeah this makes sense - the stage of the deployment when timeout might get involved is earlier than the issue I'm seeing.
Before I start diving into the code too much myself, could you explain what the command is actually doing at this stage? Will it just wait forever until the service is stable again?
Adding a --wait-for-healthy definitely sounds more like what's needed here... but it would imply that without it, the command would exit regardless of the health state of the containers. That's not what happens: the biggest pain point for me is that the command doesn't seem to exit at all; there's no way to set an overarching timeout for the whole operation.
Will it just wait forever until the service is stable again?
So right now the command sits in a while loop, periodically checking the service state (roughly sketched below). It exits when either:
- The desired count and running count are equal
- The timeout has been reached
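In Go-ish pseudocode, combining that behaviour with the --timeout usage text quoted earlier, the wait is something like the following. This is a simplified model, not the actual ecs-cli source, and every name in it is made up for illustration:

```go
// Sketch only: a simplified model of the service wait, not the real code.
package sketch

import (
	"errors"
	"time"
)

// serviceState stands in for the fields the CLI reads from ECS DescribeServices.
type serviceState struct {
	DesiredCount int64
	RunningCount int64
}

// waitForServiceStable polls the service via describe() until it looks stable
// or the running count stops changing for longer than the --timeout value.
func waitForServiceStable(describe func() (serviceState, error), timeout time.Duration) error {
	lastRunning := int64(-1)
	lastChange := time.Now()
	for {
		svc, err := describe()
		if err != nil {
			return err
		}
		// Exit condition 1: the service looks stable. (The real check has
		// extra conditions beyond this, e.g. around which deployment is active.)
		if svc.RunningCount == svc.DesiredCount {
			return nil
		}
		// Per the --timeout usage text, the clock is measured from the last
		// time the running count changed, not from the start of the command.
		if svc.RunningCount != lastRunning {
			lastRunning = svc.RunningCount
			lastChange = time.Now()
		}
		// Exit condition 2: the running count hasn't changed for --timeout.
		if time.Since(lastChange) > timeout {
			return errors.New("timed out waiting for the running task count to change")
		}
		time.Sleep(15 * time.Second) // poll interval
	}
}
```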
the biggest pain point for me is that the command doesn't seem to exit at all; there's no way to set an overarching timeout for the whole operation.
Do you mean the ECS CLI command? That sounds wrong to me. I wrote the timeout code more than 2 years ago so my memory is fuzzy... but it should always quit eventually, either you specify a timeout or you get a default one.
If the command gets stuck and never quits... that's a bug.
As far as implementing --wait-for-healthy goes, you'd still need to wait for the desired count and running count to match. You'd probably want to keep the existing timeout in place in case the deployment fails at that stage. Then, once all tasks are running, you'd implement another, very similar timeout that would check the health status. If I remember correctly, the health status should initially be "UNKNOWN" and then should move to "HEALTHY". So you could have an additional timeout period in which you try to wait for all tasks to reach the healthy state.
The code for the existing timeout is here: https://github.com/aws/amazon-ecs-cli/blob/master/ecs-cli/modules/cli/compose/entity/service/service_helper.go#L109
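To make that concrete, here's a minimal sketch of what the second wait might look like. Everything below is hypothetical; none of these names exist in the CLI today:

```go
// Sketch only: a possible shape for a --wait-for-healthy wait that would run
// after the existing running-count wait succeeds.
package sketch

import (
	"errors"
	"time"
)

// waitForHealthy polls task health (e.g. the HealthStatus field returned by
// ECS DescribeTasks) until every task reports HEALTHY or the extra timeout
// expires. Tasks typically start as UNKNOWN and then move to HEALTHY or
// UNHEALTHY as the container health checks run.
func waitForHealthy(listTaskHealth func() ([]string, error), timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		statuses, err := listTaskHealth()
		if err != nil {
			return err
		}
		allHealthy := len(statuses) > 0
		for _, s := range statuses {
			if s != "HEALTHY" {
				allHealthy = false
				break
			}
		}
		if allHealthy {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for all tasks to report HEALTHY")
		}
		time.Sleep(15 * time.Second) // poll interval
	}
}
```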
Thanks for the pointers :)
From the waiter function itself:
Looks like servicesWaitDelay is set to 15 seconds with servicesMaxRetries at 40 -- 10 minutes, hardcoded.
https://github.com/aws/amazon-ecs-cli/blob/3970a6c1f38bc750ff9c068f53220aa177c7df54/ecs-cli/modules/utils/waiters/waiters.go#L37 https://github.com/aws/amazon-ecs-cli/blob/3970a6c1f38bc750ff9c068f53220aa177c7df54/ecs-cli/modules/utils/waiters/waiters.go#L42
But in the loop, servicesMaxRetries isn't actually used - the loop condition is just true https://github.com/aws/amazon-ecs-cli/blob/3970a6c1f38bc750ff9c068f53220aa177c7df54/ecs-cli/modules/utils/waiters/waiters.go#L51-L52 ...unlike the equivalent loop for Tasks https://github.com/aws/amazon-ecs-cli/blob/3970a6c1f38bc750ff9c068f53220aa177c7df54/ecs-cli/modules/utils/waiters/waiters.go#L70
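Paraphrasing the shape of those two loops (from memory of the linked lines, so treat this as an approximation rather than a quote of waiters.go; the task-side names and values are stand-ins):

```go
// Approximate shape of the two waiters; the point is only the structural difference.
package sketch

import (
	"errors"
	"time"
)

const (
	servicesWaitDelay  = 15 * time.Second // per waiters.go
	servicesMaxRetries = 40               // per waiters.go - declared, but not consulted by the service loop
	tasksWaitDelay     = 15 * time.Second // stand-in value for the task-side delay
	tasksMaxRetries    = 40               // stand-in name/value for the task-side limit
)

// Service waiter: an unbounded loop, so exiting is left entirely to the
// completion function (the checks in service_helper.go) returning true or an error.
func waitForService(upToDate func() (bool, error)) error {
	for {
		done, err := upToDate()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		time.Sleep(servicesWaitDelay)
	}
}

// Task waiter: the same idea, but bounded by a retry counter, so it gives up
// after a fixed number of polls even if the completion check never passes.
func waitForTasks(upToDate func() (bool, error)) error {
	for i := 0; i < tasksMaxRetries; i++ {
		done, err := upToDate()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		time.Sleep(tasksWaitDelay)
	}
	return errors.New("max retries exceeded waiting for tasks")
}
```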
I could be wrong, but is this why I don't see the command exiting for a deployment in that scenario?
So the check at https://github.com/aws/amazon-ecs-cli/blob/3970a6c1f38bc750ff9c068f53220aa177c7df54/ecs-cli/modules/cli/compose/entity/service/service_helper.go#L141 is true, but the other conditions below it are not, because the container/task stops again shortly after, and ECS is still running the old deployment while trying to start the new one alongside it.
I guess there's a reason why it's handled differently for Services than for Tasks and doesn't use servicesMaxRetries? The Git history doesn't seem to have any clues.