
[ECS] [Deployment]: ECS deployment circuit breaker should deal with exceptional exit container

Open forward2you opened this issue 3 years ago • 40 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request What do you want us to build?

Use Cloudformation to update ECS background task

Which service(s) is this request for? ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

As shown in the demo of the deployment circuit breaker, a container that fails to start with an error such as docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"exit\": executable file not found in $PATH": unknown is handled by the deployment circuit breaker.

However, the more common case is a container that starts successfully but then exits abnormally.

For example, the Dockerfile:

FROM alpine:latest
# The container starts normally, then immediately exits with a non-zero code
CMD ["sh", "-c", "exit 1"]

The current situation is that the container stops with Essential container in task exited and the task is marked as failed, but when the next task starts, the failedTasks count is reset to 1, which means the circuit breaker threshold is never triggered.

What we expect is that a container which runs but exits abnormally is counted as failed, without resetting the failedTasks count, so that the breaker threshold is eventually met and the deployment rolls back.
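The counting problem can be sketched as a toy simulation (illustrative Python, not ECS internals; the reset behaviour follows the description above):

```python
def breaker_trips(exits, threshold, reset_on_restart):
    """Toy model of the circuit breaker's failedTasks accounting.

    exits: number of abnormal container exits during the deployment.
    With reset_on_restart=True (the observed behaviour) the count is reset
    each time a replacement task starts, so the threshold is never reached;
    with reset_on_restart=False (the expected behaviour) it accumulates.
    """
    failed = 0
    for _ in range(exits):
        failed += 1                 # a task exited abnormally
        if failed >= threshold:
            return True             # breaker trips, rollback happens
        if reset_on_restart:
            failed = 1              # per the report: reset to 1 when the next task starts
    return False
```

Even with an unlimited number of abnormal exits, the observed variant never trips the breaker, which matches the infinite retry loop described above.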

forward2you avatar Dec 24 '20 11:12 forward2you

This issue could probably use a higher priority, since even the Circuit Breaker official demo (https://www.youtube.com/watch?v=Y2Ez9M7A95Y) doesn't work because of this behaviour.

It also makes the CB unreliable, as it doesn't catch all types of deployment failures. As an example, I have microservices that will throw an exception and exit early if there is an error in their DB connection string. With the current behaviour of CB, deploying one of those with a wrong DB string will churn on forever.

LeMikaelF avatar Mar 08 '21 18:03 LeMikaelF

Any more visibility on this issue? This behavior kind of defeats the purpose of the circuit breaker feature - if we can't trust it to catch all types of ECS task failures, we'll need to implement our own fail-safes, alerts, and rollback functionality anyway.

nickfaughey avatar Apr 06 '21 18:04 nickfaughey

Having a similar issue to @LeMikaelF where the app shuts down when a required environment variable is missing. The app just shuts down and the failedTasks is incremented to 1, but as soon as that happens, it is decremented back to 0 and the threshold is never crossed. If CBs don't work for this kind of situation then they are pretty useless to our team :(

mimozell avatar Apr 16 '21 08:04 mimozell

Can someone explain why it decrements to begin with? I'm not sure I understand.

dezren39 avatar May 12 '21 17:05 dezren39

I just spent an entire day trying to figure out why my circuit breaker was never triggering before coming across this issue. I dutifully followed the documentation and built out the SNS topic and EventBridge rules and subscribed them to Datadog to send me notifications about when my deploys fail, only to discover that was all wasted effort because the circuit breaker is functionally useless.

I just need to know when my containers are spinning up, "running" for a few seconds, and dying before ever being marked healthy by the ALB they sit behind. This certainly seems like core functionality of a deployment circuit breaker, and the documentation absolutely misleads you into thinking that this is how the circuit breaker will behave. This paragraph says that the circuit breaker will trip if the ALB healthchecks mark the container as unhealthy, but if the container exits before the ALB healthchecks run enough times to mark it as unhealthy, then that container is considered deployed successfully and it just retries forever.

Even a circuit breaker case as naive as saying "if this deploy hasn't been marked as completed in X minutes, mark it as failed" would be beneficial for this specific case.
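Such a naive check could be sketched like this (a hypothetical helper, assuming the deployment dict shape that boto3's describe_services returns; the timeout value is an arbitrary example):

```python
from datetime import datetime, timedelta, timezone

def deployment_timed_out(deployment, timeout_minutes=10, now=None):
    """Treat a deployment as failed if it has not completed within the window.

    `deployment` is one entry of the `deployments` list returned by
    describe_services; only `rolloutState` and `createdAt` are used here.
    """
    now = now or datetime.now(timezone.utc)
    if deployment.get("rolloutState") == "COMPLETED":
        return False
    created = deployment["createdAt"]  # boto3 returns this as a tz-aware datetime
    return now - created > timedelta(minutes=timeout_minutes)
```

A CI job could poll this and trigger a manual rollback (update the service to the previous task definition) when it returns True.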

jtyoung avatar May 12 '21 22:05 jtyoung

Ran into this issue the other day. A failing deploy kept failing instead of rolling back. This was then followed by autoscaling trying to scale up the failing deployment (because it was more recent?), and then by capacity being too low on the old deployment, which stayed live until the failing deployment was dealt with by manual intervention.

Does anyone have a good workaround for this until it's resolved?

robert-put avatar Jun 03 '21 18:06 robert-put

@robert-put The workaround is to not use circuit breaker at all. Do something on your own. Projects like ecs-deploy may help with that.

jenshoffmann1331 avatar Jun 04 '21 06:06 jenshoffmann1331

Just ran into similar issue. Deployment circuit breaker: "enabled with rollback"

Updated the task definition and the deployment was stuck in "In progress" for at least 30 minutes, with no events at all. Tried twice more with no further deployments or events; I basically had to delete the service and create it again. Not ideal, to say the least.

hampsterx avatar Jun 24 '21 23:06 hampsterx

Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset. @jtyoung I believe this should resolve the issue you faced. For others, do you also have health checks configured for your services? If so, circuit breaker should now work for your use cases.

vibhav-ag avatar Dec 16 '21 22:12 vibhav-ag

this issue was the straw that broke the camel's back. we moved to kubernetes.


nahum-litvin-hs avatar Dec 17 '21 04:12 nahum-litvin-hs

Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset. @jtyoung I believe this should resolve the issue you faced. For others, do you also have health checks configured for your services? If so, circuit breaker should now work for your use cases.

I also think that CB only works well in combination with defined container health checks.

jgrumboe avatar Dec 17 '21 08:12 jgrumboe

Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset.

I am having this issue right now, and the failedTask count is being reset every time. My container fails to start way before it could respond to the healthcheck.

thule0 avatar Feb 03 '22 19:02 thule0

I'm having this issue right now... the pipeline times out and the deploy retries infinitely; I have to force a rollback manually (by updating the service to the previous task definition revision).

leoddias avatar Feb 17 '22 14:02 leoddias

@leoddias @thule0 thank you for flagging this. Could you please reach out to me at [email protected] with more details so we can triage this.

vibhav-ag avatar Feb 17 '22 21:02 vibhav-ag

@vibhav-ag how can I help? Do you have trouble reproducing this?

It has always been like that for me: deploy a working container, then try to deploy a container that is fundamentally broken and cannot start, it keeps retrying, circuit breaker does not stop this process.

thule0 avatar Feb 17 '22 22:02 thule0

@thule0 Do you have container healthchecks configured in your taskdefinition?

jgrumboe avatar Feb 18 '22 08:02 jgrumboe

@jgrumboe I tried both with and without a defined healthcheck, same result.

thule0 avatar Feb 18 '22 09:02 thule0

@vibhav-ag @jgrumboe I have everything implemented through IaC, with the following task definition template:

  TaskDefinition:
      Type: AWS::ECS::TaskDefinition
      Properties:
        Family: !Sub "${StackNamePrefix}-${ServiceName}"
        NetworkMode: awsvpc
        RequiresCompatibilities:
          - FARGATE
        Cpu: !Ref ContainerCpu
        Memory: !Ref ContainerMemory
        ExecutionRoleArn: !Ref ExecutionRoleArn
        TaskRoleArn: !Ref TaskRoleArn
        ContainerDefinitions:
          - Name: !Ref ServiceName
            Image: !Sub "{{resolve:ssm:${ApplicationImageParameter}}}"
            PortMappings:
              - ContainerPort: !Ref ContainerPort
            HealthCheck:
              Interval: !Ref HealthCheckInterval
              Retries: !Ref HealthCheckRetries
              StartPeriod: !Ref StartPeriod
              Timeout: !Ref HealthCheckTimeout
              Command:
                - CMD-SHELL
                - !Sub 'curl -f http://127.0.0.1:${ContainerPort}${HealthCheckPath} || exit 1'
            LogConfiguration:
              LogDriver: awslogs
              Options:
                awslogs-region: !Ref AWS::Region
                awslogs-group: !Ref LogGroup
                awslogs-stream-prefix: ecs
            Environment:
              - Name: ENV
                Value: !Ref Environment
              - Name: PORT 
                Value: !Ref ContainerPort 
              - Name: NODE_ENV
                Value: !Ref Environment
              - Name: SPRING_PROFILES_ACTIVE
                Value: !Ref Environment
              - Name: APP_TYPE
                Value: !Ref AppType
            DockerLabels:
              traefik.enable: true

As you can see, I have container health checks, and as you probably know, I don't have a target group since I use Traefik as the router. Things we use in this workload: CodePipeline with an ECS deploy stage, and ECS Fargate with rollback enabled. The issue happens on every deployment that fails on boot, which gives us the message "Stopped reason: Essential container in task exited" and loops new tasks infinitely (a new manual deployment of the service, pointing at the previous task definition, is necessary to recover).

Let me know if you guys need more details

leoddias avatar Feb 21 '22 20:02 leoddias

Thanks @leoddias this is helpful- will look into this and circle back.

vibhav-ag avatar Feb 22 '22 00:02 vibhav-ag

I am facing another case that I think fits this topic. I have an app that applies DB migrations on startup. A deployment has now been failing for 15 minutes because of a bad migration, but failedTasks is always 1 despite many retries.

I have set up ALB health checks (because they are mandatory) and also ECS health checks, all via CDK.

App error logs:

| 2022-03-08T10:40:50.082+00:00 | npm ERR! code ELIFECYCLE
| 2022-03-08T10:40:50.082+00:00 | npm ERR! errno 1
| 2022-03-08T10:40:50.085+00:00 | npm ERR! [email protected] start: `strapi start`
| 2022-03-08T10:40:50.086+00:00 | npm ERR! Exit status 1

ECS task exit error:

Stopped reason Essential container in task exited

Service deployment:

  "desiredCount": 1,
  "pendingCount": 1,
  "runningCount": 0,
  "failedTasks": 1,
  "createdAt": "2022-03-08T10:25:18.050000+00:00",
  "updatedAt": "2022-03-08T10:25:18.050000+00:00",
  "launchType": "FARGATE",
  "platformVersion": "1.4.0",
  "platformFamily": "Linux",

ayozemr avatar Mar 08 '22 10:03 ayozemr

We're facing this issue as well, so I'd like to add a concrete example to the reports here.

Note: Sorry for the wall of text, I just want to make sure that I cover as much of what I have gathered as possible, in the hopes that it will either be falsified by someone who has had even better insights or will help someone who is stuck with this issue.

The scenario

Let's say that the infrastructure looks like this:

(Infrastructure diagram)

The tasks/containers consist of APIs that require details from Secrets Manager in order to connect to a database. The secrets are fetched from within the task itself as part of the initialization of the API. If it is unable to reach Secrets Manager for some reason, the API exits with a non-zero exit code.

Let's say that there is a new deployment to the service that adds a new secret without updating the permissions for the task appropriately, resulting in a permission issue. The API starts normally (the container enters RUNNING state) and within a few seconds it reaches the point where it is supposed to fetch the secrets, but it fails and exits. When this happens, the task status transitions into STOPPED with an Essential container in task exited error message, as expected.

The issue

If you were running the container locally, depending on your configuration you might expect the container to just perform a restart (restart-always). This is often fine if the error is an exception caused by something temporary, but in this case, the task will just exit again immediately.

According to the documentation for ECS Services, at least AFAIU, restart attempts for containers that repeatedly fail to enter a RUNNING state will be throttled. I can't find any similar information about containers that do enter a RUNNING state before failing, so I am assuming that it will just restart automatically in a loop similar to a restart-always policy. This is also what I have been observing in real situations so far.

In the described scenario, that means that the failing container will be infinitely restarted unless someone manually intervenes and updates the service to use a working (previous) task definition. The only way a developer would find out that this has happened is when the CI times out when checking for successful deployment, which can take up to 10 minutes for the AWS CLI by default in my experience.

The "solution"

In order to mitigate the above scenario, one could implement rollbacks. In a perfect world, a rollback feature would at least recognize that:

  • A new deployment is not able to reach a RUNNING state for more than n seconds
  • A successfully deployed & running task is not responding to health checks

The ECS Deployment Circuit Breaker and its rollback option seems to cover the above when reading the documentation:

rollback

Determines whether to configure Amazon ECS to roll back the service if a service deployment fails. If rollback is enabled, when a service deployment fails, the service is rolled back to the last deployment that completed successfully.

I think it's safe to say that the description seems to imply that failing tasks will result in a failed deployment and subsequently a rollback. But this is only half the truth.

Clarification

After a chat with AWS support, this is what I have been able to establish about the situation:

What the documentation really wants to say is that tasks that fail immediately, without ever reaching a RUNNING state, will cause a service deployment to fail. That's the catch.

Tasks that successfully reach a RUNNING state, even if only for a few seconds, have to fail the associated health checks. If the task exits before it has any chance to answer any of the health checks, an infinite loop begins.

SkySails avatar Mar 22 '22 16:03 SkySails

Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset. @jtyoung I believe this should resolve the issue you faced. For others, do you also have health checks configured for your services? If so, circuit breaker should now work for your use cases.

@vibhav-ag , I can confirm this is not working as intended. The failedTasks count goes up, but then two seconds later is discarded or decremented.

  {
    "id": "ecs-svc/5835613358439999935",
    "status": "PRIMARY",
    "taskDefinition": "arn:aws:ecs:us-east-1:[redacted]:task-definition/fail_fast_testing:2",
    "desiredCount": 2,
    "pendingCount": 2,
    "runningCount": 0,
    "failedTasks": 2,
    "createdAt": "2022-04-13T17:10:06.266000-04:00",
    "updatedAt": "2022-04-14T11:55:11.607000-04:00",
{
   "id": "ecs-svc/5835613358439999935",
   "status": "PRIMARY",
   "taskDefinition": "arn:aws:ecs:us-east-1:[redacted]:task-definition/fail_fast_testing:2",
   "desiredCount": 2,
   "pendingCount": 1,
   "runningCount": 1,
   "failedTasks": 0,
   "createdAt": "2022-04-14T11:54:54.873000-04:00",
   "updatedAt": "2022-04-14T11:56:01.402000-04:00",

Can we get a status on this? Catching tasks that fail to run is a core use-case of this feature.

rocco-alchemy avatar Apr 14 '22 16:04 rocco-alchemy

I'd like to give +1 to this topic. As a long-term user of ECS, I was actually very confused that the current Deployment Circuit Breaker doesn't count tasks that fail with exit code 1 during deployment but only works for situations when the task is unable to be placed on the cluster entirely. That happens quite infrequently because once the execution policy is properly crafted for a service then the MAJORITY of the cases when a rollback is needed is when the new version of the application quickly fails e.g. due to a missing ENV variable.

Ideally, the tasks that were scheduled on the cluster, started and failed while the deployment was still running should be counted as failedTasks as well.

The deployment structure returned by the describe-services contains a rolloutState field that is set to IN_PROGRESS while the deployment is still running. A failedTasks value should never be reset to 0 if the rolloutState of a deployment is IN_PROGRESS. When the deployment is finished rolloutState is set to COMPLETED and from that moment a circuit breaker should be deactivated.
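As a sketch of that proposed accounting (illustrative Python, not ECS internals, using the field names from describe-services):

```python
def update_failed_tasks(deployment, new_failures):
    """Proposed rule: gate resets of failedTasks on rolloutState.

    `deployment` is a mutable dict using the describe-services field names.
    While the rollout is IN_PROGRESS, failures only accumulate; only once
    it is COMPLETED does the breaker deactivate and the count reset.
    """
    if deployment["rolloutState"] == "IN_PROGRESS":
        deployment["failedTasks"] += new_failures   # never reset mid-rollout
    elif deployment["rolloutState"] == "COMPLETED":
        deployment["failedTasks"] = 0               # breaker deactivated
    return deployment["failedTasks"]
```

Under this rule, a fast-failing container would push failedTasks past the threshold instead of hovering at 0 or 1.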

I think that everybody expects the Deployment Circuit Breaker to work pretty much as I described above.

Best Regards, Krzysztof

kszarlej avatar Apr 15 '22 17:04 kszarlej

I just stumbled upon this problem as well. Given a container image with basically CMD exit 2, the deployment does eventually fail, but failedTasks does not count every failed task, and it takes about 30 minutes for ECS to realize that it is indeed broken. So it "semi-works".

Now, adjusting this to a more real-life scenario with CMD sleep 10 && exit 2 (meaning the container starts to wire up whatever implementation it has, then eventually fails), the failedTasks count stays at 0.

Makes the whole feature only useful for obvious errors and missing dependencies or similar.

jishi avatar Apr 19 '22 15:04 jishi

Hi All, thanks for flagging the issue- we are triaging this and will circle back with an update here.

vibhav-ag avatar Apr 20 '22 14:04 vibhav-ag

Hello @vibhav-ag any update on that?

kszarlej avatar May 04 '22 08:05 kszarlej

@kszarlej thanks for following-up. We did identify some issues here and are working on making some changes- I will share an update on the thread once changes are rolled out.

vibhav-ag avatar May 04 '22 17:05 vibhav-ag

Hi, do we know when this will be resolved? This is costing us thousands of dollars per month, because we have hundreds of ECS services and AWS Config recording every configuration change.

ninerealms avatar May 06 '22 09:05 ninerealms

@vibhav-ag Could you shed some more light on what you found out and when we can expect the fixes to be live? If we can't get reliable automated rollbacks, I might unfortunately be forced to migrate many ECS services to K8s :(

kszarlej avatar May 17 '22 09:05 kszarlej

@kszarlej thanks for following up.

Sharing some additional context here: ECS Circuit Breaker distinguishes between Tasks that fail to launch and Tasks that fail shortly after starting (i.e. fast-failing containers). Because of this, in scenarios where some Tasks in a service fail to launch while others fast-fail, the failedTasks count (the max of these two scenarios) can keep getting reset. To get to a more consistent experience, we are fixing this issue so that the failedTasks count (across both scenarios) is only reset when a Task passes health checks (or runs for a minimum period of time in case no health checks are configured). I can't share a concrete timeline, but I can say that we are actively working on rolling out this change, and I will share an update once it is available.

Separately, for automated rollbacks, another capability we are looking to add is integration with CloudWatch alarms so that if an alarm is triggered, ECS would rollback the deployment. This is further out, but we would love to hear if this would be valuable for your use case as well.

vibhav-ag avatar May 17 '22 16:05 vibhav-ag