containers-roadmap
[ECS] [Deployment]: ECS deployment circuit breaker should deal with exceptional exit container
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Tell us about your request. What do you want us to build?
Use CloudFormation to update an ECS background task.
Which service(s) is this request for? ECS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
As in the deployment circuit breaker demo, the container fails to start with the error:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"exit\": executable file not found in $PATH": unknown.
This case is handled by the deployment circuit breaker.
However, a more common case is that the container starts successfully but then exits abnormally.
For example, a Dockerfile whose container starts and then exits with a non-zero code:
FROM alpine:latest
CMD ["sh", "-c", "exit 1"]
The current situation is that the container is stopped with "Essential container in task exited" and marked as failed, but when the second task starts, the failedTasks count is reset to 1, which means the circuit breaker threshold will never be reached.
What we expect is that a task which runs but exits abnormally is counted as failed and the failedTasks count is not reset, so that the breaker threshold is eventually reached and the deployment is rolled back.
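For reference, enabling the breaker with rollback looks roughly like this (a minimal boto3 sketch; the cluster and service names are placeholders, and the CloudFormation equivalent is the DeploymentConfiguration / DeploymentCircuitBreaker property on AWS::ECS::Service):

import boto3

ecs = boto3.client("ecs")

# Turn on the deployment circuit breaker with automatic rollback for an
# existing service; cluster/service names below are placeholders.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    deploymentConfiguration={
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
    },
)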
This issue could probably use a higher priority, since even the official Circuit Breaker demo (https://www.youtube.com/watch?v=Y2Ez9M7A95Y) doesn't work because of this behaviour.
It also makes the CB unreliable, as it doesn't catch all types of deployment failures. As an example, I have microservices that will throw an exception and exit early if there is an error in their DB connection string. With the current behaviour of CB, deploying one of those with a wrong DB string will churn on forever.
Any more visibility on this issue? This behavior kind of defeats the purpose of the circuit breaker feature - if we can't trust it to catch all types of ECS task failures, we'll need to implement our own fail-safes, alerts, and rollback functionality anyway.
Having a similar issue to @LeMikaelF where the app shuts down when a required environment variable is missing. The app just shuts down and the failedTasks is incremented to 1, but as soon as that happens, it is decremented back to 0 and the threshold is never crossed. If CBs don't work for this kind of situation then they are pretty useless to our team :(
Can someone explain why it decrements to begin with? I'm not sure I understand.
I just spent an entire day trying to figure out why my circuit breaker was never triggering before coming across this issue. I dutifully followed the documentation and built out the SNS topic and EventBridge rules and subscribed them to Datadog to send me notifications about when my deploys fail, only to discover that was all wasted effort because the circuit breaker is functionally useless.
I just need to know when my containers are spinning up, "running" for a few seconds, and dying before ever being marked healthy by the ALB they sit behind. This certainly seems like core functionality of a deployment circuit breaker, and the documentation absolutely misleads you into thinking that this is how the circuit breaker will behave. This paragraph says that the circuit breaker will trip if the ALB healthchecks mark the container as unhealthy, but if the container exits before the ALB healthchecks run enough times to mark it as unhealthy, then that container is considered deployed successfully and it just retries forever.
Even a circuit breaker case as naive as saying "if this deploy hasn't been marked as completed in X minutes, mark it as failed" would be beneficial for this specific case.
Ran into this issue the other day. It caused a failing deploy to keep failing instead of rolling back. This was then followed by autoscaling trying to scale up the failing deployment (because it was more recent?), and then by scaling being too low on the old deployment that was still live, until the failing deployment was dealt with through manual intervention.
Does anyone have a good workaround for this until it's resolved?
@robert-put The workaround is to not use circuit breaker at all. Do something on your own. Projects like ecs-deploy may help with that.
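A rough sketch of what "something on your own" could look like, along the lines of the "mark it as failed after X minutes" idea above (assumes boto3; the cluster/service names, the 10-minute deadline, and the known-good task definition to fall back to are all placeholders, not anything ECS provides out of the box):

import time
import boto3

ecs = boto3.client("ecs")

def wait_or_rollback(cluster, service, previous_task_def, timeout_s=600):
    # Poll the PRIMARY deployment; if it has not completed within the
    # deadline, treat the deploy as failed and roll the service back by hand.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
        primary = next(d for d in svc["deployments"] if d["status"] == "PRIMARY")
        if primary.get("rolloutState") == "COMPLETED":
            return True
        time.sleep(15)
    # Deadline passed: point the service back at the last known-good task definition.
    ecs.update_service(cluster=cluster, service=service, taskDefinition=previous_task_def)
    return False

Calling something like wait_or_rollback("prod", "api", "api:42") after triggering the deploy approximates a deadline-based circuit breaker until the built-in one handles fast-failing tasks.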
Just ran into a similar issue. Deployment circuit breaker: "enabled with rollback".
Updated the task definition and the deployment was stuck in "In progress" for at least 30 minutes, with no events at all. Tried twice more; no further deployments or events. I basically had to delete the service and create it again, which is not ideal to say the least.
Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset. @jtyoung I believe this should resolve the issue you faced. For others, do you also have health checks configured for your services? If so, circuit breaker should now work for your use cases.
this issue was the straw that broke the camel's back. we moved to kubernetes.
Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset. @jtyoung I believe this should resolve the issue you faced. For others, do you also have health checks configured for your services? If so, circuit breaker should now work for your use cases.
I also think that CB only works well in combination with defined container health checks.
Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset.
I am having this issue right now, and the failedTask count is being reset every time. My container fails to start way before it could respond to the healthcheck.
I'm having this issue right now... The pipeline times out, the deployment keeps rolling out indefinitely, and I have to force a rollback manually (by updating the service to the previous task definition revision).
@leoddias @thule0 thank you for flagging this. Could you please reach out to me at [email protected] with more details so we can triage this.
@vibhav-ag how can I help? Do you have trouble reproducing this?
It has always been like that for me: deploy a working container, then try to deploy a container that is fundamentally broken and cannot start, it keeps retrying, circuit breaker does not stop this process.
@thule0 Do you have container healthchecks configured in your taskdefinition?
@jgrumboe I tried both with and without a defined healthcheck, same result.
@vibhav-ag @jgrumboe I've implemented everything through IaC, and I have the following template for the task definition:
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: !Sub "${StackNamePrefix}-${ServiceName}"
    NetworkMode: awsvpc
    RequiresCompatibilities:
      - FARGATE
    Cpu: !Ref ContainerCpu
    Memory: !Ref ContainerMemory
    ExecutionRoleArn: !Ref ExecutionRoleArn
    TaskRoleArn: !Ref TaskRoleArn
    ContainerDefinitions:
      - Name: !Ref ServiceName
        Image: !Sub "{{resolve:ssm:${ApplicationImageParameter}}}"
        PortMappings:
          - ContainerPort: !Ref ContainerPort
        HealthCheck:
          Interval: !Ref HealthCheckInterval
          Retries: !Ref HealthCheckRetries
          StartPeriod: !Ref StartPeriod
          Timeout: !Ref HealthCheckTimeout
          Command:
            - CMD-SHELL
            - !Sub 'curl -f http://127.0.0.1:${ContainerPort}${HealthCheckPath} || exit 1'
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-region: !Ref AWS::Region
            awslogs-group: !Ref LogGroup
            awslogs-stream-prefix: ecs
        Environment:
          - Name: ENV
            Value: !Ref Environment
          - Name: PORT
            Value: !Ref ContainerPort
          - Name: NODE_ENV
            Value: !Ref Environment
          - Name: SPRING_PROFILES_ACTIVE
            Value: !Ref Environment
          - Name: APP_TYPE
            Value: !Ref AppType
        DockerLabels:
          traefik.enable: true
As you can see, I have container health checks, and as you probably know I don't have a target group since I use Traefik as the router. Things we use in this workload: CodePipeline with an ECS deploy stage, and ECS Fargate with rollback enabled. The issue happens in every deployment that fails on boot and gives us the following message: "Stopped reason: Essential container in task exited". The new tasks loop infinitely (a new manual deployment of the service, pointing at the previous task definition, is necessary).
Let me know if you guys need more details
Thanks @leoddias this is helpful- will look into this and circle back.
I am facing another case that I think fits this topic. I have an app that applies DB migrations on startup. A deployment has now been failing for 15 minutes because of a bad migration, but failedTasks is always 1 even though it has retried many times.
I have set up ALB health checks, because they are mandatory, and also ECS health checks, all via CDK:
App error logs:
| 2022-03-08T10:40:50.082+00:00 | npm ERR! code ELIFECYCLE
| 2022-03-08T10:40:50.082+00:00 | npm ERR! errno 1
| 2022-03-08T10:40:50.085+00:00 | npm ERR! [email protected] start: `strapi start`
| 2022-03-08T10:40:50.086+00:00 | npm ERR! Exit status 1
ECS task exit error:
Stopped reason Essential container in task exited
Service deployment:
"desiredCount": 1,
"pendingCount": 1,
"runningCount": 0,
"failedTasks": 1,
"createdAt": "2022-03-08T10:25:18.050000+00:00",
"updatedAt": "2022-03-08T10:25:18.050000+00:00",
"launchType": "FARGATE",
"platformVersion": "1.4.0",
"platformFamily": "Linux",
We're facing this issue as well, so I'd like to add a concrete example to the reports here.
Note: Sorry for the WOT, I just want to make sure that I cover as much of what I have gathered as possible in the hopes that it will either be falsified by someone who has had even better insights or helps someone that is stuck with this issue.
The scenario
Let's say that the infrastructure looks like this:
The tasks/containers consist of APIs that require details from Secrets Manager in order to connect to a database. The secrets are fetched from within the task itself as part of the initialization of the API. If it is unable to reach Secrets Manager for some reason, the API exits with a non-zero exit code.
Let's say that there is a new deployment to the service that adds a new secret without updating the permissions for the task appropriately, resulting in a permission issue. The API starts normally (the container enters the RUNNING state) and within a few seconds it reaches the point where it is supposed to fetch the secrets, but it fails and exits. When this happens, the task status transitions into STOPPED with an "Essential container in task exited" error message, as expected.
The issue
If you were running the container locally, depending on your configuration you might expect the container to just perform a restart (restart-always). This is often fine if the error is an exception caused by something temporary, but in this case, the task will just exit again immediately.
According to the documentation for ECS Services, at least AFAIU, restart attempts for containers that repeatedly fail to enter a RUNNING state will be throttled. I can't find any similar information about containers that do enter a RUNNING state before failing, so I am assuming that it will just restart automatically in a loop similar to a restart-always policy. This is also what I have been observing in real situations so far.
In the described scenario, that means that the failing container will be infinitely restarted unless someone manually intervenes and updates the service to use a working (previous) task definition. The only way a developer would find out that this has happened is when the CI times out when checking for successful deployment, which can take up to 10 minutes for the AWS CLI by default in my experience.
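For reference, that ~10 minute figure seems to match the defaults of the SDK's services-stable waiter; a minimal boto3 sketch (cluster/service names are placeholders) that reproduces the wait-then-time-out behaviour:

import boto3

ecs = boto3.client("ecs")

# `aws ecs wait services-stable` maps to this waiter: 15-second polls,
# 40 attempts, i.e. roughly 10 minutes before a WaiterError is raised.
waiter = ecs.get_waiter("services_stable")
waiter.wait(
    cluster="my-cluster",
    services=["my-service"],
    WaiterConfig={"Delay": 15, "MaxAttempts": 40},
)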
The "solution"
In order to mitigate the above scenario, one could implement rollbacks. In a perfect world, a rollback feature would at least recognize that:
- A new deployment is not able to reach a RUNNING state for more than n seconds
- A successfully deployed & running task is not responding to health checks
The ECS Deployment Circuit Breaker and its rollback option seem to cover the above when reading the documentation:
rollback - Determines whether to configure Amazon ECS to roll back the service if a service deployment fails. If rollback is enabled, when a service deployment fails, the service is rolled back to the last deployment that completed successfully.
I think it's safe to say that the description seems to imply that failing tasks will result in a failed deployment and subsequently a rollback. But this is only half the truth.
Clarification
After a chat with AWS support, this is what I have been able to establish about the situation:
What the documentation really wants to say is that tasks that fail immediately, without ever reaching a RUNNING state, will cause a service deployment to fail. That's the catch.
Tasks that successfully reach a RUNNING state, be it only for a few seconds, have to fail the associated health checks. If the task exits before it has any chance to answer any of the health checks, an infinite loop begins.
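For completeness, this is roughly what a container-level health check looks like when registering a task definition with boto3, so that tasks which do reach RUNNING at least have a health check to fail (family, image, path, and thresholds below are placeholders for illustration only):

import boto3

ecs = boto3.client("ecs")

# Register a task definition whose essential container has a health check;
# every value below is a placeholder.
ecs.register_task_definition(
    family="my-service",
    containerDefinitions=[
        {
            "name": "app",
            "image": "my-registry/my-app:latest",
            "essential": True,
            "memory": 512,
            "healthCheck": {
                "command": ["CMD-SHELL", "curl -f http://127.0.0.1:8080/health || exit 1"],
                "interval": 10,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 30,
            },
        }
    ],
)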
Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset. @jtyoung I believe this should resolve the issue you faced. For others, do you also have health checks configured for your services? If so, circuit breaker should now work for your use cases.
@vibhav-ag, I can confirm this is not working as intended. The failedTasks count goes up, but then two seconds later it is discarded or decremented.
{
"id": "ecs-svc/5835613358439999935",
"status": "PRIMARY",
"taskDefinition": "arn:aws:ecs:us-east-1:[redacted]:task-definition/fail_fast_testing:2",
"desiredCount": 2,
"pendingCount": 2,
"runningCount": 0,
"failedTasks": 2,
"createdAt": "2022-04-13T17:10:06.266000-04:00",
"updatedAt": "2022-04-14T11:55:11.607000-04:00",
{
"id": "ecs-svc/5835613358439999935",
"status": "PRIMARY",
"taskDefinition": "arn:aws:ecs:us-east-1:[redacted]:task-definition/fail_fast_testing:2",
"desiredCount": 2,
"pendingCount": 1,
"runningCount": 1,
"failedTasks": 0,
"createdAt": "2022-04-14T11:54:54.873000-04:00",
"updatedAt": "2022-04-14T11:56:01.402000-04:00",
Can we get a status on this? Catching tasks that fail to run is a core use-case of this feature.
I'd like to give a +1 to this topic. As a long-term user of ECS, I was actually very confused that the current Deployment Circuit Breaker doesn't count tasks that fail with exit code 1 during deployment, but only works for situations when the task cannot be placed on the cluster at all. That happens quite infrequently, because once the execution policy is properly crafted for a service, the MAJORITY of cases where a rollback is needed are those where the new version of the application fails quickly, e.g. due to a missing ENV variable.
Ideally, the tasks that were scheduled on the cluster, started and failed while the deployment was still running should be counted as failedTasks as well.
The deployment structure returned by describe-services contains a rolloutState field that is set to IN_PROGRESS while the deployment is still running. A failedTasks value should never be reset to 0 if the rolloutState of a deployment is IN_PROGRESS. When the deployment is finished, rolloutState is set to COMPLETED, and from that moment the circuit breaker should be deactivated.
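For anyone who wants to watch this happening during a deploy, a quick boto3 sketch (cluster/service names are placeholders) that dumps those fields per deployment:

import boto3

ecs = boto3.client("ecs")

# Print the rollout state and failure counters for every deployment of the
# service; this is where the reset described above can be observed.
svc = ecs.describe_services(cluster="my-cluster", services=["my-service"])["services"][0]
for d in svc["deployments"]:
    print(d["status"], d.get("rolloutState"), "failedTasks =", d.get("failedTasks", 0),
          "running =", d["runningCount"], "pending =", d["pendingCount"])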
I think that everybody expects the Deployment Circuit Breaker to work pretty much as I described above.
Best Regards, Krzysztof
I just stumbled upon this problem as well. A container image with basically CMD exit 2 does eventually fail, but failedTasks does not count every failed task, and it takes about 30 minutes for ECS to realize that it is indeed broken. So it "semi-works".
Now, adjusting this to a more real-life scenario with CMD sleep 10 && exit 2 on the container (meaning it starts to wire up whatever implementation it has, then eventually fails), the failedTasks count stays at 0.
This makes the whole feature useful only for obvious errors, missing dependencies, or similar.
Hi All, thanks for flagging the issue- we are triaging this and will circle back with an update here.
Hello @vibhav-ag any update on that?
@kszarlej thanks for following-up. We did identify some issues here and are working on making some changes- I will share an update on the thread once changes are rolled out.
Hi, do we know when this will be resolved, this is costing us thousands of dollars per month because we have hundreds of ECS services and AWS Config recording configuration changes.
@vibhav-ag Could you shed some more light on what you found out and when we can expect the fixes to be live? If we don't get reliable automated rollbacks, I might unfortunately be forced to migrate many ECS services to K8S :(
@kszarlej thanks for following up.
Sharing some additional context here: ECS Circuit Breaker distinguishes between Tasks that fail to launch and Tasks that fail shortly after starting (i.e fast-failing containers). Because of this, for scenarios where some Tasks in a service fail to launch while others fast-fail, the failedTasks count (the max of these 2 scenarios) can keep getting reset. To get to a more consistent experience, we are fixing this issue so that the failedTasks count (across both of these 2 scenarios) is only reset when a Task passes health checks (or runs for a minimum period of time in case no health checks are configured). I can't share a concrete timeline, but I can say that we are actively working on rolling out this change and I will share an update once it is available.
Separately, for automated rollbacks, another capability we are looking to add is integration with CloudWatch alarms so that if an alarm is triggered, ECS would rollback the deployment. This is further out, but we would love to hear if this would be valuable for your use case as well.