containers-roadmap
[ECS] Feature Request: Auto restart non-essential container (aka local container restart policy)
Currently, auto-restarting a non-essential container in a task is not supported. In some use cases, although a container doesn't need to be running for the task to run (and is therefore marked non-essential), it's desirable to keep it running by restarting it when it exits. The current workaround is to schedule the non-essential container as a separate service with the daemon scheduling strategy, but this decouples the container from the other containers in the task, which is not very desirable.
+1
yeah if a monitoring container goes down, I don't want to fail the whole task, I just want the monitor to come back
Need this feature desperately...
docker run already supports this with the --restart unless-stopped option, but there is no way to achieve it with ECS. Any ideas?
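For reference, the Docker-level flag being referred to looks like this (the image name is a placeholder); an ECS task definition exposes no equivalent per-container setting:

```
# Docker restarts the container whenever it exits,
# unless it was explicitly stopped by the operator.
docker run --restart unless-stopped my-sidecar-image
```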
This feature is absolutely needed
Thanks for your feedback everyone. Questions for the group: what would your desired behavior be for a container that fails repeatedly? Would you need to have an optional parameter for controlling the maximum number of restarts? Exponential backoff? Something else? Do you need notifications or logging that this is happening?
@coultn from my experience, ECS already does some exponential backoff, doesn't it? I would love a clear indication in the dashboard that a task is restarting (often), rather than it simply showing as running. Maybe another column with the number of restarts since the last deployment?
Also a CloudWatch metric for healthiness, i.e. how long the container has been up versus when it was deployed.
@coultn I always liked the way a similar concept is implemented in Erlang supervisors' restart strategies. The gist of it is to tolerate a given maximum number of restarts within a given period of time. If the container is constantly restarting, something is wrong; but an occasional restart of a non-essential process would not affect the rest of the system. It would also be nice to support this feature in CloudFormation templates as well.
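The Erlang-style restart intensity idea above (allow at most N restarts within a sliding window, then give up) can be sketched in bash; `MAX_RESTARTS`, `WINDOW_SECS`, and the `date +%s` bookkeeping are illustrative choices, not anything ECS provides today:

```shell
#!/bin/bash
# Sketch: Erlang-supervisor-style restart intensity check.
# Allow at most MAX_RESTARTS unsolicited exits within WINDOW_SECS;
# beyond that, treat the container as permanently failed.
MAX_RESTARTS=${MAX_RESTARTS:-5}
WINDOW_SECS=${WINDOW_SECS:-60}
restart_times=()

# Returns 0 if another restart is allowed right now, 1 otherwise.
may_restart() {
  local now t
  local kept=()
  now=$(date +%s)
  for t in "${restart_times[@]}"; do
    # Keep only timestamps that still fall inside the window.
    if (( now - t < WINDOW_SECS )); then
      kept+=("$t")
    fi
  done
  restart_times=("${kept[@]}")
  if (( ${#restart_times[@]} >= MAX_RESTARTS )); then
    return 1
  fi
  restart_times+=("$now")
  return 0
}
```

A supervising loop would call `may_restart` before relaunching the child, and exit (letting ECS replace the whole task) once it returns non-zero.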
@coultn I think exponential back-off with no maximum restarts, the usual k8s approach. The back-off would increase until the period is long enough that continuous restarts pose negligible impact on ECS/Fargate, maybe 10-15m? The containers may well be failing from something external beyond their control, e.g. a 2 hour us-east-1 outage 😉 so it is better that they don't give up, and instead recover after external conditions return to normal.
As @FernandoMiguel said, there should be some way of monitoring for high restart counts, even if only at the Task/Service level. A log entry per restart is handy, particularly if it identifies the specific container that restarted.
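The capped back-off described above (double the delay per consecutive failure, with a ceiling around 10-15 minutes) can be sketched as a small bash function; `BASE_DELAY` and `MAX_DELAY` are illustrative values, not ECS parameters:

```shell
#!/bin/bash
# Sketch: capped exponential back-off for container restarts.
BASE_DELAY=${BASE_DELAY:-1}    # seconds before the first retry
MAX_DELAY=${MAX_DELAY:-900}    # cap at 15 minutes, per the comment above

# Prints the delay in seconds before restart attempt $1 (1-based).
backoff_delay() {
  local attempt=$1
  local delay=$(( BASE_DELAY * (2 ** (attempt - 1)) ))
  if (( delay > MAX_DELAY )); then
    delay=$MAX_DELAY
  fi
  echo "$delay"
}
```

A restart loop would `sleep "$(backoff_delay "$n")"` between attempts, so repeated failures quickly settle at the 15-minute ceiling instead of spinning.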
Hi! Is anybody aware of some workarounds for this one?
@morj, suggest:
- Make your main container just a monitor for the other sidecar ‘non-essential’ containers. If any sidecar container goes down, bail, and ECS will restart the whole task.
- Use AWS EKS; it is just a lot more advanced and has all this stuff covered already. I love ECS for Fargate, which has unique advantages, but plain ECS is kinda primitive compared to EKS IMHO.
+1
This feels like a no-brainer. +1
+1
Is there a way to restart the non-essential container without killing the entire task? I have a task with 2 containers, one essential and one non-essential. The essential one is always up and running. I would like to use the ECS API (or some workaround) to restart the non-essential container without killing the task (I want my essential container to stay up and running).
Any plans to implement restarting the non-essential container without killing the entire task?
I'll join the chorus and say this would be an immensely useful feature. I would like the restarting behavior to be configurable, such as a maximum number of restarts and exponential back-off.
That could save a lot of while true loops in container entrypoints.
Is there any further development on this? Coming from a Kubernetes world I thought this would be a given
Anecdotally: we got bitten by this recently. We have a task definition that comprises one container serving web requests (the "essential" container) and several worker containers that weren't marked essential.
The worker containers crashed because of an intermittent database connectivity issue, and we were left with a frontend serving web requests, but no worker containers to fulfill the backend side of things.
Even a simple "restart=always" style of switch would have made all the difference.
Our team has the same challenge but for essential containers.
Our ECS Fargate tasks comprise a microservice container (insert your use case here) and several supporting essential sidecars: Datadog agent, Consul Connect agent, and Envoy. If one of those sidecars exits, the task is terminated, causing a sudden decrease in capacity. Restarting the whole task takes 70-120s, during which time our UX latency numbers increase, triggering monitoring alarms.
From experience, Consul Connect and Envoy processes (not containers) take 2-8 seconds to restart. Therefore, we see a clear benefit for ECS to restart a stopped container instead of terminating the task.
The comments above regarding a maximum restart limit are prudent. A container that has a permanent ABEND shouldn't restart forever.
+1
A must-have feature!!
I ended up taking the while true (well actually, until true) approach of making the containers automatically restart themselves:
restart.sh:

```shell
#!/bin/bash
set -euo pipefail
# Wrapper script to restart a process whenever it exits non-zero
until "${@}"; do
  printf '{"level": "FATAL", "timestamp": "%s", "event": "%s", "command": "%s"}\n' "$(date -Iseconds)" "restarting container" "$*"
done
```
in the task definition:

```json
{
  "command": [
    "./restart.sh", "some", "command", "goes", "here"
  ]
}
```
but it would be preferable to be able to control this at the ECS level.
@alexlance We have a similar restart script that handles the following scenarios:
- Must restart a process that experiences an unsolicited exit
- A restart must pause for no less than a second to prevent an infinite restart loop consuming all the CPU for an incorrectly configured process
- Must pass the SIGTERM or SIGQUIT signal to the process
- Must not restart the process when it exits due to a SIGTERM or SIGQUIT
- Must handle signal/exit race conditions
Having the above logic built into ECS/Fargate is desirable because the lifecycle state is known by ECS and unsolicited process exits are easier to manage with that context.
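A minimal sketch of a wrapper meeting those requirements (not the poster's confidential script, and every name here is illustrative): trap SIGTERM/SIGQUIT, forward the signal to the child, stop restarting once a signal has been received, and pause at least a second between restarts:

```shell
#!/bin/bash
# Sketch: restart wrapper with signal forwarding. Restarts the wrapped
# command after an unsolicited exit, pauses at least a second between
# restarts, and stops restarting once SIGTERM/SIGQUIT has been received.
# (PID-reuse races are out of scope for this sketch.)
supervise() {
  local shutdown=0 child=0 status=0
  trap 'shutdown=1; (( child > 0 )) && kill -TERM "$child" 2>/dev/null' TERM
  trap 'shutdown=1; (( child > 0 )) && kill -QUIT "$child" 2>/dev/null' QUIT
  while (( shutdown == 0 )); do
    "$@" &                       # run the real command in the background
    child=$!
    wait "$child"
    status=$?
    if (( status > 128 && shutdown == 1 )); then
      wait "$child" 2>/dev/null  # first wait was interrupted by the trap; reap
      status=$?
    fi
    child=0
    if (( shutdown == 1 )); then
      return "$status"           # solicited exit: do not restart
    fi
    echo "child exited with status ${status}; restarting" >&2
    sleep 1                      # pause >= 1s to avoid a tight restart loop
  done
}

# Entry point, e.g. "command": ["./restart.sh", "some", "command"]
if (( $# > 0 )); then
  supervise "$@"
fi
```

Running the child in the background is what lets the trap fire promptly: `wait` is interruptible by traps, whereas bash defers traps while a foreground child runs.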
Would you care to share?
Company confidential. I have to jump through several hoops to share it. Plus it's written in bash instead of Go... nobody wants to see that :-)
+1
+22
++1
+1