containers-roadmap [ECS] Feature Request: Auto restart non-essential container (aka local container restart policy)

Open fenxiong opened this issue 5 years ago • 36 comments

Currently, automatically restarting a non-essential container in a task is not supported. In some use cases, although a container doesn't need to be running for the task to run (which is why it's marked non-essential), it's still desirable to keep it running by restarting it. The current workaround is to schedule a separate service for the non-essential container using the daemon scheduling strategy, but this decouples the container from the other containers in the task, so it's not very desirable.

Jan 09 '19 17:01 fenxiong

Jan 16 '19 11:01 aledelgo

Yeah, if a monitoring container goes down, I don't want to fail the whole task, I just want the monitor to come back.

Apr 29 '19 22:04 wontonst

Need this feature desperately...

docker run already supports this with the --restart unless-stopped option, but I cannot implement this using ECS. Any ideas?

May 22 '19 09:05 devaroop

This feature is absolutely needed

Jun 13 '19 12:06 nilroy

Thanks for your feedback everyone. Questions for the group: what would your desired behavior be for a container that fails repeatedly? Would you need to have an optional parameter for controlling the maximum number of restarts? Exponential backoff? Something else? Do you need notifications or logging that this is happening?

Jun 20 '19 22:06 coultn

@coultn from my experience, ECS already does some exponential backoff, doesn't it? I would love a clear indication in the dashboard that a task is restarting (often) rather than it saying running. Maybe another column with the number of restarts since the last deployment?

Also a CloudWatch metric for healthiness, i.e. how long it has been up vs. how long it has been deployed.

Jun 21 '19 06:06 FernandoMiguel

@coultn I always liked the way a similar concept is implemented in Erlang's supervisor restart strategy. The gist of it is to tolerate a given maximum number of restarts in a given period of time. If the container is constantly restarting, it means something is wrong, but an occasional restart of a non-essential process would not affect the rest of the system. It would also be nice to support this feature in CloudFormation templates.
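
For illustration, the "maximum number of restarts in a given period" idea could be sketched in shell roughly as follows. This is only a sketch under assumptions, not something ECS provides: the names restart_allowed, MAX_RESTARTS and PERIOD, and the values chosen for them, are made up for the example.

```shell
#!/bin/sh
# Sketch of a supervisor-style restart budget: tolerate at most
# MAX_RESTARTS restarts within any PERIOD-second window.
# All names and values here are illustrative assumptions, not ECS settings.
MAX_RESTARTS=3
PERIOD=60
restart_times=""   # space-separated epoch seconds of recent restarts

# restart_allowed NOW: record a restart at time NOW (epoch seconds).
# Returns 0 while the budget holds, 1 once it has been exceeded.
restart_allowed() {
  now=$1
  kept=""
  count=0
  for t in $restart_times; do
    # Keep only the restarts that are still inside the window.
    if [ $((now - t)) -lt "$PERIOD" ]; then
      kept="$kept $t"
      count=$((count + 1))
    fi
  done
  restart_times="$kept $now"
  count=$((count + 1))
  [ "$count" -le "$MAX_RESTARTS" ]
}
```

A wrapper loop could call restart_allowed "$(date +%s)" before each relaunch and give up (letting the container exit) once it returns non-zero.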

Jun 22 '19 17:06 mehdi-abbad

@coultn I'd suggest exponential back-off with no maximum number of restarts, the usual k8s approach. The back-off would increase until the period is long enough that continuous restarts have negligible impact on ECS/Fargate, maybe 10-15m? The containers may well be failing because of something external beyond their control, e.g. a 2 hour us-east-1 outage 😉 so it is better that they don't give up and can recover once external conditions return to normal.
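
As a rough sketch of the capped back-off described above (an illustration only; backoff_delay, BASE_DELAY and MAX_DELAY are hypothetical names, and the 10-second starting delay is an assumed value):

```shell
#!/bin/sh
# Sketch of capped exponential back-off for restart delays.
# Names and values are illustrative assumptions, not ECS settings.
BASE_DELAY=10    # seconds before the first restart (assumed)
MAX_DELAY=900    # cap at 15 minutes, the upper end of the range suggested above

# backoff_delay ATTEMPT: print the delay in seconds before restart number
# ATTEMPT (1-based). The delay doubles per attempt until it hits MAX_DELAY.
backoff_delay() {
  attempt=$1
  delay=$BASE_DELAY
  i=1
  # Stop doubling once the cap is reached, so large attempts cannot overflow.
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$MAX_DELAY" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$MAX_DELAY" ]; then
    delay=$MAX_DELAY
  fi
  echo "$delay"
}
```

With these values the delays run 10, 20, 40, 80, ... and stay at 900 seconds from the eighth attempt onward, so a container keeps retrying through a long outage without ever giving up.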

As @FernandoMiguel said, there should be some way of monitoring for high restart counts, even if only at the Task/Service level. A log entry per restart is handy, particularly if it identifies the specific container that restarted.

Jun 23 '19 03:06 whereisaaron

Hi! Is anybody aware of some workarounds for this one?

Jul 29 '19 13:07 morj

@morj, suggest:

Make your main container just a monitor for the other sidecar ‘non-essential’ containers. If any sidecar container goes down, bail, and ECS will restart the whole task.
Use AWS EKS; it is just a lot more advanced and has all this stuff covered already. I love ECS for Fargate, which has unique advantages, but plain ECS is kinda primitive compared to EKS IMHO.

Jul 29 '19 16:07 whereisaaron

Oct 15 '19 15:10 pparth

This feels like a no-brainer. +1

Oct 16 '19 16:10 rmalleman

Dec 01 '19 04:12 jeswanthamazon

Is there a way to restart the non-essential container without killing the entire task? I have a task with 2 containers, one essential and one non-essential. The essential one is always up and running. I would like to use the ECS API (or some workaround) to restart the non-essential container without killing the task (I want my essential container to still be up and running).

Dec 06 '19 16:12 timurridjanovic

Any plans to implement restarting the non-essential container without killing the entire task?

Jul 03 '20 16:07 lubingfeng

I'll join the chorus and say this would be an immensely useful feature. I would like the restart behavior to be configurable, e.g. a maximum number of restarts and exponential back-off.

Jan 27 '21 16:01 shawnrushefsky

That could save a lot of while true loops in container entrypoints.

May 21 '21 13:05 b1-88er

Is there any further development on this? Coming from a Kubernetes world I thought this would be a given

Jun 07 '21 09:06 c-ameron

Anecdotally: we got bitten by this recently. We have a task definition that consists of one container serving web requests (the "essential" container), and several worker containers that weren't marked essential.

The worker containers crashed because of an intermittent database connectivity issue, and we were left with a frontend serving web requests, but no worker containers to fulfill the backend side of things.

Even a simple "restart=always" style of switch would have made all the difference.

Sep 05 '21 01:09 alexlance

Our team has the same challenge but for essential containers.

Our ECS Fargate Tasks are comprised of a microservice container (insert your use-case here) and several supporting essential sidecars: Datadog agent, Consul Connect agent, and Envoy. If one of those sidecars exits, the task is terminated, causing a sudden decrease in capacity. Restarting the whole task takes 70-120s, during which time our UX latency numbers increase, triggering monitoring alarms.

From experience, Consul Connect and Envoy processes (not containers) take 2-8 seconds to restart. Therefore, we see a clear benefit for ECS to restart a stopped container instead of terminating the task.

The comments above regarding a maximum restart limit are prudent. A container that has a permanent ABEND shouldn't restart forever.

Nov 10 '21 14:11 GordonMcKinney

Feb 23 '22 11:02 ghost

A must have feature!!

Mar 18 '22 16:03 tovbinm

I ended up taking the while true (well actually, until true) approach of making the containers automatically restart themselves:

restart.sh:

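#!/bin/bash
set -euo pipefail

# Wrapper script to ensure a process restarts itself if it exits non-zero.
# The log line uses "$*" so the whole command fills the single "command" field;
# "${@}" would pass each word separately and make printf repeat the format.
until "${@}"; do
  printf '{"level": "FATAL", "timestamp": "%s", "event": "%s", "command": "%s"}\n' "$(date -Iseconds)" "restarting container" "$*"
done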

in the task definition:

{
  "command": [
    "./restart.sh", "some", "command", "goes", "here"
  ]
}

but it would be preferable to be able to control this at the ECS level.

Mar 18 '22 22:03 alexlance

@alexlance We have a similar restart script that handles the following scenarios:

Must restart a process that experiences an unsolicited exit
A restart must pause for no less than a second to prevent an infinite restart loop consuming all the CPU for an incorrectly configured process
Must pass the SIGTERM or SIGQUIT signal to the process
Must not restart the process when it exits due to a SIGTERM or SIGQUIT
Must handle signal/exit race conditions

Having the above logic built into ECS/Fargate is desirable because the lifecycle state is known by ECS and unsolicited process exits are easier to manage with that context.
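
To make the requirements listed above concrete, here is a minimal sketch of how such a wrapper might look. It is not the script referred to in this comment; supervise, forward_signal, stop_requested and child_pid are hypothetical names, and treating exit status 0 as a deliberate finish (so no restart) is an assumption carried over from the until loop earlier in the thread.

```shell
#!/bin/sh
# Illustrative sketch only: restart a command after unsolicited exits,
# forward SIGTERM/SIGQUIT to it, and stop restarting once signalled.
stop_requested=0
child_pid=""

forward_signal() {
  stop_requested=1
  # Forward the signal only if a child is currently running.
  if [ -n "$child_pid" ]; then
    kill -s "$1" "$child_pid" 2>/dev/null || true
  fi
}

# supervise CMD [ARGS...]
supervise() {
  trap 'forward_signal TERM' TERM
  trap 'forward_signal QUIT' QUIT
  status=0
  while [ "$stop_requested" -eq 0 ]; do
    "$@" &
    child_pid=$!
    # Race: a signal may have arrived before child_pid was recorded.
    if [ "$stop_requested" -eq 1 ]; then
      kill -s TERM "$child_pid" 2>/dev/null || true
    fi
    status=0
    wait "$child_pid" || status=$?
    # A trapped signal makes wait return early while the child may still be
    # shutting down, so keep waiting until the child has really gone.
    while [ "$stop_requested" -eq 1 ] && kill -0 "$child_pid" 2>/dev/null; do
      status=0
      wait "$child_pid" || status=$?
    done
    child_pid=""
    # Do not restart after a requested stop or (assumption) a clean exit.
    if [ "$stop_requested" -eq 1 ] || [ "$status" -eq 0 ]; then
      break
    fi
    echo "process exited with status $status; restarting" >&2
    sleep 1   # pause at least one second to avoid a tight restart loop
  done
  return "$status"
}
```

It would be invoked as supervise some command goes here. One caveat: a non-interactive shell starts background jobs with SIGINT and SIGQUIT ignored, so the child only reacts to a forwarded SIGQUIT if it installs its own handler. Even with a wrapper like this, the point above still stands: ECS knows the task lifecycle, whereas the wrapper can only guess at it from signals.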

Mar 20 '22 13:03 GordonMcKinney

Would you care to share?