[ECS] Feature Request: Auto restart non-essential container (aka local container restart policy)

Open fenxiong opened this issue 5 years ago • 36 comments

Currently, automatically restarting a non-essential container in a task is not supported. In some use cases, although a container doesn't need to be running for a task to run (and is therefore marked non-essential), it's desirable to try to keep it running by restarting it. The current workaround is to schedule a separate service for the non-essential container with the daemon scheduling strategy, but this decouples the container from the other containers in the task, so it's not very desirable.

fenxiong avatar Jan 09 '19 17:01 fenxiong

+1

aledelgo avatar Jan 16 '19 11:01 aledelgo

Yeah, if a monitoring container goes down, I don't want to fail the whole task, I just want the monitor to come back.

wontonst avatar Apr 29 '19 22:04 wontonst

Need this feature desperately...

docker run already supports this with the --restart unless-stopped option, but I cannot implement this using ECS. Any ideas?

devaroop avatar May 22 '19 09:05 devaroop

This feature is absolutely needed

nilroy avatar Jun 13 '19 12:06 nilroy

Thanks for your feedback everyone. Questions for the group: what would your desired behavior be for a container that fails repeatedly? Would you need to have an optional parameter for controlling the maximum number of restarts? Exponential backoff? Something else? Do you need notifications or logging that this is happening?

coultn avatar Jun 20 '19 22:06 coultn

@coultn from my experience, ECS already does some exponential backoff, doesn't it? I would love a clear indication in the dashboard that a task is restarting (often) rather than it saying running. Maybe another column with the number of restarts since the last deployment?

Also a CloudWatch metric for healthiness, i.e. how long it has been up vs how long it has been deployed.

FernandoMiguel avatar Jun 21 '19 06:06 FernandoMiguel

@coultn I always liked the way a similar concept is implemented in the restart strategy of Erlang supervisors. The gist of it is to tolerate a given maximum number of restarts in a given period of time. If the container is constantly restarting, it means something is wrong, but an occasional restart of a non-essential process would not affect the rest of the system. It would also be nice to support this feature in CloudFormation templates.
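One way to read the "maximum number of restarts in a given period" rule, as a rough sketch in shell. The names `MAX_RESTARTS`, `PERIOD_SECONDS`, `record_restart`, and `supervise_with_budget` are made up for illustration; this is a wrapper-script approximation, not an ECS feature and not Erlang's actual implementation.

```bash
#!/bin/bash
# Hypothetical limits (not ECS settings): give up once more than
# MAX_RESTARTS restarts have happened within PERIOD_SECONDS.
MAX_RESTARTS="${MAX_RESTARTS:-5}"
PERIOD_SECONDS="${PERIOD_SECONDS:-60}"
restart_times=""   # space-separated epoch seconds of recent restarts

# record_restart NOW
# Records a restart at time NOW, forgets restarts older than the window,
# and returns 0 while the restart budget holds, 1 once it is exceeded.
record_restart() {
  local now="$1" kept="" count=0 t
  for t in $restart_times; do
    if [ $(( now - t )) -lt "$PERIOD_SECONDS" ]; then
      kept="$kept $t"
      count=$(( count + 1 ))
    fi
  done
  restart_times="$kept $now"
  count=$(( count + 1 ))
  [ "$count" -le "$MAX_RESTARTS" ]
}

# Re-run the command until it exits zero or the restart budget is spent.
supervise_with_budget() {
  until "$@"; do
    if ! record_restart "$(date +%s)"; then
      echo "more than ${MAX_RESTARTS} restarts in ${PERIOD_SECONDS}s; giving up" >&2
      return 1
    fi
    sleep 1
  done
}
```

An occasional crash stays inside the budget and is restarted quietly, while a crash loop exhausts it and surfaces as a failure of the wrapper itself.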

mehdi-abbad avatar Jun 22 '19 17:06 mehdi-abbad

@coultn I think exponential back-off with no maximum restarts, the usual k8s approach. The back-off would increase until the period is long enough that continuous restarts have negligible impact on ECS/Fargate, maybe 10-15m? The containers may well be failing from something external beyond their control, e.g. a 2 hour us-east-1 outage 😉 so it is better they don't give up and recover after external conditions return to normal.
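A minimal sketch of capped exponential back-off with no restart limit, assuming a 1 second starting delay and a 900 second (15 minute) cap taken from the upper end of the range suggested above. `BASE_DELAY`, `MAX_DELAY`, `backoff_delay`, and `run_with_backoff` are illustrative names, not ECS or Kubernetes settings.

```bash
#!/bin/bash
# Assumed values for illustration only.
BASE_DELAY="${BASE_DELAY:-1}"     # seconds before the first restart
MAX_DELAY="${MAX_DELAY:-900}"     # cap on the delay (15 minutes)

# backoff_delay ATTEMPT
# Prints BASE_DELAY doubled ATTEMPT times, capped at MAX_DELAY.
backoff_delay() {
  local attempt="$1" delay="$BASE_DELAY" i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$MAX_DELAY" ]; do
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
  if [ "$delay" -gt "$MAX_DELAY" ]; then
    delay="$MAX_DELAY"
  fi
  echo "$delay"
}

# Re-run the command forever until it exits zero, waiting longer each time.
run_with_backoff() {
  local attempt=0 delay
  until "$@"; do
    delay="$(backoff_delay "$attempt")"
    echo "exited non-zero; restarting in ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}
```

With these values the delays run 1, 2, 4, ... 512 seconds and then stay at 900, so a container stuck behind a long outage keeps retrying at a low, steady rate. The sketch never resets the attempt counter after a healthy run, which a real implementation would probably want.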

As @FernandoMiguel said, there should be some way of monitoring for high restart counts, even if only at the Task/Service level. A log entry per restart is handy, particularly if it identifies the specific container that restarted.

whereisaaron avatar Jun 23 '19 03:06 whereisaaron

Hi! Is anybody aware of some workarounds for this one?

morj avatar Jul 29 '19 13:07 morj

@morj, suggest:

  1. Make your main container just a monitor for the other sidecar ‘non-essential’ containers. If any sidecar container goes down, bail, and ECS will restart the whole task.
  2. Use AWS EKS, it is just a lot more advanced and has all this stuff covered already. I love ECS for Fargate, which has unique advantages, but plain ECS is kinda primitive compared to EKS IMHO.

whereisaaron avatar Jul 29 '19 16:07 whereisaaron

+1

pparth avatar Oct 15 '19 15:10 pparth

This feels like a no-brainer. +1

rmalleman avatar Oct 16 '19 16:10 rmalleman

+1

jeswanthamazon avatar Dec 01 '19 04:12 jeswanthamazon

Is there a way to restart the non-essential container without killing the entire task? I have a task with 2 containers, one essential and one non-essential. The essential one is always up and running. I would like to use the ECS API (or some workaround) to restart the non-essential container without killing the task (I want my essential container to still be up and running).

timurridjanovic avatar Dec 06 '19 16:12 timurridjanovic

Any plans to implement restarting the non-essential container without killing the entire task?

lubingfeng avatar Jul 03 '20 16:07 lubingfeng

I'll join the chorus and say this would be an immensely useful feature. I would like the restarting behavior to be configurable, such as a max number of restarts and exponential back-off.

shawnrushefsky avatar Jan 27 '21 16:01 shawnrushefsky

That could save a lot of while true loops in container entrypoints.

b1-88er avatar May 21 '21 13:05 b1-88er

Is there any further development on this? Coming from a Kubernetes world I thought this would be a given

c-ameron avatar Jun 07 '21 09:06 c-ameron

Anecdotally: we got bitten by this recently. We have a task definition that comprises one container serving web requests (the "essential" container) and several worker containers that weren't marked essential.

The worker containers crashed because of an intermittent database connectivity issue, and we were left with a frontend serving web requests, but no worker containers to fulfill the backend side of things.

Even a simple "restart=always" style of switch would have made all the difference.

alexlance avatar Sep 05 '21 01:09 alexlance

Our team has the same challenge but for essential containers.

Our ECS Fargate Tasks are comprised of a microservice container (insert your use-case here) and several supporting essential sidecars: Datadog agent, Consul Connect agent, and Envoy. If one of those sidecars exits, the task is terminated, causing a sudden decrease in capacity. Restarting the whole task takes 70-120s, during which time our UX latency numbers increase, triggering monitoring alarms.

From experience, Consul Connect and Envoy processes (not containers) take 2-8 seconds to restart. Therefore, we see a clear benefit for ECS to restart a stopped container instead of terminating the task.

The comments above regarding a maximum restart limit are prudent. A container that has a permanent ABEND shouldn't restart forever.

GordonMcKinney avatar Nov 10 '21 14:11 GordonMcKinney

+1

ghost avatar Feb 23 '22 11:02 ghost

A must have feature!!

tovbinm avatar Mar 18 '22 16:03 tovbinm

I ended up taking the while true (well actually, until true) approach of making the containers automatically restart themselves:

restart.sh:

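#!/bin/bash
set -euo pipefail

# Wrapper script: re-run the given command until it exits zero
until "${@}"; do
  # "${*}" joins the command into one string so printf gets exactly three
  # arguments; the trailing \n keeps each log entry on its own line
  printf '{"level": "FATAL", "timestamp": "%s", "event": "%s", "command": "%s"}\n' "$(date -Iseconds)" "restarting container" "${*}"
done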

in the task definition:

{
  "command": [
    "./restart.sh", "some", "command", "goes", "here"
  ]
}

but it would be preferable to be able to control this at the ECS level.

alexlance avatar Mar 18 '22 22:03 alexlance

@alexlance We have a similar restart script that handles the following scenarios:

  • Must restart a process that experiences an unsolicited exit
  • A restart must pause for no less than a second to prevent an infinite restart loop consuming all the CPU for an incorrectly configured process
  • Must pass the SIGTERM or SIGQUIT signal to the process
  • Must not restart the process when it exits due to a SIGTERM or SIGQUIT
  • Must handle signal/exit race conditions

Having the above logic built into ECS/Fargate is desirable because the lifecycle state is known by ECS and unsolicited process exits are easier to manage with that context.
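For readers looking for a starting point, the scenarios listed above could be approximated along these lines. This is a sketch under stated assumptions, not the script described in this comment: `supervise`, `on_signal`, `RESTART_PAUSE`, `stop_signal`, and `child_pid` are all invented names, and the race handling covers only the obvious windows.

```bash
#!/bin/bash
# Sketch only; hypothetical names throughout.
RESTART_PAUSE="${RESTART_PAUSE:-1}"   # seconds to pause between restarts
stop_signal=""                        # TERM or QUIT once a stop was requested
child_pid=""                          # PID of the running child, if any

# Remember the stop request and pass the signal on to the running child.
# Note: non-interactive shells start background commands with SIGQUIT
# ignored, so the child only reacts to a forwarded SIGQUIT if it installs
# its own handler.
on_signal() {
  stop_signal="$1"
  if [ -n "$child_pid" ]; then
    kill -s "$stop_signal" "$child_pid" 2>/dev/null || true
  fi
}

supervise() {
  local status=0
  trap 'on_signal TERM' TERM
  trap 'on_signal QUIT' QUIT
  while [ -z "$stop_signal" ]; do
    "$@" &
    child_pid=$!
    # Race: a signal that arrived before child_pid was set has not been
    # forwarded yet, so forward it now (a duplicate delivery is harmless).
    if [ -n "$stop_signal" ]; then
      kill -s "$stop_signal" "$child_pid" 2>/dev/null || true
    fi
    # wait returns early with a status above 128 when a trapped signal
    # arrives, so keep waiting until the child has really exited.
    while true; do
      status=0
      wait "$child_pid" || status=$?
      if [ "$status" -le 128 ] || ! kill -0 "$child_pid" 2>/dev/null; then
        break
      fi
    done
    child_pid=""
    # A requested stop ends the loop; any other exit is unsolicited.
    if [ -n "$stop_signal" ]; then
      break
    fi
    echo "process exited with status ${status}; restarting in ${RESTART_PAUSE}s" >&2
    sleep "$RESTART_PAUSE"
  done
  return "$status"
}
```

Adding `supervise "$@"` as the last line would turn this into an entrypoint wrapper used the same way as restart.sh. Unlike restart.sh, it restarts on any unsolicited exit, including exit status zero.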

GordonMcKinney avatar Mar 20 '22 13:03 GordonMcKinney

Would you care to share?

cwhyland-jetty avatar Mar 10 '23 21:03 cwhyland-jetty

Company confidential. I have to jump through several hoops to share it. Plus it's written in bash instead of Go... nobody wants to see that :-)

GordonMcKinney avatar Mar 10 '23 22:03 GordonMcKinney

+1

pp-assis avatar May 23 '23 18:05 pp-assis

+22

Menion93 avatar Jun 16 '23 09:06 Menion93

++1

henryagbo avatar Jun 22 '23 11:06 henryagbo

+1

savroman avatar Jul 19 '23 16:07 savroman