docker-autoheal

Respect manual stopped containers

Open christian-weiss opened this issue 5 years ago • 4 comments

First of all: thanks for this great little helper.

I love the fact that autoheal restarts unhealthy containers, and died containers which were previously unhealthy, if they have the proper label. And I love the fact that it does not touch other stopped containers.

In many situations a restart is a good way to heal a service, but sometimes it ends up in a crashloop forever. In those situations an admin may want to stop the container temporarily and keep it for further investigation. In that case you do not want autoheal to start the container again. Autoheal should respect manually stopped containers. An admin may want to stop the crashloop to prevent flooding the central log server with all the restart logs. There are valid reasons to keep the container (but stopped), e.g. to investigate via docker logs / docker container diff / docker inspect. The investigation may be postponed to a later point in time.

If I catch the crashlooping container in the "health: starting" or "healthy" phase, then it will stay stopped. But if I am a little late in a crashloop (unhealthy), then the container will be restarted due to the unhealthy status, even if I stopped it manually via docker container stop. Another admin may not be aware of autoheal and may wonder why the container comes back. Principle of least surprise.

I am not sure if it is possible for autoheal to distinguish between a died container and a manually stopped container. If not, then maybe a crashloop counter, a crashloop threshold and a crashloop threshold period (stored in autoheal) could help to detect crashloops and then stop restarting. Alternative: track the last 5 periods between restarts and, if they are equal +/- 5 seconds, stop restarting. All these parameters should be configurable as ENV for the autoheal container (general setting) and as ENV for labeled / tracked containers.
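
To make the two ideas above concrete, here is a rough sketch in Python. It is not autoheal code: the class name, function name, and default values are all hypothetical, and "equal +/- 5 seconds" is interpreted as "every gap is within 5 seconds of the first gap", which is only one possible reading.

```python
from collections import deque


class CrashLoopDetector:
    """Counts restarts inside a sliding time window (hypothetical helper)."""

    def __init__(self, threshold=5, period_seconds=300.0):
        # Assumed defaults; the proposal is to make both configurable via ENV.
        self.threshold = threshold
        self.period_seconds = period_seconds
        self._restarts = deque()

    def record_restart(self, timestamp):
        """Store a restart time (seconds) and drop entries older than the window."""
        self._restarts.append(timestamp)
        while timestamp - self._restarts[0] > self.period_seconds:
            self._restarts.popleft()

    def should_stop_restarting(self):
        """True once the window holds at least `threshold` restarts."""
        return len(self._restarts) >= self.threshold


def intervals_look_periodic(restart_times, samples=5, tolerance_seconds=5.0):
    """Alternative check: the last `samples` gaps between restarts are
    all within `tolerance_seconds` of the first of those gaps."""
    if len(restart_times) < samples + 1:
        return False  # not enough history to judge
    recent = restart_times[-(samples + 1):]
    gaps = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(abs(gap - gaps[0]) <= tolerance_seconds for gap in gaps)
```

With this sketch, autoheal would call record_restart each time it restarts a tracked container and skip further restarts while should_stop_restarting returns True.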

christian-weiss avatar Jan 10 '19 22:01 christian-weiss

Thanks for writing this up. It used to respect manual stops, sounds like a regression might have happened, or a new bug on the docker side. Creating a test case would be in order to resolve this.

Check out: https://docs.docker.com/engine/api/v1.25/#operation/ContainerList Maybe we need to add an additional filter for the status to ensure manually stopped containers are not included. The line you'd want to play with is here: https://github.com/willfarrell/docker-autoheal/blob/master/docker-entrypoint#L35
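
As an illustration of what such an extra filter might look like, here is a small Python sketch that builds the ContainerList query string with a status filter next to the health filter. The function name and the autoheal=true label value are assumptions for the example, not taken from the entrypoint script, and whether status=running is enough to exclude manually stopped containers would still need the test case mentioned above.

```python
import json
from urllib.parse import urlencode


def build_container_list_query(label="autoheal=true", only_running=True):
    """Build a path and query for GET /containers/json (hypothetical helper).

    The Docker Engine API expects `filters` as a JSON-encoded map of
    filter name to a list of values.
    """
    filters = {
        "health": ["unhealthy"],  # only containers failing their healthcheck
        "label": [label],         # assumed label; adjust to the real one
    }
    if only_running:
        # The additional status filter: skip exited (e.g. manually stopped) containers.
        filters["status"] = ["running"]
    return "/containers/json?" + urlencode({"filters": json.dumps(filters)})


print(build_container_list_query())
```

The resulting path could then be requested over the Docker socket in the same way the existing container list call is made.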

Would you like to take a stab at this?

I thought about the back-off timer back when I wrote it, but felt it was too complex and, based on feedback, decided not to include it. The general feeling was that it should keep spitting out logs and raising alarms till it's resolved in some way.

willfarrell avatar Jan 11 '19 06:01 willfarrell

This issue also appears to cause autoheal not to play nicely if it is one of many containers in a docker-compose project.

Sporadically when running docker-compose up --force-recreate, I get errors like this, which seem to indicate that autoheal is restarting containers that docker-compose has stopped and is about to remove, which rightfully confuses docker-compose.

>docker-compose up -d --force-recreate my-service
Recreating my-project_my-service_1 ... error

ERROR: for my-project_my-service_1  You cannot remove a running container 208f7eeac16315bcb74682384890a57f7201e4babe9cd40a9064df241eba51e1. Stop the container before attempting removal or force remove

ERROR: for my-service  You cannot remove a running container 208f7eeac16315bcb74682384890a57f7201e4babe9cd40a9064df241eba51e1. Stop the container before attempting removal or force remove
ERROR: Encountered errors while bringing up the project.

RobinsonWM avatar Aug 29 '19 19:08 RobinsonWM

@willfarrell I confirm that this is still happening on the latest version. Actually it's easy to reproduce:

  1. Create a container that fails every healthcheck
  2. Set the healthcheck interval to 1s and autoheal.stop.timeout to 1
  3. Rebuild the container (I'm doing a docker-compose up --build -d) and the error message above usually occurs

Do you have any idea if this problem can be solved?

UPDATE: I actually ended up running two instances of my container in parallel which is not very fortunate:

docker-compose ps:

               Name                             Command                       State           Ports
---------------------------------------------------------------------------------------------------
1db971c6f171_test-unhealthy_node_1   docker-entrypoint.sh node  ...   Up (health: starting)
test-unhealthy_node_1                docker-entrypoint.sh node  ...   Up (unhealthy)

After this I can't seem to be able to stop / remove my containers at all. Running docker-compose down fails and both containers are still running indefinitely.

adams-family avatar Aug 08 '20 17:08 adams-family

I created a GIF recording to illustrate the problem. The only way I can stop my container is to kill autoheal first:

[GIF: docker autoheal problem]

adams-family avatar Aug 08 '20 17:08 adams-family