
Docker randomly gets stuck in a restart loop for a couple of days, and then works again.

julioas09 opened this issue 5 years ago • 6 comments

Running a standalone WS2016 18-node cluster. Without any noticeable change or interaction, Docker stops working on one of the nodes. For some reason the client is not able to connect to the daemon. The daemon seems to be in a restart loop: I can see it running for ~5 minutes, then it disappears for a few seconds and comes up again.


I tried uninstalling and reinstalling Docker (v19.03.5), but afterwards it behaves the same: dockerd keeps restarting continuously.

When trying to run a docker command (docker info / docker ps), it throws one of these errors:

• “error during connect: Get http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.40/containers/json: open //./pipe/docker_engine: The system cannot find the file specified. In the default daemon configuration on Windows, the docker client must be run elevated to connect. This error may also indicate that the docker daemon is not running.”
• “Cannot connect to the Docker daemon at npipe:////./pipe/docker_engine”
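Both errors boil down to the client failing to open the engine's named pipe while dockerd is not listening. As a minimal sketch (not from the original report, and only valid on Windows), the reachability of that pipe can be checked programmatically with the Docker Go SDK; the host string mirrors the pipe named in the error and would need adjusting if dockerd runs with a custom -H option:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/docker/docker/client"
)

func main() {
	// Same pipe as in the error message; adjust for a non-default -H setting.
	cli, err := client.NewClientWithOpts(
		client.WithHost("npipe:////./pipe/docker_engine"),
		client.WithAPIVersionNegotiation(),
	)
	if err != nil {
		fmt.Println("client init failed:", err)
		return
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// While dockerd is down, this is where "open //./pipe/docker_engine:
	// The system cannot find the file specified" surfaces.
	if _, err := cli.Ping(ctx); err != nil {
		fmt.Println("daemon not reachable:", err)
		return
	}
	fmt.Println("daemon is up")
}
```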

After a couple of days, the problem seems to resolve itself without any interaction. This has happened in different environments (with similar configurations), on different nodes, multiple times (at least 3 or 4). We have only seen it since July/August 2019.

On the problematic node, checking the SF/Log/_sf_docker_logs folder, this is what I find repeatedly:

time="2020-01-07T10:53:05.006845100+01:00" level=info msg="Starting up"
time="2020-01-07T10:53:05.008992600+01:00" level=warning msg="[!] DON'T BIND ON ANY IP ADDRESS WITHOUT setting --tlsverify IF YOU DON'T KNOW WHAT YOU'RE DOING [!]"
time="2020-01-07T10:53:05.009763800+01:00" level=debug msg="Listener created for HTTP on tcp (localhost:2375)"
time="2020-01-07T10:53:05.009763800+01:00" level=debug msg="Listener created for HTTP on npipe (//./pipe/docker_engine)"
time="2020-01-07T10:53:05.015762100+01:00" level=info msg="Windows default isolation mode: process"
time="2020-01-07T10:53:05.015762100+01:00" level=debug msg="Stackdump - waiting signal at Global\\stackdump-5952"
time="2020-01-07T10:53:05.016896600+01:00" level=debug msg="Using default logging driver json-file"
time="2020-01-07T10:53:05.016896600+01:00" level=debug msg="[graphdriver] trying provided driver: windowsfilter"
time="2020-01-07T10:53:05.016896600+01:00" level=debug msg="WindowsGraphDriver InitFilter at E:\\docker\\windowsfilter"
time="2020-01-07T10:53:05.016896600+01:00" level=debug msg="Initialized graph driver windowsfilter"
time="2020-01-07T10:53:05.163775300+01:00" level=debug msg="Max Concurrent Downloads: 3"
time="2020-01-07T10:53:05.163775300+01:00" level=debug msg="Max Concurrent Uploads: 5"
time="2020-01-07T10:53:05.163775300+01:00" level=info msg="Loading containers: start."
time="2020-01-07T10:53:05.164770300+01:00" level=error msg="Failed to load container 8b11bb8f0ca8db369b8beb3247bf622d3469e439c230945cd038a16fd910476e: open E:\\docker\\containers\\8b11bb8f0ca8db369b8beb3247bf622d3469e439c230945cd038a16fd910476e\\config.v2.json: The system cannot find the file specified."
time="2020-01-07T10:53:05.164770300+01:00" level=error msg="Failed to load container 013ac8373b7b24122a38bd3c7c460b7ccd3c18786e540c5e12cab9effcb9283a: open E:\\docker\\containers\\013ac8373b7b24122a38bd3c7c460b7ccd3c18786e540c5e12cab9effcb9283a\\config.v2.json: The system cannot find the file specified."
time="2020-01-07T10:53:05.164770300+01:00" level=debug msg="Option Experimental: false"
time="2020-01-07T10:53:05.164770300+01:00" level=debug msg="Option DefaultDriver: nat"
time="2020-01-07T10:53:05.164770300+01:00" level=debug msg="Option DefaultNetwork: nat"
time="2020-01-07T10:53:05.164770300+01:00" level=debug msg="Network Control Plane MTU: 1500"
time="2020-01-07T10:53:05.164770300+01:00" level=info msg="Restoring existing overlay networks from HNS into docker"
time="2020-01-07T10:53:05.164770300+01:00" level=debug msg="[GET]=>[/networks/] Request : "
time="2020-01-07T10:55:05.218994500+01:00" level=debug msg="Network  nat (7080ee0) restored"
time="2020-01-07T10:55:05.237995700+01:00" level=debug msg="[GET]=>[/networks/] Request : "
time="2020-01-07T10:57:05.258193300+01:00" level=debug msg="daemon configured with a 15 seconds minimum shutdown timeout"
time="2020-01-07T10:57:05.258193300+01:00" level=debug msg="start clean shutdown of all containers with a 15 seconds timeout..."

Expected Behavior

Service Fabric properly manages the Docker daemon.

Current Behavior

The Docker daemon gets stuck in a restart loop for ~48h, which eventually resolves itself. During this restart loop, the dockerd process seems to be stopped and restarted by SF (every ~5 min). SF is not able to run any container on that node, and docker commands cannot connect to the daemon.
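To capture the ~5 min up/down cycle, one option is to poll the engine pipe and log the transitions. A minimal sketch, assuming the default \\.\pipe\docker_engine pipe and using github.com/Microsoft/go-winio to dial it:

```go
package main

import (
	"fmt"
	"time"

	winio "github.com/Microsoft/go-winio"
)

func main() {
	const pipe = `\\.\pipe\docker_engine` // default engine pipe; adjust if dockerd uses another name
	timeout := 3 * time.Second

	lastUp := false
	for {
		conn, err := winio.DialPipe(pipe, &timeout)
		up := err == nil
		if up {
			conn.Close()
		}
		// Log only state transitions, so the ~5 min restart cycle shows up clearly.
		if up != lastUp {
			if up {
				fmt.Printf("%s daemon pipe reachable\n", time.Now().Format(time.RFC3339))
			} else {
				fmt.Printf("%s daemon pipe unreachable: %v\n", time.Now().Format(time.RFC3339), err)
			}
			lastUp = up
		}
		time.Sleep(10 * time.Second)
	}
}
```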

Context (Environment)

Production Standalone 18 node cluster

Docker version

19.03.5

Service Fabric Runtime:

6.5.664.9590

Operating System :

Windows Server 2016 (Version 1607 OS Build 14393.3181)

Cluster Size :

18 nodes

Possible Workaround

Wait a couple of days and it seems to resolve itself without any interaction, but the problem may resurface later.

julioas09 avatar Jan 07 '20 12:01 julioas09

When docker is restarting continuously, or you notice it not working when you issue commands, are all the apps down on that particular node? Or are the container apps up and you just can't issue commands to Docker?

Curious if it can't process commands as opposed to just being completely gone.

aryamsft avatar Feb 12 '20 18:02 aryamsft

I've been investigating this problem and I've found a memory leak. A fix is currently in the works.

aryamsft avatar Mar 02 '20 20:03 aryamsft

So this is directly related to issue 1594 then? If that is the case, I understand it will be addressed in version 7.1?

To answer your previous question, all the apps go down when this happens.

julioas09 avatar Mar 09 '20 14:03 julioas09

No, I don't think microsoft/service-fabric-issues#1594 is related; I accidentally commented above.

aryamsft avatar Mar 09 '20 17:03 aryamsft

@julioas09 @aryamsft is this fixed? It's been open for over 2 years

levimatheri avatar Jul 16 '22 04:07 levimatheri

I'm no longer working with the customer who experienced this problem, so I can't really tell.

julioas09 avatar Jul 18 '22 18:07 julioas09