
Node shutdown, how to know the reason. (docker logs -f + docker service ps are not enough)

Open stylixboom opened this issue 8 years ago • 4 comments

I have a service running at a scale of 4 across 4 VMs.

Problem: At some point, a node's container shuts down for a reason I cannot determine.

I tried to look at the logs of the exited containers using

$ docker logs EXITED_CONTAINER_ID

and

$ docker service ps SERVICE_NAME

Both show output that looks quite normal; nothing appears to be the cause of the problem. In particular, docker service ps shows no ERROR at all for the shut-down task.

BTW, swarm works well at maintaining the scale after that.

Question: How can I see more logs, or is there a better way to figure out what actually happened, e.g. a swarm log?

stylixboom avatar Aug 24 '16 23:08 stylixboom

Docker swarm mode runs on top of VMs or physical machines. It does not track the reason for a VM failure or reboot; it tracks connectivity changes.

ubuntu@ip-172-19-241-146:~$ docker node inspect aoera24sppzujb5og1p0sivif | grep At
        "CreatedAt": "2016-07-30T01:37:07.514602651Z",
        "UpdatedAt": "2016-08-25T00:50:56.884636186Z",

The manager's Docker daemon log may contain connection-failure entries, depending on your log level.

Aug 25 00:49:55 ip-172-19-241-146 dockerd[16953]: time="2016-08-25T00:49:55.937527806Z" level=debug msg="2016/08/25 00:49:55 [DEBUG] memberlist: Failed UDP ping: ip-172-19-241-144 (timeout reached)\n"
Aug 25 00:49:56 ip-172-19-241-146 dockerd[16953]: time="2016-08-25T00:49:56.440107428Z" level=info msg="2016/08/25 00:49:56 [INFO] memberlist: Suspect ip-172-19-241-144 has failed, no acks received\n"
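Note that memberlist entries like the first one above only appear when the daemon runs at debug level. One way to enable that, assuming the standard /etc/docker/daemon.json location (the daemon must be restarted afterwards):

```json
{
  "debug": true
}
```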

For system failure reason, you may look at your OS system's log. For example on Ubuntu,

ubuntu@ip-172-19-241-144:~$ last -5 --system
ubuntu   pts/1        50.233.46.103    Thu Aug 25 00:53   still logged in
ubuntu   pts/0        50.233.46.103    Thu Aug 25 00:52   still logged in
runlevel (to lvl 5)   4.2.0-22-generic Thu Aug 25 00:50   still running
reboot   system boot  4.2.0-22-generic Thu Aug 25 00:50   still running
ubuntu   pts/0        50.233.46.103    Thu Aug 25 00:49 - 00:49  (00:00)

dongluochen avatar Aug 25 '16 01:08 dongluochen

Thank you, that's useful for investigating at the system level and node level. Sorry I didn't explain my problem well: it is at the container level.

BTW, I followed your suggested docker node inspect command, and all the nodes seem to have no problem. It's also not a VM problem, since all the VMs are running fine.

Since swarm knows the container went down (but for what reason?), it was able to spin up another. So I just want to know whether I can see more logs about how swarm handled the shut-down container. Does swarm know more about that container than $ docker logs ... shows?

stylixboom avatar Aug 25 '16 01:08 stylixboom

By default swarm mode keeps the last 5 task instances per slot in its history. You can see the failure history using docker service ps SERVICE.

ubuntu@ip-172-19-241-145:~$ docker service ps ctest
ID                         NAME         IMAGE    NODE               DESIRED STATE  CURRENT STATE           ERROR
26jwyd17jz4dyg08i1j22ixgh  ctest.1      busybox  ip-172-19-241-145  Running        Running 25 seconds ago
d0pui02aqwyosz5wwyeegewwx   \_ ctest.1  busybox  ip-172-19-241-146  Shutdown       Failed 35 seconds ago   "task: non-zero exit (1)"
7ws8y9jzoyrac077f281z7r7q   \_ ctest.1  busybox  ip-172-19-241-146  Shutdown       Failed 36 seconds ago   "task: non-zero exit (1)"

It shows the node where each task ran. You can go to that node, find the task's container with docker ps -a | grep TASKID, and then read its output with docker logs CONTAINERID.
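The grep step works because swarm names a task's container SERVICE.SLOT.TASKID. A minimal sketch of extracting the container ID, using a made-up sample line (the container ID c0ffee123456 is purely illustrative; the task ID is taken from the listing above):

```shell
# Hypothetical one-line sample of `docker ps -a` output on the node.
# Swarm container names have the form SERVICE.SLOT.TASKID.
sample='c0ffee123456   busybox:latest   "sh"   Exited (1) 2 hours ago   ctest.1.d0pui02aqwyosz5wwyeegewwx'

# The task ID as reported by `docker service ps ctest`:
task_id='d0pui02aqwyosz5wwyeegewwx'

# Match on the task ID and take the first column (the container ID):
echo "$sample" | grep "$task_id" | awk '{print $1}'   # c0ffee123456
```

On the real node this becomes docker ps -a | grep TASKID followed by docker logs CONTAINERID. If the task has already been rotated out of the 5-entry history, the limit can be raised with docker swarm update --task-history-limit N.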

dongluochen avatar Aug 25 '16 17:08 dongluochen

I have the same problem. For one service, the container is killed very often, and I cannot find the reason. Probably either the health check failed or the defined memory limit was reached. But how can I find the reason? Where is that logged? I can only fix the problem if I know what the problem was… :-(

marc@dock02:~$ docker service ps mrw-cloud_nextcloud 
ID                  NAME                        IMAGE                          NODE                DESIRED STATE       CURRENT STATE             ERROR               PORTS
srcag4zd3ea9        mrw-cloud_nextcloud.1       mwaeckerlin/nextcloud:latest   dock03              Running             Running 2 minutes ago                         
5fm26o2mu248         \_ mrw-cloud_nextcloud.1   mwaeckerlin/nextcloud:latest   dock01              Shutdown            Shutdown 16 minutes ago                       
ys1j9d830zz3         \_ mrw-cloud_nextcloud.1   mwaeckerlin/nextcloud:latest   dock03              Shutdown            Shutdown 2 hours ago                          
adskaxvt0c8m         \_ mrw-cloud_nextcloud.1   mwaeckerlin/nextcloud:latest   dock03              Shutdown            Shutdown 7 hours ago                          
i2h80ytnt1dd         \_ mrw-cloud_nextcloud.1   mwaeckerlin/nextcloud:latest   dock01              Shutdown            Shutdown 8 hours ago                          

And if I look into the container's task, I see:

…
        "Status": {
            "Timestamp": "2021-02-12T16:38:09.155983108Z",
            "State": "shutdown",
            "Message": "shutdown",
            "ContainerStatus": {
                "ContainerID": "95861e1606ba23c9a0ca0fc0d02cbe149d1e8a46b6aaa08c8c226549daea4450",
                "PID": 0,
                "ExitCode": 137
            },
            "PortStatus": {}
        },
        "DesiredState": "shutdown",
…

Where can I find the reason for the termination?!?

Was it the health-check that failed?

Was the memory exhausted (I limit the memory to 3GB)?
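The exit code 137 shown above already narrows this down: exit codes above 128 mean the main process died from a signal, with signal = exit code − 128, so 137 is signal 9 (SIGKILL). Both causes you suspect can end this way: the kernel OOM killer sends SIGKILL directly, and a task stopped after a failed health check is SIGKILLed if it ignores the initial SIGTERM until the stop timeout expires. A quick sanity check of the arithmetic:

```shell
# Exit codes above 128 encode a fatal signal: signal = exit_code - 128.
exit_code=137
signal=$((exit_code - 128))
echo "$signal"       # 9
kill -l "$signal"    # KILL, i.e. SIGKILL
```

To tell the two causes apart, docker inspect --format '{{.State.OOMKilled}}' CONTAINERID on the node reports whether the kernel OOM-killed the container (true only in the memory case), and the kernel log (dmesg or the journal) records OOM kills explicitly.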

mwaeckerlin avatar Feb 12 '21 16:02 mwaeckerlin