swarmkit
Node shutdown: how to know the reason (docker logs -f + docker service ps are not enough)
I have a service running with a scale of 4 across 4 VMs.
Problem: At some point, a node (container) shut down for a reason I do not know.
I tried to look at the logs of such exited containers using
$ docker logs EXITED_CONTAINER_ID
# and
$ docker service ps SERVICE_NAME
Both show output that looks quite normal; nothing seems to be the cause of the problem. In particular, docker service ps shows no ERROR at all for that shut-down node.
By the way, swarm maintains the scale well after that.
Question: How can I see more logs, or is there a better way to figure out what actually happened, e.g. a swarm log?
Docker swarm mode runs on top of VMs or physical machines. It does not track the reason for a VM failure/reboot; it only tracks the connectivity change. You can see when a node's state last changed via docker node inspect:
ubuntu@ip-172-19-241-146:~$ docker node inspect aoera24sppzujb5og1p0sivif | grep At
"CreatedAt": "2016-07-30T01:37:07.514602651Z",
"UpdatedAt": "2016-08-25T00:50:56.884636186Z",
The manager's dockerd log may contain connection-failure messages, depending on your debug level:
Aug 25 00:49:55 ip-172-19-241-146 dockerd[16953]: time="2016-08-25T00:49:55.937527806Z" level=debug msg="2016/08/25 00:49:55 [DEBUG] memberlist: Failed UDP ping: ip-172-19-241-144 (timeout reached)\n"
Aug 25 00:49:56 ip-172-19-241-146 dockerd[16953]: time="2016-08-25T00:49:56.440107428Z" level=info msg="2016/08/25 00:49:56 [INFO] memberlist: Suspect ip-172-19-241-144 has failed, no acks received\n"
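These memberlist messages only show up at debug level. One way to enable it (assuming a systemd-managed daemon; the file may need to be created first) is to set "debug": true in /etc/docker/daemon.json and restart dockerd:
$ cat /etc/docker/daemon.json
{
  "debug": true
}
$ sudo systemctl restart docker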
For the system-level failure reason, you can look at your OS's system log. For example, on Ubuntu:
ubuntu@ip-172-19-241-144:~$ last -5 --system
ubuntu pts/1 50.233.46.103 Thu Aug 25 00:53 still logged in
ubuntu pts/0 50.233.46.103 Thu Aug 25 00:52 still logged in
runlevel (to lvl 5) 4.2.0-22-generic Thu Aug 25 00:50 still running
reboot system boot 4.2.0-22-generic Thu Aug 25 00:50 still running
ubuntu pts/0 50.233.46.103 Thu Aug 25 00:49 - 00:49 (00:00)
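On systemd-based distributions, the journal of the previous boot can also show whether the machine went down cleanly (this assumes persistent journaling is enabled):
$ journalctl -b -1 -e
# -b -1 selects the previous boot, -e jumps to the end of it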
Thank you, that's useful for investigating at the system level and node level. Sorry I didn't explain well; my problem is at the container level.
By the way, I followed your suggested docker node inspect command, and it seems all the nodes have no problem.
Also, it's not a VM problem, since all of them are running fine.
Since swarm knows the container was down (but for what reason?), it can spin up another one.
So I just want to know whether I can see more logs about how swarm handles the shut-down container. Does swarm know more about that container than $ docker logs ... shows?
By default, swarm mode keeps 5 copies of a task instance in its history. You can see the failure history using docker service ps SERVICE.
ubuntu@ip-172-19-241-145:~$ docker service ps ctest
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
26jwyd17jz4dyg08i1j22ixgh ctest.1 busybox ip-172-19-241-145 Running Running 25 seconds ago
d0pui02aqwyosz5wwyeegewwx \_ ctest.1 busybox ip-172-19-241-146 Shutdown Failed 35 seconds ago "task: non-zero exit (1)"
7ws8y9jzoyrac077f281z7r7q \_ ctest.1 busybox ip-172-19-241-146 Shutdown Failed 36 seconds ago "task: non-zero exit (1)"
It shows the node where each task ran. You can go to that node, find the task's container with docker ps -a | grep TASKID, and then run docker logs CONTAINERID.
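Putting that together with the sample output above (the IDs are taken from it; swarm names a task's container SERVICE.SLOT.TASKID, so the grep should turn up exactly one container):
# on ip-172-19-241-146, the node where the failed task ran
$ docker ps -a | grep d0pui02aqwyosz5wwyeegewwx
$ docker logs ctest.1.d0pui02aqwyosz5wwyeegewwx
# on a manager, inspecting the task itself also shows the recorded failure status
$ docker inspect d0pui02aqwyosz5wwyeegewwx
The length of the history itself can be changed with docker swarm update --task-history-limit N on a manager.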
I have the same problem. For one service, the container is killed very often, and I cannot find the reason. Probably either the health check failed or the defined maximum memory was reached. But how can I find the reason; where is that logged? I can only fix the problem if I know what the problem was… :-(
marc@dock02:~$ docker service ps mrw-cloud_nextcloud
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
srcag4zd3ea9 mrw-cloud_nextcloud.1 mwaeckerlin/nextcloud:latest dock03 Running Running 2 minutes ago
5fm26o2mu248 \_ mrw-cloud_nextcloud.1 mwaeckerlin/nextcloud:latest dock01 Shutdown Shutdown 16 minutes ago
ys1j9d830zz3 \_ mrw-cloud_nextcloud.1 mwaeckerlin/nextcloud:latest dock03 Shutdown Shutdown 2 hours ago
adskaxvt0c8m \_ mrw-cloud_nextcloud.1 mwaeckerlin/nextcloud:latest dock03 Shutdown Shutdown 7 hours ago
i2h80ytnt1dd \_ mrw-cloud_nextcloud.1 mwaeckerlin/nextcloud:latest dock01 Shutdown Shutdown 8 hours ago
And if I look into the container, I see:
…
"Status": {
"Timestamp": "2021-02-12T16:38:09.155983108Z",
"State": "shutdown",
"Message": "shutdown",
"ContainerStatus": {
"ContainerID": "95861e1606ba23c9a0ca0fc0d02cbe149d1e8a46b6aaa08c8c226549daea4450",
"PID": 0,
"ExitCode": 137
},
"PortStatus": {}
},
"DesiredState": "shutdown",
…
Where can I find the reason for the termination?!?
Was it the health-check that failed?
Was the memory exhausted (I limit the memory to 3GB)?
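A note on the exit code: 137 means the process died from SIGKILL (128 + 9), which is consistent with both suspects, since the OOM killer and a stop after a failed health check can both end in a kill. If the container still exists on the node, two checks may narrow it down:
$ docker inspect 95861e1606ba --format '{{.State.OOMKilled}}'   # true if the kernel/cgroup OOM killer ended it
$ journalctl -k | grep -i oom                                   # kernel log entries left by the OOM killer
If OOMKilled is false and the kernel log is clean, a failed health check becomes the more likely cause; with a HEALTHCHECK defined, docker inspect on the container also records recent probe results under .State.Health.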