docker service logs RPC error from 2 of 12 nodes after network failure in swarm mode
Description: We have a Docker swarm mode cluster with 12 servers, all running Docker 17.06. Tonight our data center had some network problems and some servers were offline for a few minutes. They are all back online and healthy now.
docker node ls shows them all as ready and active, and they have also received some tasks, which are running. We can also start new tasks on them without problems.
However, we cannot receive logs anymore from 2 of the 12 nodes.
docker service logs --tail=1000 testservice
Shows:
error from daemon in stream: Error grabbing logs: rpc error: code = 2 desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node 1yjkb8d3oh2tc8hrup7641mau is not available
What can we do?
Output of docker version:
Client:
Version: 17.06.0-ce
API version: 1.30
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:23:31 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.0-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:19:04 2017
OS/Arch: linux/amd64
Experimental: false
Additional environment details (AWS, VirtualBox, physical, etc.): Ubuntu 16.04 LTS
We encountered the same error yesterday. 3 manager nodes, 3 worker nodes.
We were load testing our swarm stack and ran into some glitches. While troubleshooting them, we discovered we were unable to retrieve logs from one of the worker nodes, with the same error as above. The node shows as ready, and services are running and reachable on the machine; we just cannot retrieve logs with docker service logs from any of the services on that node.
Output of docker version is identical to above.
Ubuntu 16.04 in AWS on m4.large instances
ping @dperny
do you know if restarting the worker causes the logs to start working again? i'm not suggesting this as a workaround because restarting workers in prod is nontrivial, but if you've tried it and it does or does not work, that would be useful information.
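for anyone who can afford to test it, a cautious restart might look something like the sketch below (node name, SSH access, and systemd are assumptions; the function is only defined here, deliberately not run, since restarting production workers is nontrivial):

```shell
# Hedged sketch: drain the node first so swarm reschedules its tasks,
# restart the daemon, then reactivate. Node name and SSH access are
# placeholders; the function is defined but not invoked.
restart_worker_safely() {
  node="$1"
  docker node update --availability drain "$node"   # move tasks off the node
  ssh "$node" 'sudo systemctl restart docker'       # restart the daemon there
  docker node update --availability active "$node"  # allow tasks back
}
```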
@dperny we will find out if and when it happens again. I don't remember precisely what we did last week. We are on 17.09 now.
We haven't tested this because, unfortunately, both servers are production systems, so I don't want to restart them unless absolutely necessary.
This may be a duplicate of https://github.com/moby/moby/issues/33203 (is that what you were suspecting as well, @dperny ?)
We encountered a similar issue on one of our services deployed using a stack. Restarting and updating the stack image does not help.
xa7xctsl8sk0kfqxtpbj
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: rpc error: code = Internal desc = grpc: received message length 1684353658 exceeding the max size 134217728
docker version
Client:
Version: 17.09.0-ce
API version: 1.32
Go version: go1.8.3
Git commit: afdb6d4
Built: Tue Sep 26 22:40:46 2017
OS/Arch: linux/amd64
Server:
Version: 17.09.0-ce
API version: 1.32 (minimum version 1.12)
Go version: go1.8.3
Git commit: afdb6d4
Built: Tue Sep 26 22:39:27 2017
OS/Arch: linux/amd64
Experimental: false
Same issue here:
root@yello:~# docker service logs staging_core_web
error from daemon in stream: Error grabbing logs: rpc error: code = 2 desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node fp39nbonqsdijfeuhhur9ouqr is not available
Logs don't come back even after worker restart. We're running Docker version 17.06.0-ce, build 02c1d87 on Ubuntu.
Still having this issue with Docker version 17.10.0-ce, build f4ffd25, with slightly different output:
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: rpc error: code = Internal desc = grpc: failed to unmarshal the received message proto: illegal wireType 6
Same issue on a 10-node swarm with 17.09.0-ce. The errors vary by service and are a combination of illegal wireType X and received message length X exceeding the max size 134217728, so it looks like a protocol issue or data corruption.
We're seeing an issue that seems similar, where a docker service logs prod-swarm_helloworld command returns some logs, but ends with an error about two nodes:
prod-swarm_helloworld.2.eqvltu7tn2fu@omni-prod-defaultworker-b4.prod.omni.carezen.net | > [email protected] start /usr/src/app
prod-swarm_helloworld.2.eqvltu7tn2fu@omni-prod-defaultworker-b4.prod.omni.carezen.net | > node server.js
prod-swarm_helloworld.2.eqvltu7tn2fu@omni-prod-defaultworker-b4.prod.omni.carezen.net |
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node qsemjduuwxrl5yawc1ofgfinj is not available, node n5srg4wxoyqexvp1yazbicrlg is not available
This service only has one replica now, although it might have had more (and been updated down to 1) in the past. (Not 100% sure about that -- it might have been removed and re-created since then.)
Another curious thing: Node n5srg4wxoyqexvp1yazbicrlg is one of our other nodes; but qsemjduuwxrl5yawc1ofgfinj isn't, at least not currently. We have added nodes to this Swarm, and removed them, in the past, so it's possible that this is a node that used to exist.
Anything else we can do to gather more information here? In particular, why does Swarm think that this service with one replica should have logs on three nodes, one of which doesn't even exist any more?
Same issue, running Docker version 17.12.0-ce, build c97c6d6 with managers 1 worker.
> This may be a duplicate of #33203 (is that what you were suspecting as well, @dperny ?)
#33203 is still open, but there's a comment there saying that it looks like a duplicate of #33222, which is closed. Sounds from the previous comment like this is still an issue, though?
I did fix it. I don't know what really happened, but to me it sounded like an iptables rules problem. I flushed the rules on the host and it started working again. Hope this helps!
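To make "flush the rules" concrete, the sequence presumably looked something like this sketch (only defined here, not run; note it drops ALL firewall rules on the host, including unrelated policy, and dockerd only recreates its own chains after a daemon restart):

```shell
# Hedged sketch of flushing iptables on the affected host. WARNING: this
# removes every rule in the filter and nat tables; dockerd rebuilds its
# own chains only after a restart. Defined, deliberately not invoked.
flush_host_iptables() {
  iptables -F               # flush the filter table
  iptables -t nat -F        # flush the NAT table (port publishing lives here)
  systemctl restart docker  # let dockerd recreate its chains
}
```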
I am still getting this issue; has there been a fix or workaround for it? I know of one where you just use the -f/--follow option, but other than that?
I'm having the same problem.
Client:
Version: 18.01.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: 03596f5
Built: Wed Jan 10 20:11:05 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.01.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: 03596f5
Built: Wed Jan 10 20:09:37 2018
OS/Arch: linux/amd64
Experimental: false
Docker error:
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node wre79765gdccialsnwxjiy4eh is not available
Passing -f or --follow works: docker service logs -f serviceid
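For reference, the --follow workaround can be wrapped in a small helper (a sketch; the tail count, ten-second timeout, and error matching are assumptions): it tries the plain logs call first and, on the "incomplete log stream" RPC error, retries with --follow, bounded by timeout so the call still returns.

```shell
# Sketch of the --follow fallback. Detects the "incomplete log stream"
# RPC error and retries with --follow; `timeout` bounds the follow so
# the call returns. Tail count and timeout are illustrative.
fetch_service_logs() {
  svc="$1"
  if out=$(docker service logs --tail=1000 "$svc" 2>&1); then
    printf '%s\n' "$out"
    return 0
  fi
  case "$out" in
    *"incomplete log stream"*)
      timeout 10 docker service logs --follow --tail=1000 "$svc" ;;
    *)
      printf '%s\n' "$out" >&2
      return 1 ;;
  esac
}
```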
Same thing. 3 nodes, 1 manager. sudo docker service logs kafka3 fails, but the kafka3 service itself works fine.
Workarounds:
- use the -f option
- reboot node3
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:11:19 2017
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:09:53 2017
OS/Arch: linux/amd64
Experimental: true
We also see the same issue with 17.12.0-ce. We have a plain vanilla setup on Ubuntu 16.04, nothing fancy; after a few hours or days the error starts to appear.
We restart the node and it works for a few hours or days, then it starts again.
Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: rpc error: code = Canceled desc = context canceled
Same issue, Docker version 17.09.1-ce, 3 managers 2 workers, Debian 8, Azure
Updated to 18.02 and we still see this error. However, we did not scale any service up or down; it appears after some time without any outside influence or changes. Restarting nodes (sometimes all of them) in the cluster helps for a few hours/days.
Feb 11 15:40:47 worker-node-3 systemd-udevd[23105]: Could not generate persistent MAC address for veth3f5f398: No such file or directory
Feb 11 15:40:47 worker-node-3 systemd-udevd[23106]: Could not generate persistent MAC address for vetha72eabc: No such file or directory
Feb 11 15:40:47 worker-node-3 kernel: veth11: renamed from vetha72eabc
Feb 11 15:40:47 worker-node-3 kernel: device veth11 entered promiscuous mode
Feb 11 15:40:47 worker-node-3 kernel: device vethc7d3eda entered promiscuous mode
Feb 11 15:40:47 worker-node-3 systemd-udevd[23164]: Could not generate persistent MAC address for vethc7d3eda: No such file or directory
Feb 11 15:40:47 worker-node-3 systemd-udevd[23163]: Could not generate persistent MAC address for veth9950dca: No such file or directory
Feb 11 15:40:47 worker-node-3 kernel: IPv6: ADDRCONF(NETDEV_UP): vethc7d3eda: link is not ready
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=error msg="time=\"2018-02-11T14:40:47Z\" level=error msg=\"resource not found\" host=\"unix:///var/run/rexray/522768172
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=error msg="time=\"2018-02-11T14:40:47Z\" level=error msg=\"error: api call failed\" error.resourceID=\"cr-test_registry
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=error msg="time=\"2018-02-11T14:40:47Z\" level=error msg=\"docker-legacy: Mount: cr-test_registrydb-data: failed: resou
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47.340950579+01:00" level=error msg="fatal task error" error="VolumeDriver.Mount: docker-legacy: Mount: cr-test_registrydb-data: failed
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/3f4bae9977ea0ada5d6d1fc31124cb7f0e1dc8ded
Feb 11 15:40:47 worker-node-3 kernel: IPVS: Creating netns size=2192 id=2276
Feb 11 15:40:47 worker-node-3 kernel: eth0: renamed from veth3f5f398
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered disabled state
Feb 11 15:40:47 worker-node-3 kernel: br0: port 2(veth11) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: br0: port 2(veth11) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: eth1: renamed from veth9950dca
Feb 11 15:40:47 worker-node-3 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethc7d3eda: link becomes ready
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47.954310542+01:00" level=warning msg="failed to deactivate service binding for container cr-test_helmsmandb.1.rovepl7s9jp77pruaw0n1hey
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.753833142+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.754664008+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.755839959+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.756443373+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:52 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:52.363615491+01:00" level=error msg="fatal task error" error="create cr-test_pg-data: driver 'local' already has volume 'cr-test_pg-dat
Similar issue here, though there's something interesting that I don't think anyone has mentioned. Maybe it's useful, maybe it's not. When I try to get the logs for the service, I get the same as the others:
| dbug: gateway.connection error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: rpc error: code = Canceled desc = context canceled
This is a service with only one task, so I listed the tasks for the service:
$ docker service ps gateway_teltonika
ID NAME CURRENT STATE
1p5k253l5ogc gateway_teltonika.1 Running
And tried retrieving the logs for the task itself.
$ docker service logs 1p5k
This works, the logs are there and they're updated.
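For what it's worth, that observation suggests a per-task fallback when the service-level call fails (a sketch; the tail count and separator format are assumptions — docker service logs also accepts a task ID, as used above):

```shell
# Sketch: fetch logs task by task, since `docker service logs <task-id>`
# still worked here when the service-level call failed. Tail count and
# separators are illustrative.
logs_per_task() {
  svc="$1"
  for task in $(docker service ps -q "$svc"); do
    echo "=== task $task ==="
    docker service logs --tail=100 "$task"
  done
}
```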
I'm using version 17.12.0-ce.
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:05:22 2017
OS/Arch: windows/amd64
Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:09:53 2017
OS/Arch: linux/amd64
Experimental: false
I think the certificates have changed and are out of sync between the manager and the worker. I have experienced this with multiple Docker versions. My initial workaround was to create a new swarm (if it happened on the manager node) or to leave and rejoin (if it happened on a worker node).
The fix which seems more like a real solution, and which worked for me, was to execute on the manager node:
docker swarm ca --rotate
This reset the certificates between the nodes, and then I could retrieve the logs from the services again. I could also deploy new services.
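After the rotation, one way to confirm every node picked up the new root certificate is the TLSStatus column of docker node ls (a sketch; it should read Ready on each node once it trusts the rotated CA):

```shell
# Sketch: after `docker swarm ca --rotate`, run on a manager to see
# whether every node trusts the rotated root CA (TLSStatus: Ready).
check_tls_rotation() {
  docker node ls --format '{{.Hostname}}: {{.TLSStatus}}'
}
```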
@alexmnt
Even after
docker swarm ca --rotate
I still receive this error.
I'm on a new install that has been running for 48h.
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:11:19 2017
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:09:53 2017
OS/Arch: linux/amd64
Experimental: false
i think it's likely that the CA rotation is only making logs work again as a side effect. probably something about updating the node. if other things work and logs don't, this is probably an issue in the log broker component on the manager. i haven't had time to look at it yet, for which i sincerely apologize, but i'll try to block out some time to fix it.
the bright side is i've become a much better programmer compared to when we first shipped logs, so when i finally do get around to fixing it it'll likely be much more stable 😄
FYI, we had this issue today with 17.09.1. After a couple of hours of troubleshooting, I cloned one manager and one worker VM to a sandbox network, and there they worked fine immediately after starting.
So then I rebooted all the manager nodes in production and left the worker nodes running, and that fixed the issue (one more good reason to have dedicated manager nodes).
Edit: Now it happened again; here are the logs from all the managers: https://pastebin.com/w4DvPhf7
Same issue in my prod swarm, tracking this issue.
@kevb Which Docker version are you running?
I can see that there are plenty of fixes included in the 18.03.0-ce release, so it would be interesting to hear whether anyone has seen this issue with that one.
I'm seeing this with 18.03
I'm seeing same issue with 18.03
FYI, I created a stress testing tool for SwarmKit which builds the latest version of it from source and puts it into an environment where network failures can be generated on purpose.
I was able to reproduce this issue with it. Here are some logs: https://pastebin.com/raw/8RW5AE4w and https://pastebin.com/raw/2tgz57nH
@dperny / @thaJeztah I can also see that there is now more information than before, thanks to docker/swarmkit#2541.
I hope this helps you find and fix the root cause of the issue.
EDIT: Here is now also the full log from all five nodes: https://pastebin.com/raw/wvFL3Aas