
docker service logs RPC error after network failure in swarm mode from 2/12 nodes

Open SvenAbels opened this issue 8 years ago • 74 comments

Description We have a Docker swarm mode cluster with 12 servers, all running Docker 17.06. Tonight our data center had some network problems and some servers were offline for a few minutes. They are all back online and healthy now.

docker node ls shows them all as ready and active and they also received some tasks, which are running. We can also start new tasks on them without problems.

However, we can no longer retrieve logs from 2 of the 12 nodes. docker service logs --tail=1000 testservice

shows: error from daemon in stream: Error grabbing logs: rpc error: code = 2 desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node 1yjkb8d3oh2tc8hrup7641mau is not available
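When this error appears, the node ID it names can be extracted and cross-checked against the node list. A hedged sketch, using the error text and service name from above; the message format and the cross-check commands are our assumption, not something confirmed in this thread:

```shell
#!/usr/bin/env bash
# Error text as reported above; in practice you would capture it with e.g.:
#   err=$(docker service logs --tail=1000 testservice 2>&1 >/dev/null)
err='error from daemon in stream: Error grabbing logs: rpc error: code = 2 desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node 1yjkb8d3oh2tc8hrup7641mau is not available'

# Pull out the node ID named in the "node <id> is not available" message:
node_id=$(printf '%s\n' "$err" | sed -n 's/.*node \([a-z0-9]*\) is not available.*/\1/p')
echo "$node_id"

# Cross-check the ID against the nodes the swarm currently knows about:
#   docker node ls --filter "id=$node_id"
#   docker node inspect "$node_id" --format '{{.Description.Hostname}}: {{.Status.State}}'
```

If `docker node ls` still lists the ID as Ready while the logs call reports it unavailable, that points at the log broker rather than actual node membership.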

What can we do?

Output of docker version:

Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: false

Additional environment details (AWS, VirtualBox, physical, etc.): Ubuntu 16.04 LTS

SvenAbels avatar Sep 28 '17 06:09 SvenAbels

We encountered the same error yesterday. 3 manager nodes, 3 worker nodes.

We were load testing our swarm stack and ran into some glitches. While attempting to troubleshoot them, we discovered we were unable to retrieve logs from one of the worker nodes, with the same error as above. The node shows as ready, and services are running and reachable on the machine; we just cannot retrieve logs with docker service logs for any of the services on that node.

Output of docker version is identical to above.

Ubuntu 16.04 in AWS on m4.large instances

jdivine-spok avatar Sep 29 '17 14:09 jdivine-spok

ping @dperny

cpuguy83 avatar Oct 05 '17 14:10 cpuguy83

do you know if restarting the worker causes the logs to start working again? i'm not suggesting this as a workaround because restarting workers in prod is nontrivial, but if you've tried it and it does or does not work, that would be useful information.

dperny avatar Oct 05 '17 20:10 dperny

@dperny we will find out if and when it happens again. I don't remember precisely what we did last week. We are on 17.09 now.

jdivine-spok avatar Oct 05 '17 20:10 jdivine-spok

We haven't tested this because, unfortunately, both servers are production systems, so I don't want to restart them unless absolutely necessary.

SvenAbels avatar Oct 06 '17 21:10 SvenAbels

This may be a duplicate of https://github.com/moby/moby/issues/33203 (is that what you were suspecting as well, @dperny ?)

thaJeztah avatar Oct 06 '17 22:10 thaJeztah

We encountered a similar issue on one of our services deployed via a stack. Restarting and updating the stack image does not help.

xa7xctsl8sk0kfqxtpbj
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: rpc error: code = Internal desc = grpc: received message length 1684353658 exceeding the max size 134217728

docker version
Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:40:46 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:39:27 2017
 OS/Arch:      linux/amd64
 Experimental: false

nurtureJamesTan avatar Nov 08 '17 01:11 nurtureJamesTan

Same issue here:

root@yello:~# docker service logs staging_core_web
error from daemon in stream: Error grabbing logs: rpc error: code = 2 desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node fp39nbonqsdijfeuhhur9ouqr is not available

Logs don't come back even after worker restart. We're running Docker version 17.06.0-ce, build 02c1d87 on Ubuntu.

phillipoertel avatar Nov 08 '17 10:11 phillipoertel

Still have this issue with docker version 17.10.0-ce, build f4ffd25, with a little different output:

error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: rpc error: code = Internal desc = grpc: failed to unmarshal the received message proto: illegal wireType 6

phillipoertel avatar Nov 16 '17 15:11 phillipoertel

Same issue on a 10-node swarm running 17.09.0-ce. The errors vary by service and are a combination of illegal wireType X and received message length X exceeding the max size 134217728, so it looks like a protocol issue or data corruption.
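One way to sanity-check the corruption theory (our own observation, not something established in this thread): the absurd "message length" values look like log text being misread as a gRPC frame header. Decoding the length reported earlier as four ASCII bytes:

```shell
#!/usr/bin/env bash
# The length reported in the error above, as a 32-bit value in hex:
len=1684353658
hex=$(printf '%08x' "$len")
echo "$hex"   # 6465327a

# Every hex pair is a printable ASCII byte ("d" "e" "2" "z"), which is
# plausible log text rather than a real frame length.
# (bash's printf %b expands \xHH escapes)
escaped=$(printf '%s' "$hex" | sed 's/\(..\)/\\x\1/g')
printf '%b\n' "$escaped"
```

A genuinely random 4-byte length would rarely decode to all-printable ASCII, so this supports the idea that the log stream's framing slipped and raw log bytes were read as a header.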

osidorkin avatar Nov 29 '17 20:11 osidorkin

We're seeing an issue that seems similar, where a docker service logs prod-swarm_helloworld command returns some logs, but ends with an error about two nodes:

prod-swarm_helloworld.2.eqvltu7tn2fu@omni-prod-defaultworker-b4.prod.omni.carezen.net    | > [email protected] start /usr/src/app
prod-swarm_helloworld.2.eqvltu7tn2fu@omni-prod-defaultworker-b4.prod.omni.carezen.net    | > node server.js
prod-swarm_helloworld.2.eqvltu7tn2fu@omni-prod-defaultworker-b4.prod.omni.carezen.net    |
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node qsemjduuwxrl5yawc1ofgfinj is not available, node n5srg4wxoyqexvp1yazbicrlg is not available

This service only has one replica now, although it might have had more (and been updated down to 1) in the past. (Not 100% sure about that -- it might have been removed and re-created since then.)

Another curious thing: Node n5srg4wxoyqexvp1yazbicrlg is one of our other nodes; but qsemjduuwxrl5yawc1ofgfinj isn't, at least not currently. We have added nodes to this Swarm, and removed them, in the past, so it's possible that this is a node that used to exist.

Anything else we can do to gather more information here? In particular, why does Swarm think that this service with one replica should have logs on three nodes, one of which doesn't even exist any more?
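To confirm whether an ID in the error corresponds to a node that no longer exists, the IDs the error names can be diffed against the swarm's current node list. A hedged sketch: the two example IDs are the ones from the comment above, and the "current" list here is a stand-in for what `docker node ls -q` would return:

```shell
#!/usr/bin/env bash
# IDs named in the error message:
printf '%s\n' qsemjduuwxrl5yawc1ofgfinj n5srg4wxoyqexvp1yazbicrlg | sort > error_nodes.txt

# IDs the swarm currently knows about; in practice:
#   docker node ls -q | sort > current_nodes.txt
# Here we substitute a sample list in which only the second node still exists:
printf '%s\n' n5srg4wxoyqexvp1yazbicrlg | sort > current_nodes.txt

# Lines only in error_nodes.txt are stale IDs the log broker still references:
comm -23 error_nodes.txt current_nodes.txt
```

Any ID printed by the `comm` step is one that Swarm's log machinery references but that no longer appears in the node list, matching the "node that used to exist" theory above.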

jbscare avatar Dec 06 '17 20:12 jbscare

Same issue, running Docker version 17.12.0-ce, build c97c6d6, with managers and 1 worker.

Jeskz0rd avatar Jan 17 '18 19:01 Jeskz0rd

This may be a duplicate of #33203 (is that what you were suspecting as well, @dperny ?)

#33203 is still open, but there's a comment there saying that it looks like a duplicate of #33222, which is closed. Sounds from the previous comment like this is still an issue, though?

jbscare avatar Jan 17 '18 19:01 jbscare

I did fix it. I don't know what really happened, but to me it looked like an iptables rules problem. I flushed the rules on the host and it started working again. Hope this helps. Thanks all!

Jeskz0rd avatar Jan 19 '18 18:01 Jeskz0rd

I am still getting this issue; has there been a fix or workaround for it? I know of one where you just use the -f/--follow option, but is there anything else?

mlinsky avatar Jan 19 '18 20:01 mlinsky

I'm having the same problem.

Client:
 Version:       18.01.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    03596f5
 Built: Wed Jan 10 20:11:05 2018
 OS/Arch:       linux/amd64
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:      18.01.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   03596f5
  Built:        Wed Jan 10 20:09:37 2018
  OS/Arch:      linux/amd64
  Experimental: false

Docker error:

error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node wre79765gdccialsnwxjiy4eh is not available

Passing the -f or --follow flag works: docker service logs -f serviceid

cristhianbicca avatar Jan 23 '18 12:01 cristhianbicca

Same thing. 3 nodes, 1 manager. sudo docker service logs kafka3 fails, while the kafka3 service itself works fine.

Workarounds

  • use -f option
  • reboot node3
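The -f workaround can be wrapped in a small shell function that falls back to follow mode when the plain call fails. A hedged sketch; the service name and tail sizes are examples, and the fallback behavior relies on commenters' reports that follow mode still streams:

```shell
#!/usr/bin/env bash
# Try the normal logs call first; if it fails (e.g. with the RPC error seen
# in this thread), fall back to follow mode with a bounded tail.
service_logs() {
    local svc="$1"
    if ! docker service logs --tail=1000 "$svc"; then
        echo "plain 'service logs' failed; falling back to --follow" >&2
        docker service logs -f --tail=100 "$svc"
    fi
}

# Example usage (press Ctrl-C to stop following):
#   service_logs kafka3
```

This only papers over the broken non-follow code path; rebooting the affected node remains the heavier-weight option listed above.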
Client:
 Version:       17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built: Wed Dec 27 20:11:19 2017
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:09:53 2017
  OS/Arch:      linux/amd64
  Experimental: true

maxim-ge avatar Jan 26 '18 11:01 maxim-ge

We also see the same issue with 17.12.0-ce. We have a plain vanilla setup on Ubuntu 16.04, nothing fancy; after a few hours or days the error starts to appear.

We restart the node and for a few hours/days it works then it starts again.

Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: rpc error: code = Canceled desc = context canceled

Vad1mo avatar Feb 06 '18 12:02 Vad1mo

Same issue, Docker version 17.09.1-ce, 3 managers 2 workers, Debian 8, Azure

DmitriyProkhorov avatar Feb 07 '18 11:02 DmitriyProkhorov

Updated to 18.02 and we still see this error, even though we did not scale any service down or up. It appears after some time, without any influence or changes from outside. Restarting nodes (sometimes all of them) in the cluster helps for a few hours/days.

Feb 11 15:40:47 worker-node-3 systemd-udevd[23105]: Could not generate persistent MAC address for veth3f5f398: No such file or directory
Feb 11 15:40:47 worker-node-3 systemd-udevd[23106]: Could not generate persistent MAC address for vetha72eabc: No such file or directory
Feb 11 15:40:47 worker-node-3 kernel: veth11: renamed from vetha72eabc
Feb 11 15:40:47 worker-node-3 kernel: device veth11 entered promiscuous mode
Feb 11 15:40:47 worker-node-3 kernel: device vethc7d3eda entered promiscuous mode
Feb 11 15:40:47 worker-node-3 systemd-udevd[23164]: Could not generate persistent MAC address for vethc7d3eda: No such file or directory
Feb 11 15:40:47 worker-node-3 systemd-udevd[23163]: Could not generate persistent MAC address for veth9950dca: No such file or directory
Feb 11 15:40:47 worker-node-3 kernel: IPv6: ADDRCONF(NETDEV_UP): vethc7d3eda: link is not ready
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=error msg="time=\"2018-02-11T14:40:47Z\" level=error msg=\"resource not found\" host=\"unix:///var/run/rexray/522768172
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=error msg="time=\"2018-02-11T14:40:47Z\" level=error msg=\"error: api call failed\" error.resourceID=\"cr-test_registry
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=error msg="time=\"2018-02-11T14:40:47Z\" level=error msg=\"docker-legacy: Mount: cr-test_registrydb-data: failed: resou
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47.340950579+01:00" level=error msg="fatal task error" error="VolumeDriver.Mount: docker-legacy: Mount: cr-test_registrydb-data: failed
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47+01:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/3f4bae9977ea0ada5d6d1fc31124cb7f0e1dc8ded
Feb 11 15:40:47 worker-node-3 kernel: IPVS: Creating netns size=2192 id=2276
Feb 11 15:40:47 worker-node-3 kernel: eth0: renamed from veth3f5f398
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered disabled state
Feb 11 15:40:47 worker-node-3 kernel: br0: port 2(veth11) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: br0: port 2(veth11) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: eth1: renamed from veth9950dca
Feb 11 15:40:47 worker-node-3 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethc7d3eda: link becomes ready
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 kernel: docker_gwbridge: port 3(vethc7d3eda) entered forwarding state
Feb 11 15:40:47 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:47.954310542+01:00" level=warning msg="failed to deactivate service binding for container cr-test_helmsmandb.1.rovepl7s9jp77pruaw0n1hey
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.753833142+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.754664008+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.755839959+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:48 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:48.756443373+01:00" level=error msg="logs call failed" error="failed getting container logs: No such container: cr-test_helmsmandb.1.zz
Feb 11 15:40:52 worker-node-3 dockerd[13857]: time="2018-02-11T15:40:52.363615491+01:00" level=error msg="fatal task error" error="create cr-test_pg-data: driver 'local' already has volume 'cr-test_pg-dat

Vad1mo avatar Feb 11 '18 14:02 Vad1mo

Similar issue here, though there's something interesting that I don't think anyone has mentioned. Maybe it's useful, maybe it's not. When I try to get the logs for the service, I get the same error as others:

| dbug: gateway.connection    error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: rpc error: code = Canceled desc = context canceled

This is a service with only one task, so I listed the tasks for the service:

$ docker service ps gateway_teltonika
ID                  NAME                      CURRENT STATE
1p5k253l5ogc        gateway_teltonika.1       Running

And tried retrieving the logs for the task itself.

$ docker service logs 1p5k

This works, the logs are there and they're updated.

I'm using version 17.12.0-ce.

Client:
 Version:       17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built: Wed Dec 27 20:05:22 2017
 OS/Arch:       windows/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:09:53 2017
  OS/Arch:      linux/amd64
  Experimental: false

goncalo-oliveira avatar Mar 08 '18 10:03 goncalo-oliveira

I think the certificates have changed and are out of sync between the manager and the worker. I experienced this with multiple Docker versions. My initial workaround was to create a new swarm (if it happened on the manager node) or to leave and rejoin (if it happened on a worker node).

The fix which seems more like a proper solution, and which worked for me, was to execute the following on the manager node:

docker swarm ca --rotate

This reset the certificates between the nodes, after which I could retrieve the logs from the services again. I could also deploy new services.

alexmnt avatar Mar 14 '18 13:03 alexmnt

@alexmnt Even after docker swarm ca --rotate I still receive this error. I'm on a new install that has been running for 48 hours.

Client:
 Version:	17.12.0-ce
 API version:	1.35
 Go version:	go1.9.2
 Git commit:	c97c6d6
 Built:	Wed Dec 27 20:11:19 2017
 OS/Arch:	linux/amd64

Server:
 Engine:
  Version:	17.12.0-ce
  API version:	1.35 (minimum version 1.12)
  Go version:	go1.9.2
  Git commit:	c97c6d6
  Built:	Wed Dec 27 20:09:53 2017
  OS/Arch:	linux/amd64
  Experimental:	false

Voovode avatar Mar 19 '18 08:03 Voovode

i think it's likely that the CA rotation is only making logs work again as a side effect. probably something about updating the node. if other things work and logs don't, this is probably an issue in the log broker component on the manager. i haven't had time to look at it yet, for which i sincerely apologize, but i'll try to block out some time to fix it.

the bright side is i've become a much better programmer compared to when we first shipped logs, so when i finally do get around to fixing it it'll likely be much more stable 😄

dperny avatar Mar 19 '18 17:03 dperny

FYI, we had this issue today with 17.09.1. After a couple of hours of troubleshooting, we cloned one manager VM and one worker VM to a sandbox network, where they worked fine immediately after starting.

So I then rebooted all the manager nodes in production while leaving the worker nodes running, and that fixed the issue (one more good reason to have dedicated manager nodes).

Edit: Now it happened again and here is logs from all managers: https://pastebin.com/w4DvPhf7

olljanat avatar Apr 11 '18 17:04 olljanat

Same issue in my prod swarm, tracking this issue.

kevb avatar Apr 29 '18 03:04 kevb

@kevb Which Docker version are you running?

I can see that there are plenty of fixes included in the 18.03.0-ce release, so it would be interesting to hear whether anyone has seen this issue with that one.

olljanat avatar Apr 30 '18 05:04 olljanat

I'm seeing this with 18.03

Shookit avatar Apr 30 '18 11:04 Shookit

I'm seeing same issue with 18.03

mrpatrick avatar Apr 30 '18 19:04 mrpatrick

FYI, I created a stress-testing tool for Swarmkit which builds the latest version of it and puts it into an environment where network failures can be generated on purpose.

I was able to reproduce this issue with it. Here are some logs: https://pastebin.com/raw/8RW5AE4w and https://pastebin.com/raw/2tgz57nH

@dperny / @thaJeztah I can also see that there is now more information than earlier because of docker/swarmkit#2541

I hope this helps you find and fix the root cause of the issue.

EDIT: Here is now also full log from all the five nodes: https://pastebin.com/raw/wvFL3Aas

olljanat avatar May 01 '18 12:05 olljanat