Weaveworks net-plugin blocks network traffic between hosts after some time

Open JensVD opened this issue 2 years ago • 2 comments

What you expected to happen?

We expected the 'weaveworks/net-plugin' Docker plugin to keep working once it was installed and configured. It always works at the beginning, but it does not stay that way.

What happened?

We installed the 'weaveworks/net-plugin' Docker plugin on our Docker swarm cluster. Initially everything works, but after a while it starts breaking. So far we have identified two scenarios in which all traffic between the interfaces on multiple hosts fails when using weaveworks/net-plugin as the Docker overlay network:

  • Installing the weaveworks/net-plugin before joining the node to the swarm. In this case traffic never works and we have to uninstall and reinstall the plugin. We have adapted our procedures to make sure this cannot happen anymore.
  • The other scenario is a lot worse for us. Everything is installed according to procedure and everything works, but after a while, or for example when updating the service that uses this network, all traffic gets blocked on one of the hosts in the cluster. This breaks the service because that host is no longer reachable from the others, which causes a lot of problems.

It appears that several iptables rules are added or deleted, which causes the traffic to break.
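
One way to confirm which rules change is to save the rule set while traffic still works and compare it with the rule set after traffic breaks. The following is only a rough sketch, not something from the original report: the function names and the file paths under /tmp are hypothetical, and it assumes `iptables-save`, `sed`, `diff` and `mktemp` are available on the host.

```shell
#!/bin/sh
# Sketch: compare two iptables-save snapshots while ignoring noise.

# normalize_rules FILE
# Drops comment lines (they contain timestamps) and packet/byte counters
# such as "[123:4567]" so that only real rule changes show up in a diff.
normalize_rules() {
    sed -e '/^#/d' -e 's/\[[0-9]*:[0-9]*\]//g' "$1"
}

# diff_rules BEFORE AFTER
# Prints rules that were removed ("<") or added (">") between the snapshots.
# Returns the exit status of diff (0 = identical, 1 = different).
diff_rules() {
    before=$(mktemp)
    after=$(mktemp)
    normalize_rules "$1" > "$before"
    normalize_rules "$2" > "$after"
    diff "$before" "$after"
    status=$?
    rm -f "$before" "$after"
    return "$status"
}

# Usage (as root, on the affected host; paths are placeholders):
#   iptables-save > /tmp/rules.good      # while traffic works
#   iptables-save > /tmp/rules.broken    # once traffic is blocked
#   diff_rules /tmp/rules.good /tmp/rules.broken
```

Attaching the resulting diff to the issue would show exactly which chains and rules differ between the working and the broken state.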

How to reproduce it?

The first scenario is fairly easy to reproduce:

  • Create a docker swarm cluster
  • Provision a new node and install the weaveworks/net-plugin on a new docker host
  • Join that host to the cluster

For the second one, on the other hand, we are not sure how to reproduce it, as it just happens at random intervals. We have noticed that it happens a lot more often in our test environment, where the services are updated more often. A way to reproduce it might be:

  • Create a docker swarm cluster and install the weaveworks/net-plugin
  • Provision a service using the weaveworks/net-plugin
  • Continuously trigger updates on the service and after a while it should break
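
The last step above could be scripted roughly as follows. This is a sketch under assumptions: `force_updates` and the service name are hypothetical, and it relies on `docker service update --force`, which redeploys the tasks of a service even when its configuration has not changed. The `DOCKER` variable only exists so the command can be swapped out for a dry run.

```shell
#!/bin/sh
# Sketch: force repeated rolling updates of one swarm service.
# DOCKER can be overridden (for example DOCKER=echo) to do a dry run.
DOCKER="${DOCKER:-docker}"

# force_updates SERVICE COUNT
# Runs "docker service update --force SERVICE" COUNT times and stops
# at the first update that fails.
force_updates() {
    service=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        "$DOCKER" service update --force "$service" || return 1
        i=$((i + 1))
    done
}

# Usage (on a manager node; the service name is a placeholder):
#   force_updates mystack_myservice 50
```

Checking connectivity between the hosts after each batch of updates should reveal whether the failure correlates with the updates.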

Anything else we need to know?

The infrastructure is as follows:

  • Several docker swarm hosts on the Flatcar OS
  • Flatcar OS hosts are provisioned as VMs using VMware
  • All docker swarm services are provisioned using a docker stack YAML file

Versions:

$ docker plugin inspect weaveworks/net-plugin:latest_release

[
    {
        "Config": {
            "Args": {
                "Description": "",
                "Name": "",
                "Settable": null,
                "Value": null
            },
            "Description": "Weave Net plugin for Docker",
            "DockerVersion": "19.03.8",
            "Documentation": "https://weave.works",
            "Entrypoint": [
                "/home/weave/launch.sh"
            ],
            "Env": [
                {
                    "Description": "Log level",
                    "Name": "LOG_LEVEL",
                    "Settable": [
                        "value"
                    ],
                    "Value": ""
                },
                {
                    "Description": "Extra args to `weaver` and `plugin`",
                    "Name": "EXTRA_ARGS",
                    "Settable": [
                        "value"
                    ],
                    "Value": ""
                },
                {
                    "Description": "Encryption password",
                    "Name": "WEAVE_PASSWORD",
                    "Settable": [
                        "value"
                    ],
                    "Value": ""
                },
                {
                    "Description": "MTU",
                    "Name": "WEAVE_MTU",
                    "Settable": [
                        "value"
                    ],
                    "Value": ""
                },
                {
                    "Description": "Enable multicast for all Weave networks",
                    "Name": "WEAVE_MULTICAST",
                    "Settable": [
                        "value"
                    ],
                    "Value": ""
                },
                {
                    "Description": "The range of IP addresses used by Weave Net",
                    "Name": "IPALLOC_RANGE",
                    "Settable": [
                        "value"
                    ],
                    "Value": ""
                }
            ],
            "Interface": {
                "Socket": "weave.sock",
                "Types": [
                    "docker.networkdriver/1.0"
                ]
            },
            "IpcHost": false,
            "Linux": {
                "AllowAllDevices": false,
                "Capabilities": [
                    "CAP_NET_ADMIN",
                    "CAP_SYS_ADMIN",
                    "CAP_SYS_MODULE"
                ],
                "Devices": null
            },
            "Mounts": [
                {
                    "Description": "",
                    "Destination": "/host/proc/",
                    "Name": "",
                    "Options": [
                        "rbind",
                        "rw"
                    ],
                    "Settable": null,
                    "Source": "/proc/",
                    "Type": "bind"
                },
                {
                    "Description": "",
                    "Destination": "/var/run/docker.sock",
                    "Name": "",
                    "Options": [
                        "rbind"
                    ],
                    "Settable": null,
                    "Source": "/var/run/docker.sock",
                    "Type": "bind"
                },
                {
                    "Description": "",
                    "Destination": "/host/var/lib/",
                    "Name": "",
                    "Options": [
                        "rbind"
                    ],
                    "Settable": null,
                    "Source": "/var/lib/",
                    "Type": "bind"
                },
                {
                    "Description": "",
                    "Destination": "/host/etc/",
                    "Name": "",
                    "Options": [
                        "rbind"
                    ],
                    "Settable": null,
                    "Source": "/etc/",
                    "Type": "bind"
                },
                {
                    "Description": "",
                    "Destination": "/lib/modules/",
                    "Name": "",
                    "Options": [
                        "rbind"
                    ],
                    "Settable": null,
                    "Source": "/lib/modules/",
                    "Type": "bind"
                }
            ],
            "Network": {
                "Type": "host"
            },
            "PidHost": false,
            "PropagatedMount": "",
            "User": {},
            "WorkDir": "",
            "rootfs": {
                "diff_ids": [
                    "sha256:cb20fb41d4e05b17c03185ccd0b8d21dba65cf6c9fdfed2565d5fb9d5eba84c6"
                ],
                "type": "layers"
            }
        },
        "Enabled": true,
        "Id": "c0c227b0f1645fc6a9d3deff138cb6f77f00c6f76581b775735ea3b8846fdaa9",
        "Name": "weaveworks/net-plugin:latest_release",
        "PluginReference": "docker.io/weaveworks/net-plugin:latest_release",
        "Settings": {
            "Args": [],
            "Devices": [],
            "Env": [
                "LOG_LEVEL=",
                "EXTRA_ARGS=",
                "WEAVE_PASSWORD=*********",
                "WEAVE_MTU=",
                "WEAVE_MULTICAST=1",
                "IPALLOC_RANGE=10.10.254.0/24"
            ],
            "Mounts": [
                {
                    "Description": "",
                    "Destination": "/host/proc/",
                    "Name": "",
                    "Options": [
                        "rbind",
                        "rw"
                    ],
                    "Settable": null,
                    "Source": "/proc/",
                    "Type": "bind"
                },
                {
                    "Description": "",
                    "Destination": "/var/run/docker.sock",
                    "Name": "",
                    "Options": [
                        "rbind"
                    ],
                    "Settable": null,
                    "Source": "/var/run/docker.sock",
                    "Type": "bind"
                },
                {
                    "Description": "",
                    "Destination": "/host/var/lib/",
                    "Name": "",
                    "Options": [
                        "rbind"
                    ],
                    "Settable": null,
                    "Source": "/var/lib/",
                    "Type": "bind"
                },
                {
                    "Description": "",
                    "Destination": "/host/etc/",
                    "Name": "",
                    "Options": [
                        "rbind"
                    ],
                    "Settable": null,
                    "Source": "/etc/",
                    "Type": "bind"
                },
                {
                    "Description": "",
                    "Destination": "/lib/modules/",
                    "Name": "",
                    "Options": [
                        "rbind"
                    ],
                    "Settable": null,
                    "Source": "/lib/modules/",
                    "Type": "bind"
                }
            ]
        }
    }
]
$ docker version
Client:
 Version:           20.10.11
 API version:       1.41
 Go version:        go1.17.5
 Git commit:        e8f1871b07
 Built:             Fri Dec 10 17:40:54 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.11
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.5
  Git commit:       f6348707ab
  Built:            Fri Dec 10 17:40:03 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.5.8
  GitCommit:        cde01e96ed658bc5050abe1bb601b4b4510ba7a2
 runc:
  Version:          1.0.3+dev.docker-20.10
  GitCommit:        e4bccdbd64361ac5ea8ba90bb8845add78f957a6
 docker-init:
  Version:          0.19.0de40ad007797e0dcd8b7126f27bb87401d224240
  GitCommit:
$ uname -a
Linux cordocker01.qa.****.be 5.10.84-flatcar #1 SMP Fri Dec 10 17:33:00 -00 2021 x86_64 Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz GenuineIntel GNU/Linux

Logs:

$ docker logs weave
Don't think it is possible to get logs from a docker plugin

Network:

$ ip route
default via 10.5.0.1 dev ens192 proto static
10.5.0.0/16 dev ens192 proto kernel scope link src 10.5.0.6
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.18.0.0/16 dev docker_gwbridge proto kernel scope link src 172.18.0.1
172.23.126.0/25 dev ens256 proto kernel scope link src 172.23.126.3

$ ip -4 -o addr
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens192    inet 10.5.0.6/16 brd 10.5.255.255 scope global ens192\       valid_lft forever preferred_lft forever
3: ens256    inet 172.23.126.3/25 brd 172.23.126.127 scope global ens256\       valid_lft forever preferred_lft forever
11: docker0    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\       valid_lft forever preferred_lft forever
12: docker_gwbridge    inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge\       valid_lft forever preferred_lft forever

If you require any more information, just let us know and we'll see what we can do.

Thank you!

JensVD avatar Jan 14 '22 08:01 JensVD

Hello

Has anybody been able to look at this issue? This week it has happened more than 3 times in each environment, forcing us to uninstall and reinstall the plugin, after which it works again. It's very weird that traffic is blocked between containers when using this plugin.

All help is greatly appreciated!

Thank you!

JensVD avatar Jan 25 '22 13:01 JensVD

Hi @JensVD,

First off, I am not part of the Weaveworks team, ok? Just a fellow dev trying to help out.

So... I've experienced a similar issue recently. At first, services and containers deployed fine and could communicate, and after a while they lost communication with each other. It turns out the problem happened when we scaled or updated a service, or when one of its containers crashed and was brought back up by Swarm. The container would come back up with an out-of-sync IP address. E.g.: "docker network inspect" states that container A has 10.32.0.5, but in Swarm's DNS it resolves to something else, usually one number higher, i.e. 10.32.0.6.
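
A quick way to check for this kind of mismatch is to compare the address reported by "docker network inspect" with the address returned by a DNS lookup from inside another container. The sketch below is only an illustration: `ips_match` and all container, service and network names are placeholders, and it assumes `getent` exists in the container image. Note that on a swarm-scoped network, "docker network inspect" run on a node typically lists only the containers on that node.

```shell
#!/bin/sh
# Sketch: detect an out-of-sync address between Docker's view and DNS.

# ips_match INSPECT_ADDR DNS_ADDR
# INSPECT_ADDR is in CIDR form as printed by "docker network inspect"
# (for example 10.32.0.5/12); DNS_ADDR is a bare address.
# Succeeds only when both refer to the same IP.
ips_match() {
    [ "${1%%/*}" = "$2" ]
}

# Usage (all names are placeholders):
#   inspect_ip=$(docker network inspect \
#       -f '{{range .Containers}}{{if eq .Name "CONTAINER_A"}}{{.IPv4Address}}{{end}}{{end}}' \
#       myNetwork)
#   dns_ip=$(docker exec CONTAINER_B getent hosts CONTAINER_A | awk '{print $1}')
#   ips_match "$inspect_ip" "$dns_ip" || echo "mismatch: $inspect_ip vs $dns_ip"
```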

I figured out that the problem was caused by the --attachable flag when creating the network with the weaveworks/net-plugin driver. Just don't use --attachable and it should be fine. I don't know exactly why that happens; perhaps it's a bug, or perhaps it is so by design. In most cases there's no good reason to use that flag anyway, because once the network is integrated into Swarm, Swarm will always be able to schedule containers onto it, regardless of whether it is attachable or not. The exception is attaching stand-alone containers that are external to your Swarm cluster, but then you'll probably be better off with a more specialized tool for that task, like Consul, which I believe you can integrate into your weave network.

In addition to that, depending on the version of your Docker Engine, you may or may not have to use a template network. When I started my project, I was on Docker Engine 19. Back then, in order for Weave Net to work properly, I had to declare a template network on each node, like this:

docker network create --config-only --subnet 10.32.0.0/12 --driver weaveworks/net-plugin:2.8.1 --gateway 10.32.0.1 myTemplateNetwork

And only then create the actual network on the master node:

docker network create --config-from myTemplateNetwork --scope swarm myNetwork

If I didn't do that, I would face some communication problems with my services, too.

Ever since I upgraded to Docker Engine 20+, not only do I no longer need the template, it doesn't even work if I try to use it. But I can simply skip that part and define everything in the actual network without a template, and it works like a charm. In other words, I just need a single command now:

docker network create --subnet 10.32.0.0/12 --gateway 10.32.0.1 --scope swarm --driver weaveworks/net-plugin:2.8.1 myNetwork

Hope I've been able to help.

PS: It is also a good idea to check the output of "weave report". That's the best way to find out what might be going wrong with your weave service. Also, be sure to allow TCP port 6783 on your firewall.

Cheers!

AsterixMechant avatar Feb 11 '22 23:02 AsterixMechant