
Service is not DNS resolvable from another one if containers run on different nodes

Open vasily-kirichenko opened this issue 9 years ago • 46 comments

I have two services running a single container each, on different nodes, using same "overlay" network. When I try to ping one container from inside the other via service name, it fails:

ping akka-test
ping: bad address 'akka-test'

After I scaled the akka-test service so that a container runs on the node where the other container is running, everything suddenly starts to work.

So my question is: is my assumption valid that services should be discoverable across the entire Swarm? I mean, the name of a service should be DNS resolvable from any other container in this Swarm, no matter where the containers are running.

$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
255fedab2fc4        bridge              bridge              local
9a450f033c48        docker_gwbridge     bridge              local
6e76844033f8        host                host                local
dzwgdein8cxa        ingress             overlay             swarm
54uqc60vx1i5        net2                overlay             swarm
d632a42ef140        none                null                local
$ docker service ls
ID            NAME         REPLICAS  IMAGE                             COMMAND
0wyv4gq14mnu  akka-test    8/8       xxxx:5000/akkahttp1:1.20
cg7r4ius7xfm  akka-test-2  1/1       xxxx:5000/akkahttp1:1.20
$ docker service inspect --pretty akka-test
ID:             0wyv4gq14mnuj8kfolizh1h23
Name:           akka-test
Mode:           Replicated
 Replicas:      8
Placement:
UpdateConfig:
 Parallelism:   1
 On failure:    pause
ContainerSpec:
 Image:         xxxx:5000/akkahttp1:1.20
Resources:
Networks: 54uqc60vx1i57d3qnmhza82c4
$ docker service inspect --pretty akka-test-2
ID:             cg7r4ius7xfmgvazmptvarn2k
Name:           akka-test-2
Mode:           Replicated
 Replicas:      1
Placement:
UpdateConfig:
 Parallelism:   1
 On failure:    pause
ContainerSpec:
 Image:         xxxx:5000/akkahttp1:1.20
Resources:
Networks: 54uqc60vx1i57d3qnmhza82c4
$ docker info
Containers: 75
 Running: 11
 Paused: 0
 Stopped: 64
Images: 42
Server Version: 1.12.1-rc1
Storage Driver: devicemapper
 Pool Name: docker-253:0-135409124-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 8.291 GB
 Data Space Total: 107.4 GB
 Data Space Available: 40.86 GB
 Metadata Space Used: 19.61 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.128 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null overlay host bridge
Swarm: active
 NodeID: ao1wz862t6n4yog4hpi4yqm20
 Is Manager: true
 ClusterID: 3hpbbe2jtdoqe1zvxs41cycoq
 Managers: 3
 Nodes: 4
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: xxxx
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.28.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 188.6 GiB
Name: xxxx
ID: OWEH:OIIR:7NZ6:IKZV:RFJ4:NXAZ:NH7H:WPLC:D457:DKGN:CH2C:E2UE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 127.0.0.0/8

vasily-kirichenko avatar Aug 24 '16 12:08 vasily-kirichenko

I'm seeing this too. I'm using Docker for AWS, and this has happened both on beta4 and now on beta5. Service names are sometimes unresolvable, sometimes resolvable but with no route to host; it also works sometimes. I've so far been unable to reliably reproduce it from scratch.

kaii-zen avatar Aug 25 '16 15:08 kaii-zen

Because of some networking limitations (I think related to virtual IPs), the ping tool will not work with overlay networking. Are your service names resolvable with other tools like dig?

Take a look at this guide, if you haven't already: https://docs.docker.com/engine/swarm/networking/
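
A minimal check along those lines, from inside any container attached to the same overlay network (a sketch; web is a placeholder service name):

# Query Docker's embedded DNS server (always at 127.0.0.11) directly.
# busybox nslookup takes the server as an optional second argument:
nslookup web 127.0.0.11

# In an image that ships dig (e.g. after `apk add bind-tools` on Alpine):
dig @127.0.0.11 web +short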

dperny avatar Aug 26 '16 18:08 dperny

@dperny Thanks, will check with dig.

vasily-kirichenko avatar Aug 27 '16 08:08 vasily-kirichenko

Sure. Let me know whether or not that fixes the issue, so I can know to close the issue or take a deeper look.

dperny avatar Aug 29 '16 16:08 dperny

I could not find a docker image with dig installed, so I tested with nslookup. It could not resolve the service if the container was running on a different node.

vasily-kirichenko avatar Aug 29 '16 18:08 vasily-kirichenko

Can you give some more information for reproducing? I tried to reproduce by creating a 3 node cluster with 1 manager.

# create new network
$ docker network create --driver overlay net
# create web service
$ docker service create --network net --name web nginx
# web landed on node-2
# create busybox service for lookups
$ docker service create --network net --name probe busybox sleep 3000
# probe landed on node-3
# now, from node 3
$ docker exec -it <busybox container id> /bin/sh

/ # nslookup web
Server:    127.0.0.11
Address 1: 127.0.0.11

Name:      web
Address 1: 10.0.0.2
/ # nslookup probe
Server:    127.0.0.11
Address 1: 127.0.0.11

Name:      probe
Address 1: 10.0.0.4
/ # nslookup butterpecans
Server: 127.0.0.11
Address 1: 127.0.0.11

nslookup: can't resolve 'butterpecans'

So this appears to work for me.

dperny avatar Aug 29 '16 18:08 dperny

Do you have TCP port 7946 open on your hosts? Gossip needs that port open for networking to work correctly.
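
For example, on CentOS 7 hosts like the reporter's, one might verify and open the swarm ports roughly like this (a sketch, assuming firewalld manages the firewall; adapt for ufw or security groups):

# Gossip (the control plane that carries the DNS records) uses 7946 tcp+udp,
# the VXLAN data plane uses 4789/udp, and managers listen on 2377/tcp:
firewall-cmd --permanent --add-port=7946/tcp --add-port=7946/udp
firewall-cmd --permanent --add-port=4789/udp --add-port=2377/tcp
firewall-cmd --reload

# Quick reachability check from another node (UDP results from nc are
# only indicative, since UDP is connectionless):
nc -vz <node-ip> 7946
nc -vzu <node-ip> 7946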

dperny avatar Aug 29 '16 21:08 dperny

@dperny create your services without the vip endpoint mode. The failure occurs with dnsrr for certain; it may occur with any mode that doesn't generate a proxy address.

Ayiga avatar Sep 09 '16 15:09 Ayiga

@ayiga Just tried the above steps but added --endpoint-mode dnsrr and it resolves properly. Is your failure intermittent, or consistent?
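
For reference, the repro with round-robin DNS looks like this (a sketch reusing the placeholder names from the earlier repro):

docker network create --driver overlay net
# dnsrr skips the virtual IP; the service name resolves directly to the
# individual task IPs:
docker service create --network net --name web --endpoint-mode dnsrr nginx
docker service create --network net --name probe busybox sleep 3000
# On the node running the probe task:
docker exec -it <busybox container id> nslookup web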

dperny avatar Sep 09 '16 18:09 dperny

It's consistent. In my experience, DNS resolution is only capable of resolving containers that exist on the same node. The manager is capable of resolving containers throughout the swarm (sometimes it isn't; I'm not sure of the cause), but this issue is primarily with worker nodes.

I did a full write up of my steps in the post: https://forums.docker.com/t/container-dns-resolution-in-ingress-overlay-network/21399 for Docker for AWS. However, the issue is easily reproducible from my personal setup, between a Linux Box (Ubuntu variant), and my Mac using Docker for Mac.

Ayiga avatar Sep 11 '16 01:09 Ayiga

I am also experiencing this issue, with my setup as follows. I have three AWS EC2 nodes, all on a private shared network where they can communicate on all ports (I have verified all nodes can reach all other nodes on the ports specified in the Swarm 1.12 documentation). I create containers on a shared overlay network (I verified the overlay interface exists and is correctly routed through the specified subnet), and only when two containers are on the same node can they communicate via their VIP or hostname. When containers are on different nodes, I receive a "no route to host" message when they attempt to connect to each other.

c4wrd avatar Sep 15 '16 15:09 c4wrd

@Ayiga @vasily-kirichenko I actually just resolved this by changing the subnet of my overlay network. Previously it had been on 172.0.0.0/24, which I believe was conflicting with the Docker networking interfaces (even though it doesn't appear to overlap them). Now I can resolve containers on other nodes by hostname and VIP without issue. Here's how I created the network, for reference:

docker network create \
    --driver overlay \
    --subnet 10.10.9.0/24 \
    selenium-grid

c4wrd avatar Sep 15 '16 16:09 c4wrd

Is there any further resolution on this?

I'm running an AWS 4-node (3 manager, 1 worker) swarm, all on Docker 1.13.1. I'm using Docker Compose with an external overlay network created in attachable mode, using a subnet different from the hosts', for all the services in the compose file, and deploying it with docker stack deploy --compose-file.

Even if I add another 3 nodes as dedicated docker managers with availability set to drained, with everything else in worker mode, I still encounter services that cannot access other services over the overlay network. All the services are defined in the compose file.

Attempting to resolve a service name to an IP address via dig or nslookup (using 127.0.0.11 as the DNS server) returns no records for other tasks running on that overlay network.
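
For anyone comparing notes, the lookups in question take this form from inside a task container (a sketch; myservice is a placeholder for a service on the overlay network):

# The plain service name should return the service VIP:
dig @127.0.0.11 myservice +short
# tasks.<service> should return one A record per task; if tasks running
# on other nodes are missing here, the control-plane records are not
# propagating to this node:
dig @127.0.0.11 tasks.myservice +short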

Docker Info

Containers: 5
 Running: 1
 Paused: 0
 Stopped: 4
Images: 10
Server Version: 1.13.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 89
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: fk3f2buol2b6azvqap8pdhzup
 Is Manager: false
 Node Address: 11.0.12.39
 Manager Addresses:
  11.0.10.7:2377
  11.0.11.18:2377
  11.0.12.45:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-62-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 59.97 GiB
Name: ip-11-0-12-39
ID: 3XUZ:DCGO:F474:GNKB:2VN6:ZJYE:LPWJ:SPOS:HGR3:UJVX:RATM:TMRT
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

glorious-beard avatar Feb 17 '17 01:02 glorious-beard

@zen-chetan Try @c4wrd's suggestion of using a different subnet to see if that resolves the issue.

I've seen an issue like this before and it was because AWS nodes had /etc/resolv.conf that pointed to a 10.0.0.x IP address in the VPC subnet (common), but Docker DNS was getting confused because the subnets of the created overlay(s) would also be in that range.

I'd argue that maybe the default subnet for overlay networks should be changed, as it overlaps with a very common internal IP subnet; e.g., the getting-started guide for Amazon VPC uses 10.0.0.0/24.


At the very least this should probably be covered in the Docker documentation.
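
A quick way to spot such an overlap on a given node (a sketch; mynet is a placeholder overlay network name):

# The host/VPC routes as the node sees them:
ip route show
# The subnet(s) actually assigned to the overlay network:
docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}} {{end}}' mynet
# If the two ranges intersect, recreate the network with an explicit,
# non-conflicting --subnet as shown earlier in this thread.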

nathanleclaire avatar Feb 17 '17 01:02 nathanleclaire

@sanimej @aboch I'm curious your thoughts on the above ^^

nathanleclaire avatar Feb 17 '17 01:02 nathanleclaire

Thanks @nathanleclaire for your suggestion. However, I am running different subnets for the AWS hosts and the overlay network.

The hosts are running in the 11.0.0.0/8 subnet. Here's the /etc/resolv.conf for one of the hosts that can't resolve DNS for containers running on it.

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 11.0.0.2
search us-west-2.compute.internal

The docker overlay network runs with the subnet 10.0.10.0/24. docker network inspect output...

[
    {
        "Name": "brain_net",
        "Id": "ip81kx5shqsenzsalo04oxpzk",
        "Created": "2017-02-17T01:34:40.16419404Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.10.0/24",
                    "Gateway": "10.0.10.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Containers": {
            "86473d760c0ab112adec455c8b65213734d35c8c26f1db0719d40a1f6fd6f61a": {
                "Name": "alpha_gpu_engineer.fk3f2buol2b6azvqap8pdhzup.xatrh8z4dmxncqztiz9i85ls5",
                "EndpointID": "5af3262a75990cda4f6354aa194b57b47f2f888f8e507f6bbfa5e609e2f7490c",
                "MacAddress": "02:42:0a:00:0a:0a",
                "IPv4Address": "10.0.10.10/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4097"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "ip-11-0-12-39-aab579cb2196",
                "IP": "11.0.12.39"
            }
        ]
    }
]

Here's the /etc/resolv.conf for one of the containers in the host:

search us-west-2.compute.internal
nameserver 127.0.0.11
options ndots:0

To rule out AWS SGs, I've also completely opened all ports for both UDP and TCP, incoming and outgoing, for the security group all of the nodes run in.

glorious-beard avatar Feb 17 '17 01:02 glorious-beard

@vasily-kirichenko

So my question is: is my assumption valid that services should be discoverable across the entire Swarm? I mean, the name of a service should be DNS resolvable from any other container in this Swarm, no matter where the containers are running.

Yes, they will be discoverable by any container no matter where it is running as long as it is connected to the same network.

@zen-chetan

Here's the /etc/resolv.conf for one of the hosts that can't resolve DNS for containers running on it.

Not sure if you meant that, but if you were expecting to be able to resolve the service name from the host, that is not possible. The service name is only discoverable from inside the swarm networks the service is attached to.

For the rest, I only have some generic comments:

As @dperny suggested, in order for the network control-plane info (like the internal DNS records) to spread across the cluster, please make sure both tcp/7946 and udp/7946 are open on each and every node and that security group rules allow them.

Your system will be subject to the overlay/host subnet conflict that @nathanleclaire mentioned only if you see vx-<ID> named interfaces on the hosts where a container is running on an overlay network. If no vx-<ID> named interfaces are there, then your overlay network subnet can safely overlap with the hosts' VPC subnet.

When things do not work with stack deploy, try creating the docker services manually to see whether or not the problem is specific to docker stack.

aboch avatar Feb 17 '17 05:02 aboch

Thank you @aboch. My intention in showing the host's /etc/resolv.conf was to demonstrate that the host and the name server it uses do not seem to overlap the docker overlay network's subnet.

Regarding vx-<ID> interfaces, I see a lot of veth* interfaces created when everything is running, with a different number of interfaces on each host in the cluster, ranging from as few as one to as many as 13. These interfaces are present both on the host and in a container started with the docker stack deploy command. How do I check for these vx-<ID> interfaces?

glorious-beard avatar Feb 17 '17 06:02 glorious-beard

@zen-chetan

My intention in showing the host's /etc/resolv.conf was to demonstrate that the host and the name server it uses do not seem to overlap the docker overlay network's subnet.

Ah I see, thanks. But given that swarm networks are global-scope networks, the overlap check is not run for their subnets. This is why the issue can arise on kernels which do not support creating the vxlan interface in a separate netns. Libnetwork detects whether the kernel supports that feature; if it does not, it creates the vxlan interfaces (one per subnet per overlay network) in the host namespace, with names vx-....

How do I check for these vx-<ID> interfaces?

If you do not see any of those in the ip link output, then you do not need to worry about which subnet was chosen for the overlay network. Just make sure this is true for all the hosts the overlay network spans.
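
Concretely, something like this on each host (a sketch):

# vxlan interfaces created in the host namespace show up as vx-<...>;
# their presence means the kernel lacks per-netns vxlan support, so the
# overlay subnet must not overlap the host/VPC subnet:
ip -o link show | grep 'vx-'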

I see a lot of veth* interfaces created when everything is running,

Yes, those are the interfaces connecting each container on the overlay network with the docker_gwbridge network, to provide outside-world connectivity to the containers.

aboch avatar Feb 17 '17 17:02 aboch

So I'm still stumped by this...

Given three AWS nodes running in a private VPC subnet with the security group set to allow all traffic in and out on all ports, both UDP and TCP, on the subnet 11.0.0.0/8, I still cannot obtain the IP addresses of services running on other nodes in the docker swarm. Any service running on a node in the swarm can get the IP addresses of services running on the same node.

How to reproduce... 1 - Create an attachable network (docker-compose version 3 files still don't support attachable overlay networks):

docker network create --driver overlay --attachable --subnet 192.168.1.0/24 alpha_net

2 - Start the following docker-compose file with docker stack deploy --compose-file=docker-compose.yml alpha. This is a stripped-down sample that creates a Consul cluster; I've left out some of the other services from the compose file.

version: "3.1"

services:

  # Consul server 1 of 3
  consul1:
    image: consul:0.7.5
    command: agent -bind=0.0.0.0 -client=0.0.0.0 -advertise='{{ GetAllInterfaces | include "network" "192.168.1.0/24" | attr "address" }}' -log-level=INFO -node=config-server-1 -server -bootstrap-expect=3 -rejoin -retry-join=consul1 -retry-join=consul2 -retry-join=consul3
    environment:
      SERVICE_8500_IGNORE: "true"
      SERVICE_8300_IGNORE: "true"
      SERVICE_8301_IGNORE: "true"
      SERVICE_8302_IGNORE: "true"
      SERVICE_8400_IGNORE: "true"
      SERVICE_8600_IGNORE: "true"
    networks:
      - alpha_net
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - 'node.labels.cpu == enabled'

  # Consul server 2 of 3
  consul2:
    image: consul:0.7.5
    command: agent -bind=0.0.0.0 -client=0.0.0.0 -advertise='{{ GetAllInterfaces | include "network" "192.168.1.0/24" | attr "address" }}' -log-level=INFO -node=config-server-2 -server -bootstrap-expect=3 -rejoin -retry-join=consul1 -retry-join=consul2 -retry-join=consul3
    environment:
      SERVICE_8500_IGNORE: "true"
      SERVICE_8300_IGNORE: "true"
      SERVICE_8301_IGNORE: "true"
      SERVICE_8302_IGNORE: "true"
      SERVICE_8400_IGNORE: "true"
      SERVICE_8600_IGNORE: "true"
    networks:
      - alpha_net
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - 'node.labels.cpu == enabled'

  # Consul server 3 of 3
  consul3:
    image: consul:0.7.5
    command: agent -bind=0.0.0.0 -client=0.0.0.0 -advertise='{{ GetAllInterfaces | include "network" "192.168.1.0/24" | attr "address" }}' -log-level=INFO -node=config-server-3 -server -bootstrap-expect=3 -rejoin -retry-join=consul1 -retry-join=consul2 -retry-join=consul3
    environment:
      SERVICE_8500_IGNORE: "true"
      SERVICE_8300_IGNORE: "true"
      SERVICE_8301_IGNORE: "true"
      SERVICE_8302_IGNORE: "true"
      SERVICE_8400_IGNORE: "true"
      SERVICE_8600_IGNORE: "true"
    networks:
      - alpha_net
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - 'node.labels.cpu == enabled'
networks:
  alpha_net:
    external: true

The above fails, since the container running consul1 cannot resolve consul2 and consul3 into IP addresses.

    2017/02/22 01:58:58 [INFO] agent: (LAN) joining: [consul1 consul2 consul3]
    2017/02/22 01:58:58 [WARN] memberlist: Failed to resolve consul2: lookup consul2 on 127.0.0.11:53: no such host
    2017/02/22 01:58:58 [WARN] memberlist: Failed to resolve consul3: lookup consul3 on 127.0.0.11:53: no such host
    2017/02/22 01:58:58 [INFO] agent: (LAN) joined: 1 Err: <nil>
    2017/02/22 01:58:58 [INFO] agent: Join completed. Synced with 1 initial agents
    2017/02/22 01:59:05 [ERR] agent: failed to sync remote state: No cluster leader

And, if I manually attach to the container for consul1 with docker exec -it <container_id> /bin/sh, I can nslookup services running on the same node, but not services running on a different node.

/ # nslookup consul1
Name:      consul1
Address 1: 192.168.1.31 ip-192-168-1-31.us-west-2.compute.internal
/ # nslookup consul2
nslookup: can't resolve 'consul2': Name does not resolve
/ # nslookup consul3
nslookup: can't resolve 'consul3': Name does not resolve
/ # nslookup docdb
nslookup: can't resolve 'docdb': Name does not resolve
/ # nslookup userdb
Name:      userdb
Address 1: 192.168.1.37 ip-192-168-1-37.us-west-2.compute.internal

(userdb in the list above is another service in the compose file... left out for brevity's sake)

I can reach the name server at 127.0.0.11 just fine inside the container for consul1, but it seems as if IP addresses for services running on other nodes aren't getting synchronized in the swarm network.

glorious-beard avatar Feb 22 '17 02:02 glorious-beard

One more data point... ~~creating the above services in the docker-compose file manually with service create --name XXX calls does permit cross-node DNS IP resolution.~~

If I manually create the services with service create --name X --network alpha_net, I sometimes (not consistently) see the same behavior.

glorious-beard avatar Feb 22 '17 18:02 glorious-beard

I see the same issue on AWS. Does anyone have any recommendations or a workaround?

augmento avatar Sep 07 '17 18:09 augmento

I'm seeing similar results, also without the AWS stuff. For me, I only have a master node (at a hoster) and one worker node (a home Linux box with a static IP). There are 7 containers distributed between them. I've checked swarm port 7946 TCP, and it is reachable on the hoster and on my Linux box using the external host IPs.

Distribution works as expected, but the containers on the Linux box can't look up the names of the containers at the hoster, while the other way around works. If I inspect the nodes and try to ping the IPs instead of the names from within the containers, that doesn't work either. The funny thing is that the containers on each node can ping the other containers on the same node. I'm not using any special/additional network, just deploying the stack via: docker stack deploy -c docker-compose.yml

I've read not to use ping, but ping works on the same node. I also tried nslookup, without luck.

The two nodes are running Ubuntu 16.04 as the host OS with the latest (17.06.2-ce) Docker version, on the 4.4.0-93-generic and 4.4.0-87-generic Ubuntu kernels.

I'm a bit lost, like the other guys here, as to where to look further.

vguna avatar Sep 07 '17 23:09 vguna

@zen-chetan what if you do not create the network before running docker stack deploy? In my production-stack.yml there is no network definition, neither within the services nor at the top level. When deploying the stack, an overlay network is created by docker:

root@docker1:/data/monitoring# cat network.yml 
version: '3.3'

services:

  influxdb:
    image: influxdb
    hostname: monitoring-influxdb
    volumes:
      - /data/monitoring/data/influxdb/var-lib-influxdb:/var/lib/influxdb
      - /etc/localtime:/etc/localtime:ro
        
  telegraf:
    image: telegraf
root@docker1:/data/monitoring# docker stack deploy -c network.yml networking
Creating network networking_default
Creating service networking_telegraf
Creating service networking_influxdb
root@docker1:/data/monitoring# docker network ls | grep networking_default
k18skhvdiwgh        networking_default    overlay             swarm

//edited

# docker --version
Docker version 17.07.0-ce, build 8784753

rdxmb avatar Sep 13 '17 08:09 rdxmb

Our product requires an attachable overlay network, which isn't supported in the docker compose yml file, AFAICT.

glorious-beard avatar Sep 13 '17 08:09 glorious-beard

After I made sure that all 3 hosts had docker-ce installed, with swarm, and used docker service create to launch containers, I was able to reach the containers across hosts. Ping using the container name also worked across hosts. I am not using docker stack deploy; I created the overlay network and used the same network name when launching the containers with service create. I still need to resolve a few issues related to making certain services talk to each other (which may be related to publishing ports, etc.), but I think I have crossed the hurdle I faced with cross-container communication, which I believe was due to a docker version mismatch across hosts.

augmento avatar Sep 13 '17 16:09 augmento

I got it working now. Here are some insights that may help others:

  • Don't try to use Docker for Windows to get a multi-node mesh network (swarm) running. It's simply not (yet) supported. If you google around, you'll find some Microsoft blogs about it; the Docker documentation also mentions it somewhere. It would be nice if the docker command itself printed an error/warning when you try to set up something under Windows that simply doesn't work. It does work on a single node, though.
  • Don't try to use Linux in a VirtualBox under Windows hoping to work around it. That, of course, doesn't work either, since it has the same limitations as the underlying Windows.
  • Make sure you open at least ports 7946 tcp/udp and 4789 udp for worker nodes. For the master, also open 2377 tcp. Use e.g. netcat -vz -u for the UDP check, and the same without -u for TCP (see the sketch after this list).
  • Make sure to pass --advertise-addr on the docker worker node (!) when executing the join swarm command. Put the external IP address of the worker node here, the one that has the mentioned ports open. Double-check that the ports are really open!
  • Using ping to check DNS resolution for container names works. Forgetting --advertise-addr or not opening port 7946 results in DNS resolution not working on worker nodes!
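
Put together, the checks and the join from this list look roughly like the following (a sketch; addresses and the token are placeholders):

# From any other node, check the swarm ports on a worker/master:
nc -vz <node-ip> 7946       # gossip, tcp
nc -vzu <node-ip> 7946      # gossip, udp (result only indicative)
nc -vzu <node-ip> 4789      # vxlan data plane, udp
nc -vz <master-ip> 2377     # cluster management, tcp (master only)

# On the worker, join while advertising its own external IP, not just
# the manager address baked into the generated join command:
docker swarm join --token <worker-token> \
  --advertise-addr <worker-external-ip> \
  <master-ip>:2377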

My main fault was using Windows and not specifying --advertise-addr, since I thought the master's IP address was already specified correctly by the generated join-token command. But it's important to specify the worker node's own IP as well on join!

I hope that helps someone. Most of the stuff is mentioned in the documentation and here in the comments. But only the combination of the mentioned points worked for me.

BTW: I've tested this with docker-compose v3.3 syntax and deployed it via docker stack deploy with the default overlay network. The kernel I used was the Ubuntu 16.04 LTS 4.4.0-93-generic kernel.

vguna avatar Sep 13 '17 23:09 vguna

Having similar trouble connecting services between hosts. If I am in a container on a worker node and use netcat -vz to try to connect to the manager node's host and port, I get the following error:

root@adc78cf2c38d:/# netcat -vz cordoba.<company>.com 8786 
DNS fwd/rev mismatch: cordoba.<company>.com != <ip-address>-static.hfc.comcastbusiness.net 
cordoba.company.com [<ip-address>] 8786 (?) open

Values in <> are anonymized. cordoba.<company>.com is the manager node host. Are there some external network changes that I need to make to get swarm to work?

rrtaylor avatar Sep 15 '17 14:09 rrtaylor

The netcat was meant for testing the open ports on the master and worker hosts, not the containers. I haven't tried whether they are also accessible from inside the containers. I didn't have to change or specify any network settings at all; the defaults worked fine for me (via docker-compose).

BTW: what is port 8786? What OSes are you using?

vguna avatar Sep 15 '17 20:09 vguna

I ran into this issue, and got it working with @vguna's tips. In particular, I had to set the --advertise-addr on my worker node to the external IP.

My concern is that, while my manager node has a fixed IP, my worker nodes have dynamic IPs. According to the docs this should be fine, and I've confirmed that the manager has no problem switching to the new node IP when it changes. So when the worker's IP changes, the manager will still see the node as healthy and assign tasks to it, but the advertise address will be the old address, so those containers will be unreachable from other nodes.
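
A workaround sketch for the dynamic-IP case (assumes briefly removing the node from the swarm is acceptable; the token and addresses are placeholders):

# On the worker whose IP changed: leave the swarm so it can re-join
# with the new address advertised.
docker swarm leave

# On a manager: print the current worker join command/token.
docker swarm join-token worker

# Back on the worker: re-join, advertising the new external IP.
docker swarm join --token <worker-token> \
  --advertise-addr <new-worker-ip> \
  <manager-ip>:2377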

rubidot avatar Jan 28 '18 16:01 rubidot