swarmkit
Swarm with multiple ingress networks
I noticed that my swarm has 2 ingress networks:
docker network ls
NETWORK ID NAME DRIVER SCOPE
1aaeb441a06b bridge bridge local
81d942ace568 docker_gwbridge bridge local
03a8cf90b847 host host local
8c6oqwchdzvf ingress overlay swarm
mfoezf9fniby ingress overlay swarm
2918a2ddc532 none null local
I think as a consequence, one of my services fails to start - it remains in state "starting" forever, and when I docker inspect the corresponding container, it says:
"State": {
"Status": "created",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 128,
"Error": "network 8c6oqwchdzvfgayqkzsq3be7m not found",
"StartedAt": "0001-01-01T00:00:00Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
So it seems that it fails to find one of the 2 ingress networks.
docker version
Client:
Version: 18.03.0-ce
API version: 1.37
Go version: go1.9.4
Git commit: 0520e24
Built: Wed Mar 21 23:05:35 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.03.0-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.4
Git commit: 0520e24
Built: Wed Mar 21 23:14:32 2018
OS/Arch: linux/amd64
Experimental: true
- Is it normal to have multiple ingress networks and, if not, how can this happen?
- How can I resolve this for my current swarm?
EDIT 1
I checked my other running services and none of them uses ingress network mfoezf9fniby. So I tried docker network rm mfoezf9fniby, but this fails with Error response from daemon: network mfoezf9fnibyov8ps098ngvjy not found. After that, running docker network ls still shows the 2 ingress networks.
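Checking which networks each service is attached to can be scripted; the following is a sketch using the VirtualIPs entries from docker service inspect (it assumes a running swarm manager, and the loop over all service IDs is my addition):

```shell
# Sketch: for each service, print the network IDs its VIPs are attached to
# (the ingress network shows up here when ports are published).
# Requires a running Docker daemon on a swarm manager.
for svc in $(docker service ls -q); do
  echo "$svc: $(docker service inspect -f '{{range .Endpoint.VirtualIPs}}{{.NetworkID}} {{end}}' "$svc")"
done
```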
EDIT 2
Running docker network ls on a different node only lists 1 ingress network (network mfoezf9fniby is gone). So it seems that the node on which the service task fails has stale data?
Inspecting docker.log on the corrupt node constantly shows the following entries:
May 19 14:28:33 moby root: time="2018-05-19T14:28:33.651593661Z" level=warning msg="error locating sandbox id f3ce58d7eccbbd270959f73e141818b2310ffff199704e7a2a308b42e5903a89: sandbox f3ce58d7eccbbd270959f73e141818b2310ffff199704e7a2a308b42e5903a89 not found"
May 19 14:28:33 moby root: time="2018-05-19T14:28:33.652546451Z" level=error msg="fc016a345607573568b64824f6a40dcc2226b4620641b5cad8613558d92d5809 cleanup: failed to delete container from containerd: no such container"
I tried docker rm -f fc016a345607573568b64824f6a40dcc2226b4620641b5cad8613558d92d5809, which completed successfully. It turns out that this container was the service task that had been in the starting state forever. The service deployment then automatically picked a different node and launched a new service task, but again the service could not be started. I ran docker network ls on the newly picked node and again 2 ingress networks were shown (both with the same IDs as on the original node). And again, the service could not be started.
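Stuck tasks like the one above can be located with a status filter before force-removing them; a minimal sketch (assuming a running Docker daemon; the rm line is deliberately left as a comment with a placeholder ID):

```shell
# Sketch: list containers stuck in the "created" state, as the failing
# task above was, then force-remove them by ID.
docker ps -a --filter status=created --format '{{.ID}} {{.Names}} {{.Status}}'
# docker rm -f <id-from-the-output-above>
```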
I should also mention that I am using docker-for-aws - don't know if that matters.
ping @ctelfer
I was able to resolve this as follows:
- I completely removed the stack with the failing service using docker stack rm.
- I terminated the 2 nodes that showed the 2 ingress networks in the output of docker network ls.
- I then tried to recreate the stack using docker stack deploy. This time, the service creation failed with Error response from daemon: network <service-name>_default not found.
- I added a custom network to the service definition to avoid hitting the default network.
- Now service creation succeeded, but again the service did not start, and 2 ingress networks showed up on the node that ran the service task.
- Again, I completely removed the corresponding stack.
- I ran docker network prune, which apparently deleted an existing network called <service-name>_default.
- I removed the node that showed the 2 ingress networks.
- I recreated the stack and this time everything worked fine.
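The steps above can be condensed into a rough command sequence; this is only a sketch - the stack name, compose file, and node ID are placeholders, and both docker stack rm and docker node rm are destructive:

```shell
# Sketch of the recovery described above; my_stack, stack.yml and
# BAD_NODE are placeholders -- substitute your own values.
BAD_NODE=xxxxxxxxxxxx                      # node ID showing 2 ingress networks
docker stack rm my_stack                   # remove the stack with the failing service
docker network prune -f                    # drop stale networks such as <service-name>_default
docker node rm --force "$BAD_NODE"         # remove the corrupted node (run on a manager)
docker stack deploy -c stack.yml my_stack  # redeploy
```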
I recall there was an issue in the past where nodes upgraded from an old version did not have the "ingress" attribute set on the ingress network; were these existing nodes, and upgraded from an older version of docker (and if so, do you know what version?)
@thaJeztah No I performed a CloudFormation Stack Update which means that all old nodes are replaced. Also, the old nodes ran the same docker version as the new ones.
To answer the first question, no there should definitely not be two ingress networks present at the same time.
My first thought was that this had something to do with some kind of incomplete restoration of the ingress network after a dockerd restart. My second thought was that, since docker network ls only showed 1 ingress network on other (worker?) nodes, the extra ingress network was one restored from a previous run, but one which swarm had no knowledge of, leading swarm to create a fresh one on the manager node and the other nodes. I would be curious whether both "ingress" networks (on the nodes that have 2 ingress networks) are marked as "Ingress: true" in docker network inspect.
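That check can be run in one pass over all swarm-scoped overlay networks; a sketch (the loop and format string are my own, assuming a running Docker daemon on the affected node):

```shell
# Sketch: print the Ingress flag of every overlay network on this node.
for net in $(docker network ls --filter driver=overlay -q); do
  docker network inspect -f '{{.Name}} ({{.ID}}): ingress={{.Ingress}}' "$net"
done
```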
From @Mobe91 's last comments, it sounds like something needed to be pruned whether it was FOO_default or ingress.
My second thought was that since docker network ls only showed 1 ingress network on other (worker?) nodes
It was other manager nodes that showed 1 ingress network. I think I did not check the worker nodes.
I would be curious whether both "ingress" networks (on the nodes that have 2 ingress networks) are marked as "Ingress: true" in docker network inspect.
Unfortunately, I did not save the output of docker network inspect
when this happened...
Hi,
This seems to be the same bug as mine with Docker 18.06.1.
Maybe I should have opened it here: https://github.com/docker/for-linux/issues/424
I encountered the same problem, and after a few tests it seems to be caused by a network with a duplicated name and local scope.
It can be reproduced on a fresh Ubuntu 18.04 install:
root@server:~# docker --version
Docker version 18.09.5, build e8ff056
root@server:~# docker swarm init
Swarm initialized: current node (t8qsjoaroynwsxeq9la4f0i5b) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-24qa1rusq46mmgjah41z8pvtyghnlrz9g3u7q49keol2p0r5te-9ek99ggjlgrtnd2iiqy5h44bw 51.15.155.120:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
root@server:~# docker network create --scope=local test_stack_default
02a5d86ad0fa904c268efa3d0debe7efd83d9438eca7288372406c118bce36c4
root@server:~# docker network create --scope=swarm additional
nf74b2kya75mui7zzj5ofdqs8
root@server:~# cat test_stack.yml
version: '3.4'
services:
  test_service:
    image: "traefik"
    networks:
      - default
      - additional
networks:
  additional:
    external: true
root@server:~# docker stack deploy -c test_stack.yml test_stack
Creating network test_stack_default
Creating service test_stack_test_service
root@server:~# docker network ls
NETWORK ID NAME DRIVER SCOPE
nf74b2kya75m additional bridge swarm
8d04df932427 bridge bridge local
7c3601152798 docker_gwbridge bridge local
150cf9c0525f host host local
xqz9vq824pa1 ingress overlay swarm
560fc2bccd2e none null local
02a5d86ad0fa test_stack_default bridge local
lrp4mt5zjwmf test_stack_default overlay swarm
root@server:~# docker service ps test_stack_test_service --no-trunc
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
8je5ervdw0udo880i4ryo0ri9 test_stack_test_service.1 traefik:latest@sha256:02cfdb77b0cd82d973dffb3dafe498283f82399bd75b335797d7f0fe3ebeccb8 server Running Running 1 second ago
root@server:~# service docker restart
root@server:~# docker service ps test_stack_test_service --no-trunc
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
kltfyg26kiydgoextni5kt7kb test_stack_test_service.1 traefik:latest@sha256:02cfdb77b0cd82d973dffb3dafe498283f82399bd75b335797d7f0fe3ebeccb8 server Ready Rejected 4 seconds ago "network nf74b2kya75mui7zzj5ofdqs8 exists"
t1ux7w6ulsildjqwut3mysrp1 \_ test_stack_test_service.1 traefik:latest@sha256:02cfdb77b0cd82d973dffb3dafe498283f82399bd75b335797d7f0fe3ebeccb8 server Shutdown Rejected 9 seconds ago "network nf74b2kya75mui7zzj5ofdqs8 exists"
8je5ervdw0udo880i4ryo0ri9 \_ test_stack_test_service.1 traefik:latest@sha256:02cfdb77b0cd82d973dffb3dafe498283f82399bd75b335797d7f0fe3ebeccb8 server Shutdown Complete 9 seconds ago
root@server:~#
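The duplicated test_stack_default name in the network ls output above can be spotted mechanically; a sketch (assuming a running Docker daemon and standard sort/uniq):

```shell
# Sketch: print any network name that appears more than once on this node.
docker network ls --format '{{.Name}}' | sort | uniq -d
```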
This just happened to me again.
@ctelfer
I would be curious whether both "ingress" networks (on the nodes that have 2 ingress networks) are marked as "Ingress: true" in docker network inspect.
I checked this time. Both ingress networks are marked as "Ingress: true" in the output of docker network inspect.
FYI @cypx: I checked with 19.03.5, still the same behavior as in your result.