Docker Swarm, Upgrade from 18.03.1 to 18.06.1 - Ghost networks when removing stacks
Opened initially as https://github.com/docker/for-linux/issues/424 but I think it's more appropriate here
Expected behavior
Hello, I have seen a weird behaviour since upgrading from Docker 18.03.1 to 18.06.1 and Traefik 1.6.4 to 1.6.6:
• I deploy Docker stacks via a docker-compose.yml file, which creates one network per stack, and I connect traefik to that network (via docker service update --network-add <network> <traefik>). So far this worked well.
• When I deploy a new version of the stack, I remove traefik from the network (docker service update --network-rm <network> <traefik>) and destroy the stack (docker stack rm ...). This removed the stack and the network. I had to wait a little before the network was really deleted, but it worked. Then I could deploy a new version of the stack and add traefik back to the network.
I do not update services in place, as there are some DBs in the stack for which I may need to run migrations. They need to be run only once.
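The wait before the network is really deleted can be made explicit by polling rather than sleeping for a fixed time. This is only a sketch: wait_for_network_removal and its timeout and interval defaults are hypothetical, and DOCKER_BIN simply mirrors the variable used in the deployment script from this report.

```shell
# DOCKER_BIN mirrors the variable from the deployment script; it defaults to plain docker here.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# Hypothetical helper: poll `docker network ls` until the named network is gone.
# Arguments: network name, timeout in seconds (assumed default 60),
# polling interval in seconds (assumed default 2).
# Returns 0 once the network is no longer listed, 1 if it is still there at the timeout.
wait_for_network_removal() {
  local network="$1" timeout="${2:-60}" interval="${3:-2}" waited=0
  # grep -Fx matches the whole line literally, so "foo_network" does not match "foo_network2"
  while $DOCKER_BIN network ls --format '{{.Name}}' | grep -Fxq -- "$network"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "network $network still present after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
```

In the deployment script, a call such as wait_for_network_removal "${COMPOSE_PROJECT_NAME}_network" 60 could stand in for the fixed sleep. It would not cure a ghost network, but it would make the script fail early instead of failing at stack deploy.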
Actual behavior
When I redeploy the stack, I do:
# If the stack already exists: detach traefik, remove the stack, prune leftover networks
if [ $($DOCKER_BIN stack ls --format "{{.Name}}" |grep ^${COMPOSE_PROJECT_NAME}$ |wc -l) -ne "0" ]; then
  $DOCKER_BIN service update --network-rm ${COMPOSE_PROJECT_NAME}_network traefik_traefik
  $DOCKER_BIN stack rm ${COMPOSE_PROJECT_NAME}
  sleep 20
  $DOCKER_BIN network prune --force
fi
# Deploy the new version of the stack
$DOCKER_BIN stack deploy --compose-file docker-compose.yml --with-registry-auth ${COMPOSE_PROJECT_NAME}
# Reconnect traefik to the stack network
if [[ ${TRAEFIK} == "docker" ]]; then
  $DOCKER_BIN service update --network-add ${COMPOSE_PROJECT_NAME}_network traefik_traefik
fi
but output is now:
...
Creating service <stack>_backoffice
failed to create service <stack>_backoffice: Error response from daemon: network <stack>_network not found
• But since the upgrade, the network is not properly destroyed, as traefik still seems to be in the network (checked via docker network inspect <network>)
I need to run docker stack rm traefik so that the ghost network vanishes as expected, and then redeploy traefik from scratch and reconnect it to the other instances. Any idea on this?
I tried downgrading Traefik back to 1.6.4 to see whether the issue comes from the Docker side or the Traefik side. => The results are the same with 1.6.4.
I also noticed that the stacks I have deployed so far do not set the overlay network as attachable (it was not required for the last 6 months). Should I add it? => The result is the same with an attachable network.
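To check whether a removed stack left a network behind, and what is still attached to it, something like the following could be used. It is a sketch only: the function names are hypothetical, and it relies on the com.docker.stack.namespace label that docker stack deploy puts on the networks it creates.

```shell
# DOCKER_BIN mirrors the variable from the deployment script; it defaults to plain docker here.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# Hypothetical helper: list networks that still carry the label of the given stack.
leftover_stack_networks() {
  local stack="$1"
  $DOCKER_BIN network ls \
    --filter "label=com.docker.stack.namespace=${stack}" \
    --format '{{.Name}}'
}

# Hypothetical helper: print "<key> <endpoint name>" for every entry in the
# Containers section of `docker network inspect`, as seen from the current node.
network_endpoints() {
  local network="$1"
  $DOCKER_BIN network inspect \
    --format '{{range $id, $c := .Containers}}{{$id}} {{$c.Name}}{{"\n"}}{{end}}' \
    "$network"
}
```

An empty result from leftover_stack_networks after docker stack rm would mean the network was cleaned up. Any line printed by network_endpoints shows what is still holding it.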
Steps to reproduce the behavior
- Having a swarm cluster
- Having a docker-compose file with a given network
- Deploy stack
- Connect traefik to it
- Remove traefik from it
- Destroy stack
- Network is still present
sudo docker network inspect <instance_network>
[
    {
        "Name": "<instance_network>",
        "Id": "11xjt1yz38o5vtw40qbzldorx",
        "Created": "2018-08-29T16:58:57.674468644+02:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.49.0/24",
                    "Gateway": "10.0.49.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "lb-<instance_network>": {
                "Name": "<instance_network>-endpoint",
                "EndpointID": "eda2b043adb5e211003258e553a0368d5a2f306245c7609793164a8bb3e5ebe7",
                "MacAddress": "02:42:0a:00:31:04",
                "IPv4Address": "10.0.49.4/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4123"
        },
        "Labels": {
            "com.docker.stack.namespace": "<instance>"
        },
        "Peers": [
            {
                "Name": "517efbb11671",
                "IP": "172.16.0.5"
            }
        ]
    }
]
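The reproduction steps listed above could be scripted roughly as follows. This is a sketch under assumptions, not the exact commands that were run: reproduce_ghost_network, its arguments and SETTLE_SECONDS are hypothetical, and the network is assumed to be named <stack>_network as in the deployment script.

```shell
# DOCKER_BIN mirrors the variable from the deployment script; it defaults to plain docker here.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# Hypothetical wrapper around the reproduction steps.
# Arguments: stack name, compose file, name of the proxy service (e.g. traefik_traefik).
# Returns 0 if the stack network is still listed after the stack was removed
# (the problem reproduced), non-zero otherwise.
reproduce_ghost_network() {
  local stack="$1" compose_file="$2" proxy_service="$3"
  local network="${stack}_network"   # assumption: the compose file calls its network "network"

  $DOCKER_BIN stack deploy --compose-file "$compose_file" "$stack" || return 1
  $DOCKER_BIN service update --network-add "$network" "$proxy_service" || return 1
  $DOCKER_BIN service update --network-rm "$network" "$proxy_service" || return 1
  $DOCKER_BIN stack rm "$stack" || return 1

  # Give swarm some time to tear things down (assumed default: 20 seconds, as in the script).
  sleep "${SETTLE_SECONDS:-20}"

  $DOCKER_BIN network ls --format '{{.Name}}' | grep -Fxq -- "$network"
}
```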
Output of docker version:
Client:
Version: 18.06.1-ce
API version: 1.38
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:24:56 2018
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.06.1-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:23:21 2018
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 129
Running: 123
Paused: 0
Stopped: 6
Images: 129
Server Version: 18.06.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: xqm2e3fld2vem3xali1795ug1
Is Manager: true
ClusterID: 7pf90t57w3oog500hniyt9rgr
Managers: 1
Nodes: 4
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 172.16.0.5
Manager Addresses:
172.16.0.5:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-134-generic
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 62.82GiB
Name: swarm1.*******.net
ID: SAAO:VFA7:YFS4:23ZK:TETY:LINA:ZOFO:URPG:5JYE:3SMU:YP4A:DZZV
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
4 VMs, each with 8 vCPUs / 64 GB RAM
By externalising the network creation to the CLI side and referencing only an external network in the docker-compose file, it works again as expected. I create the network only once, attach traefik to it, and that's it.
I modified my deployment script as follows:
# CHANGED - If stack exists - remove it
if [ $($DOCKER_BIN stack ls --format "{{.Name}}" |grep ^${COMPOSE_PROJECT_NAME}$ |wc -l) -ne "0" ]; then
  # $DOCKER_BIN service update --network-rm ${COMPOSE_PROJECT_NAME}_network traefik_traefik
  $DOCKER_BIN stack rm ${COMPOSE_PROJECT_NAME}
  sleep 20
  $DOCKER_BIN network prune --force
fi
# ADDED - Create network
if [ $($DOCKER_BIN network ls --format "{{.Name}}" |grep ^${COMPOSE_PROJECT_NAME}_network$ |wc -l) -eq "0" ]; then
  $DOCKER_BIN network create -d overlay --attachable ${COMPOSE_PROJECT_NAME}_network
fi
# Run stack
$DOCKER_BIN stack deploy --compose-file docker-compose.yml --with-registry-auth ${COMPOSE_PROJECT_NAME}
if [[ ${TRAEFIK} == "docker" ]]; then
  # ADDED Check if Traefik is already in the network or not
  network=`$DOCKER_BIN network ls --no-trunc |grep ${COMPOSE_PROJECT_NAME} |awk '{print $1}' |wc -l`
  if [[ $network -ne "0" ]]; then
    # Network exists, if Traefik is not already added in the network, add it - do nothing otherwise
    if [ $($DOCKER_BIN service inspect traefik_traefik --format="{{json .Spec.TaskTemplate.Networks}}" | grep `$DOCKER_BIN network ls --no-trunc |grep ${COMPOSE_PROJECT_NAME}_network |awk '{print $1}'` |wc -l) -eq "0" ]; then
      $DOCKER_BIN service update --network-add ${COMPOSE_PROJECT_NAME}_network traefik_traefik
    fi
  fi
fi
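The membership check in the script above (is traefik already attached to the network?) could also be pulled out into small functions. This is a hedged sketch: network_id and service_on_network are hypothetical names, and it assumes, like the script, that .Spec.TaskTemplate.Networks refers to networks by ID.

```shell
# DOCKER_BIN mirrors the variable from the deployment script; it defaults to plain docker here.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# Hypothetical helper: print the full ID of a network, or nothing if it does not exist.
network_id() {
  $DOCKER_BIN network inspect --format '{{.Id}}' "$1" 2>/dev/null
}

# Hypothetical helper: succeed if the service spec already lists the network.
# Returns 1 both when the network does not exist and when the service is not attached.
service_on_network() {
  local service="$1" network="$2" id
  id="$(network_id "$network")" || return 1
  [ -n "$id" ] || return 1
  # Print one network target per line and look for an exact match on the ID
  $DOCKER_BIN service inspect \
    --format '{{range .Spec.TaskTemplate.Networks}}{{.Target}}{{"\n"}}{{end}}' \
    "$service" | grep -Fxq -- "$id"
}

# Possible use in the deployment script:
#   service_on_network traefik_traefik "${COMPOSE_PROJECT_NAME}_network" \
#     || $DOCKER_BIN service update --network-add "${COMPOSE_PROJECT_NAME}_network" traefik_traefik
```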
Seems close to https://github.com/docker/swarmkit/issues/2637 but I do not create two ingress networks. I only have one and create only overlay networks.
But the issue remains, as I can't remove the network at the end: this only fixes the ability to redeploy my stack on a given network by making the network permanent. If I try to delete it after removing traefik, it remains in this ghost state until I remove the traefik stack.
And for a docker-compose file:
version: '3.6'
services:
  nginx:
    image: docker-hub.admin.bigcorp.net/nginx-nginx:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-nginx
    depends_on:
      - wordpress
      - b2c
      - backoffice
      - batches
      - nginx-static
      - nginx-backoffice
      - nginx-batches
      - nginx-reports
      - nginx-data
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
          - docker.bigcorp.com
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"
        traefik.frontend.redirect.entryPoint: "https"
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.auth.basic: "${FRONT_HTPASSWD}"
        traefik.frontend.passHostHeader: "true"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
  nginx-static:
    image: docker-hub.admin.bigcorp.net/nginx-static:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-static
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
          - docker-static.bigcorp.com
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}-static.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"
        traefik.frontend.redirect.entryPoint: "https"
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.passHostHeader: "true"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
  b2c:
    image: docker-hub.admin.bigcorp.net/b2c:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-b2c
    depends_on:
      - db-all
      - cassandra
      - vault
    volumes:
      - type: bind
        source: /home/bigcorp/logs/${COMPOSE_PROJECT_NAME}/b2c
        target: /var/log/tomcat
      - type: bind
        source: /home/bigcorp/flux/${COMPOSE_PROJECT_NAME}
        target: /srv/tomcat/common_b2c/temp
    environment:
      assuremieux_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}.recette.bigcorp.com
      assuremieux_static: https://${COMPOSE_PROJECT_NAME}-static.recette.bigcorp.com
      cms_url: https://${COMPOSE_PROJECT_NAME}.recette.bigcorp.com/internal
      backoffice_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}-backoffice.recette.bigcorp.com
      graylog_host: graylog.lan.bigcorp.net
      statsd_prefix: com.bigcorp.${COMPOSE_PROJECT_NAME}
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
    deploy:
      labels:
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
      placement:
        constraints:
          - node.role == worker
  wordpress:
    image: docker-hub.admin.bigcorp.net/lf-wordpress:2018.1
    hostname: docker-${COMPOSE_PROJECT_NAME}-wordpress
    depends_on:
      - db-all
    deploy:
      labels:
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
  nginx-backoffice:
    image: docker-hub.admin.bigcorp.net/nginx-backoffice:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-nginx-backoffice
    depends_on:
      - backoffice
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
          - docker-backoffice.bigcorp.com
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}-backoffice.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"
        traefik.frontend.redirect.entryPoint: "https"
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.auth.basic: "${BACKOFFICE_HTPASSWD}"
        traefik.frontend.passHostHeader: "true"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
  backoffice:
    image: docker-hub.admin.bigcorp.net/backoffice:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-backoffice
    depends_on:
      - db-all
      - cassandra
      - vault
    volumes:
      - type: bind
        source: /home/bigcorp/logs/${COMPOSE_PROJECT_NAME}/backoffice
        target: /var/log/tomcat
      - type: bind
        source: /home/bigcorp/data/${COMPOSE_PROJECT_NAME}
        target: /srv/tomcat/data
    environment:
      backoffice_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}-backoffice.recette.bigcorp.com
      backoffice_static_baseurl: https://${COMPOSE_PROJECT_NAME}-backoffice.recette.bigcorp.com/public
      assuremieux_static: https://${COMPOSE_PROJECT_NAME}-static.bigcorp.com
      graylog_host: graylog.lan.bigcorp.net
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
    deploy:
      labels:
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
      placement:
        constraints:
          - node.role == worker
  nginx-batches:
    image: docker-hub.admin.bigcorp.net/nginx-batches:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-nginx-batches
    depends_on:
      - batches
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
          - docker-batches.bigcorp.com
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}-batches.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"
        traefik.frontend.redirect.entryPoint: "https"
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.auth.basic: "${BATCHES_HTPASSWD}"
        traefik.frontend.passHostHeader: "true"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
  batches:
    image: docker-hub.admin.bigcorp.net/batches:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-batches
    depends_on:
      - db-all
      - cassandra
      - vault
    environment:
      assuremieux_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}.recette.bigcorp.com
      assuremieux_static: https://${COMPOSE_PROJECT_NAME}-static.recette.bigcorp.com
      backoffice_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}-backoffice.recette.bigcorp.com
      batches_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}-batches.recette.bigcorp.com
      cms_url: https://${COMPOSE_PROJECT_NAME}.recette.bigcorp.com/internal
      graylog_host: graylog.lan.bigcorp.net
    volumes:
      - type: bind
        source: /home/bigcorp/logs/${COMPOSE_PROJECT_NAME}/batches
        target: /var/log/tomcat
      - type: bind
        source: /home/bigcorp/data/${COMPOSE_PROJECT_NAME}
        target: /srv/tomcat/data
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
    deploy:
      labels:
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
      placement:
        constraints:
          - node.role == worker
  vault:
    image: docker-hub.admin.bigcorp.net/vault:${COMPOSE_PROJECT_NAME}
    hostname: docker-vault
    deploy:
      labels:
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
  db-all:
    image: docker-hub.admin.bigcorp.net/db-all:${COMPOSE_PROJECT_NAME}
    user: "999"
    hostname: docker-${COMPOSE_PROJECT_NAME}-db
    volumes:
      - type: bind
        source: /home/bigcorp/mysql/${COMPOSE_PROJECT_NAME}
        target: /var/lib/mysql
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
          - db-benefit
          - db-wordpress
          - dbdev.lan.bigcorp.net
    deploy:
      labels:
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
  nginx-reports:
    image: docker-hub.admin.bigcorp.net/nginx-reports:${COMPOSE_PROJECT_NAME}
    hostname: nginx-reports
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
          - docker-reports.bigcorp.com
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}-reports.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"
        traefik.frontend.redirect.entryPoint: "https"
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.auth.basic: "${BATCHES_HTPASSWD}"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
    volumes:
      - type: bind
        source: /home/bigcorp/data/${COMPOSE_PROJECT_NAME}/reports
        target: /home/wwwbigcorp/www
  nginx-data:
    image: docker-hub.admin.bigcorp.net/nginx-data:${COMPOSE_PROJECT_NAME}
    hostname: nginx-data
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
          - docker-data.bigcorp.com
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}-data.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"
        traefik.frontend.redirect.entryPoint: "https"
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.auth.basic: "${BATCHES_HTPASSWD}"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
    volumes:
      - type: bind
        source: /home/bigcorp/data/${COMPOSE_PROJECT_NAME}
        target: /home/wwwbigcorp/www
  cassandra:
    image: docker-hub.admin.bigcorp.net/cassandra:${COMPOSE_PROJECT_NAME}
    hostname: cassandra
    user: "999"
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
          - cassandra.bigcorp.com
    deploy:
      labels:
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
      placement:
        constraints:
          - node.role == worker
    volumes:
      - type: bind
        source: /home/bigcorp/cassandra/${COMPOSE_PROJECT_NAME}
        target: /var/lib/cassandra
      - type: bind
        source: /home/bigcorp/logs/${COMPOSE_PROJECT_NAME}/cassandra
        target: /var/log/cassandra
networks:
  ${COMPOSE_PROJECT_NAME}_bigcorp:
    external: true
and initially the network was just:
networks:
  bigcorp:
    driver: overlay
Rolled back to Docker 18.03.1 and it works as expected. I hope this will be fixed in Docker 18.09 or later.
Just being curious here... did you check it on Docker 19.03?
@mvandermade we upgraded to docker 19.03 but kept the network management externally. We did not try to go back to the initial situation. As I'm no longer at this customer, I can't say more.
I still have this problem in 19.03. I think I will migrate to your solution @mvandermade; @nsteinmetz I may try the latest 20.00 version.