Docker Swarm, Upgrade from 18.03.1 to 18.06.1 - Ghost networks when removing stacks

Open nsteinmetz opened this issue 7 years ago • 8 comments

Initially opened as https://github.com/docker/for-linux/issues/424, but I think it is more appropriate here.

Expected behavior

Hello, I have been seeing weird behaviour since upgrading Docker from 18.03.1 to 18.06.1 and traefik from 1.6.4 to 1.6.6:

  • I deploy Docker stacks via a docker-compose.yml file, which creates one network per stack, and I then connect traefik to that network (via docker service update --network-add <network> <traefik>). So far this worked well.
  • When I deploy a new version of the stack, I remove traefik from the network (docker service update --network-rm <network> <traefik>) and destroy the stack (docker stack rm ...). This used to remove both the stack and the network. I had to wait a little before the network was really deleted, but it worked. I could then deploy a new version of the stack and add traefik back to the network.

I do not update the services in place because the stack contains some databases for which I may need to run migrations. They need to be run only once.

Actual behavior

When I redeploy the stack, I do:

if [ $($DOCKER_BIN stack ls --format "{{.Name}}" |grep ^${COMPOSE_PROJECT_NAME}$ |wc -l) -ne "0" ]; then
    $DOCKER_BIN service update --network-rm ${COMPOSE_PROJECT_NAME}_network traefik_traefik
    $DOCKER_BIN stack rm ${COMPOSE_PROJECT_NAME}
    sleep 20
    $DOCKER_BIN network prune --force
fi

$DOCKER_BIN stack deploy --compose-file docker-compose.yml --with-registry-auth ${COMPOSE_PROJECT_NAME}

if [[ ${TRAEFIK} == "docker" ]]; then
    $DOCKER_BIN service update --network-add ${COMPOSE_PROJECT_NAME}_network traefik_traefik
fi

but output is now:

...
Creating service <stack>_backoffice
failed to create service <stack>_backoffice: Error response from daemon: network <stack>_network not found

Since the upgrade, the network is not properly destroyed, as traefik still seems to be attached to it (according to docker network inspect <network>).

I need to run docker stack rm traefik for the ghost network to vanish as expected, then redeploy traefik from scratch and reconnect it to the other instances. Any idea about this?

I tried downgrading traefik back to 1.6.4 to see whether the problem comes from the Docker side or the traefik side: the results are the same with 1.6.4.

I also noticed that for the stacks I have deployed so far, I did not set the overlay network as attachable (it was not required for the last 6 months). Should I add it? The result is the same with an attachable network.
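
The fixed sleep 20 in the script above could be replaced by polling until the network has really disappeared. The following is only a sketch: wait_for_network_removal and its 60-second default timeout are hypothetical names and values, not part of the original script, and DOCKER_BIN is assumed to point at the docker CLI as in that script. With the ghost network described here, the helper would time out and return non-zero, which at least makes the failure explicit before stack deploy runs.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): poll `network ls`
# until the given network name is gone, or give up after a timeout.
# DOCKER_BIN follows the script above and defaults to "docker" if unset.
wait_for_network_removal() {
    network="$1"
    timeout="${2:-60}"   # seconds; assumed default
    waited=0
    # -x matches the whole line, -F treats the name as a literal string.
    while "${DOCKER_BIN:-docker}" network ls --format '{{.Name}}' | grep -qxF -- "$network"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "network $network still present after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

It would be called right after stack rm, for example: wait_for_network_removal "${COMPOSE_PROJECT_NAME}_network" 60 || exit 1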

Steps to reproduce the behavior

  • Having a swarm cluster
  • Having a docker-compose file with a given network
  • Deploy stack
  • Connect traefik to it
  • Remove traefik from it
  • Destroy stack
  • Network is still present
sudo docker network inspect <instance_network>
[
    {
        "Name": "<instance_network>",
        "Id": "11xjt1yz38o5vtw40qbzldorx",
        "Created": "2018-08-29T16:58:57.674468644+02:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.49.0/24",
                    "Gateway": "10.0.49.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "lb-<instance_network>": {
                "Name": "<instance_network>-endpoint",
                "EndpointID": "eda2b043adb5e211003258e553a0368d5a2f306245c7609793164a8bb3e5ebe7",
                "MacAddress": "02:42:0a:00:31:04",
                "IPv4Address": "10.0.49.4/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4123"
        },
        "Labels": {
            "com.docker.stack.namespace": "<instance>"
        },
        "Peers": [
            {
                "Name": "517efbb11671",
                "IP": "172.16.0.5"
            }
        ]
    }
]
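
To spot a leftover endpoint like the lb-<instance_network> entry above without reading the whole JSON, the Containers map can be extracted with a Go template. This is a sketch under assumptions: list_network_endpoints and network_is_empty are hypothetical helper names, and as far as I know the Containers field only reflects endpoints on the node where the command runs, so it may need to be run on each node.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every endpoint still listed in
# the "Containers" map of `docker network inspect` for the given network.
list_network_endpoints() {
    "${DOCKER_BIN:-docker}" network inspect \
        --format '{{range $id, $ep := .Containers}}{{$ep.Name}}{{println}}{{end}}' "$1"
}

# Hypothetical helper: exit 0 only when no endpoint is left on the network.
network_is_empty() {
    [ -z "$(list_network_endpoints "$1")" ]
}
```

For example: network_is_empty "<instance_network>" || echo "endpoints still attached"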

Output of docker version:

Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:56 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:21 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 129
 Running: 123
 Paused: 0
 Stopped: 6
Images: 129
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: xqm2e3fld2vem3xali1795ug1
 Is Manager: true
 ClusterID: 7pf90t57w3oog500hniyt9rgr
 Managers: 1
 Nodes: 4
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 172.16.0.5
 Manager Addresses:
  172.16.0.5:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-134-generic
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 62.82GiB
Name: swarm1.*******.net
ID: SAAO:VFA7:YFS4:23ZK:TETY:LINA:ZOFO:URPG:5JYE:3SMU:YP4A:DZZV
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

4 VMs, each with 8 vCPUs / 64 GB RAM

nsteinmetz, Aug 31 '18 08:08

By moving network creation to the CLI and referencing only an external network in the docker-compose file, it works again as expected. I create the network only once, attach traefik to it, and that's it.

I modified my deployment script as follows:

# CHANGED - If stack exists - remove it
if [ $($DOCKER_BIN stack ls --format "{{.Name}}" |grep ^${COMPOSE_PROJECT_NAME}$ |wc -l) -ne "0" ]; then
    # $DOCKER_BIN service update --network-rm ${COMPOSE_PROJECT_NAME}_network traefik_traefik
    $DOCKER_BIN stack rm ${COMPOSE_PROJECT_NAME}
    sleep 20
    $DOCKER_BIN network prune --force
fi

# ADDED - Create network
if [ $($DOCKER_BIN network ls --format "{{.Name}}" |grep ^${COMPOSE_PROJECT_NAME}_network$ |wc -l) -eq "0" ]; then
    $DOCKER_BIN network create -d overlay --attachable ${COMPOSE_PROJECT_NAME}_network
fi

# Run stack
$DOCKER_BIN stack deploy --compose-file docker-compose.yml --with-registry-auth ${COMPOSE_PROJECT_NAME}

if [[ ${TRAEFIK} == "docker" ]]; then
    # ADDED Check if Traefik is already in the network or not
    network=`$DOCKER_BIN network ls --no-trunc |grep ${COMPOSE_PROJECT_NAME} |awk '{print $1}' |wc -l`
    if [[ $network -ne "0" ]]; then
        # Network exists, if Traefik is not already added in the network, add it - do nothing otherwise
        if [ $($DOCKER_BIN service inspect traefik_traefik --format="{{json .Spec.TaskTemplate.Networks}}" | grep `$DOCKER_BIN network ls --no-trunc |grep ${COMPOSE_PROJECT_NAME}_network |awk '{print $1}'` |wc -l) -eq "0" ]; then
            $DOCKER_BIN service update --network-add ${COMPOSE_PROJECT_NAME}_network traefik_traefik
        fi
    fi
fi
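
The "create the network only if it is missing" step above can also be written with an exact, literal name match, which avoids the unquoted grep pattern accidentally matching a longer network name. This is a minimal sketch: ensure_overlay_network is a hypothetical function name, and it uses the same network ls and network create invocations as the script above.

```shell
#!/bin/sh
# Hypothetical helper: create an attachable overlay network only when no
# network with exactly this name exists yet (idempotent across redeploys).
ensure_overlay_network() {
    network="$1"
    # -x: whole-line match, -F: literal string, so "foo" does not match "foo_bar".
    if ! "${DOCKER_BIN:-docker}" network ls --format '{{.Name}}' | grep -qxF -- "$network"; then
        "${DOCKER_BIN:-docker}" network create -d overlay --attachable "$network"
    fi
}
```

Usage would be: ensure_overlay_network "${COMPOSE_PROJECT_NAME}_network"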

nsteinmetz, Aug 31 '18 08:08

Seems close to https://github.com/docker/swarmkit/issues/2637 but I do not create two ingress networks. I only have one and create only overlay networks.

nsteinmetz, Aug 31 '18 08:08

But the issue remains, as I can't remove the network at the end: this only fixes the ability to redeploy my stack on a given network by making the network permanent. If I try to delete it after removing traefik from it, it stays in this ghost state until I remove the traefik stack.
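
One way to check whether the traefik service spec still references the network before trying to delete it is sketched below. It reuses the service inspect template from the script above; service_references_network is a hypothetical name, and the check only looks at the service spec, so it cannot tell whether a load-balancer endpoint is still lingering on a node.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the service spec still lists the network,
# 1 if it does not, 2 if the network ID could not be resolved.
service_references_network() {
    service="$1"
    network="$2"
    net_id="$("${DOCKER_BIN:-docker}" network inspect --format '{{.Id}}' "$network")" || return 2
    [ -n "$net_id" ] || return 2
    # The spec stores networks by ID, so search the JSON for that ID.
    "${DOCKER_BIN:-docker}" service inspect "$service" \
        --format '{{json .Spec.TaskTemplate.Networks}}' | grep -qF -- "$net_id"
}
```

For example: service_references_network traefik_traefik "${COMPOSE_PROJECT_NAME}_network" && echo "traefik is still attached"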

nsteinmetz, Aug 31 '18 08:08

And here is the docker-compose file:

version: '3.6'
services:
  nginx:
    image: docker-hub.admin.bigcorp.net/nginx-nginx:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-nginx
    depends_on:
      - wordpress
      - b2c
      - backoffice
      - batches
      - nginx-static
      - nginx-backoffice
      - nginx-batches
      - nginx-reports
      - nginx-data
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
         - docker.bigcorp.com
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"
        traefik.frontend.redirect.entryPoint: "https"        
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.auth.basic: "${FRONT_HTPASSWD}"
        traefik.frontend.passHostHeader: "true"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
  nginx-static:
    image: docker-hub.admin.bigcorp.net/nginx-static:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-static
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
         - docker-static.bigcorp.com
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}-static.recette.bigcorp.com"    
        traefik.frontend.entryPoints: "http,https"
        traefik.frontend.redirect.entryPoint: "https"        
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.passHostHeader: "true"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
  b2c:
    image: docker-hub.admin.bigcorp.net/b2c:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-b2c
    depends_on:
      - db-all
      - cassandra
      - vault
    volumes:
      - type: bind
        source: /home/bigcorp/logs/${COMPOSE_PROJECT_NAME}/b2c
        target: /var/log/tomcat
      - type: bind
        source: /home/bigcorp/flux/${COMPOSE_PROJECT_NAME}
        target: /srv/tomcat/common_b2c/temp
    environment:
      assuremieux_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}.recette.bigcorp.com
      assuremieux_static: https://${COMPOSE_PROJECT_NAME}-static.recette.bigcorp.com
      cms_url: https://${COMPOSE_PROJECT_NAME}.recette.bigcorp.com/internal
      backoffice_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}-backoffice.recette.bigcorp.com
      graylog_host: graylog.lan.bigcorp.net
      statsd_prefix: com.bigcorp.${COMPOSE_PROJECT_NAME}
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
    deploy:
      labels: 
        traefik.enable: "false"      
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"        
      placement:
        constraints:
          - node.role == worker         
  wordpress:
    image: docker-hub.admin.bigcorp.net/lf-wordpress:2018.1
    hostname: docker-${COMPOSE_PROJECT_NAME}-wordpress
    depends_on:
      - db-all
    deploy:
      labels: 
        traefik.enable: "false"      
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"        
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
  nginx-backoffice:
    image: docker-hub.admin.bigcorp.net/nginx-backoffice:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-nginx-backoffice
    depends_on:
      - backoffice
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
         - docker-backoffice.bigcorp.com
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}-backoffice.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"
        traefik.frontend.redirect.entryPoint: "https"
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.auth.basic: "${BACKOFFICE_HTPASSWD}"
        traefik.frontend.passHostHeader: "true"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"        
  backoffice:
    image: docker-hub.admin.bigcorp.net/backoffice:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-backoffice
    depends_on:
      - db-all
      - cassandra
      - vault
    volumes:
      - type: bind
        source: /home/bigcorp/logs/${COMPOSE_PROJECT_NAME}/backoffice
        target: /var/log/tomcat
      - type: bind
        source: /home/bigcorp/data/${COMPOSE_PROJECT_NAME}
        target: /srv/tomcat/data         
    environment:
      backoffice_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}-backoffice.recette.bigcorp.com
      backoffice_static_baseurl: https://${COMPOSE_PROJECT_NAME}-backoffice.recette.bigcorp.com/public
      assuremieux_static: https://${COMPOSE_PROJECT_NAME}-static.bigcorp.com
      graylog_host: graylog.lan.bigcorp.net
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
    deploy:
      labels: 
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"
      placement:
        constraints:
          - node.role == worker                 
  nginx-batches:
    image: docker-hub.admin.bigcorp.net/nginx-batches:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-nginx-batches
    depends_on:
      - batches    
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
         - docker-batches.bigcorp.com   
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}-batches.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"    
        traefik.frontend.redirect.entryPoint: "https"                
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.auth.basic: "${BATCHES_HTPASSWD}"
        traefik.frontend.passHostHeader: "true"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"        
  batches:
    image: docker-hub.admin.bigcorp.net/batches:${COMPOSE_PROJECT_NAME}
    hostname: docker-${COMPOSE_PROJECT_NAME}-batches    
    depends_on:
      - db-all
      - cassandra
      - vault
    environment:
      assuremieux_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}.recette.bigcorp.com
      assuremieux_static: https://${COMPOSE_PROJECT_NAME}-static.recette.bigcorp.com
      backoffice_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}-backoffice.recette.bigcorp.com
      batches_deployment_baseurl: https://${COMPOSE_PROJECT_NAME}-batches.recette.bigcorp.com    
      cms_url: https://${COMPOSE_PROJECT_NAME}.recette.bigcorp.com/internal
      graylog_host: graylog.lan.bigcorp.net
    volumes:
      - type: bind
        source: /home/bigcorp/logs/${COMPOSE_PROJECT_NAME}/batches
        target: /var/log/tomcat
      - type: bind
        source: /home/bigcorp/data/${COMPOSE_PROJECT_NAME}
        target: /srv/tomcat/data             
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
    deploy:
      labels: 
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"    
      placement:
        constraints:
          - node.role == worker             
  vault:
    image: docker-hub.admin.bigcorp.net/vault:${COMPOSE_PROJECT_NAME}
    hostname: docker-vault
    deploy:
      labels: 
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"          
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
  db-all:
    image: docker-hub.admin.bigcorp.net/db-all:${COMPOSE_PROJECT_NAME}
    user: "999"
    hostname: docker-${COMPOSE_PROJECT_NAME}-db
    volumes:
      - type: bind
        source: /home/bigcorp/mysql/${COMPOSE_PROJECT_NAME}
        target: /var/lib/mysql    
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
          - db-benefit
          - db-wordpress
          - dbdev.lan.bigcorp.net
    deploy:
      labels: 
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"        
  nginx-reports:
    image: docker-hub.admin.bigcorp.net/nginx-reports:${COMPOSE_PROJECT_NAME}
    hostname: nginx-reports
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
         - docker-reports.bigcorp.com   
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}-reports.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"    
        traefik.frontend.redirect.entryPoint: "https"                
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.auth.basic: "${BATCHES_HTPASSWD}"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"           
    volumes:
      - type: bind
        source: /home/bigcorp/data/${COMPOSE_PROJECT_NAME}/reports
        target: /home/wwwbigcorp/www
  nginx-data:
    image: docker-hub.admin.bigcorp.net/nginx-data:${COMPOSE_PROJECT_NAME}
    hostname: nginx-data
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
         - docker-data.bigcorp.com   
    deploy:
      labels:
        traefik.docker.network: "${COMPOSE_PROJECT_NAME}_bigcorp"
        traefik.frontend.rule: "Host:${COMPOSE_PROJECT_NAME}-data.recette.bigcorp.com"
        traefik.frontend.entryPoints: "http,https"    
        traefik.frontend.redirect.entryPoint: "https"                
        traefik.port: "443"
        traefik.protocol: "https"
        traefik.frontend.auth.basic: "${BATCHES_HTPASSWD}"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"           
    volumes:
      - type: bind
        source: /home/bigcorp/data/${COMPOSE_PROJECT_NAME}
        target: /home/wwwbigcorp/www
  cassandra:
    image: docker-hub.admin.bigcorp.net/cassandra:${COMPOSE_PROJECT_NAME}
    hostname: cassandra
    user: "999"
    networks:
      ${COMPOSE_PROJECT_NAME}_bigcorp:
        aliases:
         - cassandra.bigcorp.com   
    deploy:
      labels:
        traefik.enable: "false"
        com.bigcorp.scmfullrevision: "${SCM_FULLREVISION}"
        com.bigcorp.scmbranch: "${SCM_BRANCH}"           
      placement:
        constraints:
          - node.role == worker        
    volumes:
      - type: bind
        source: /home/bigcorp/cassandra/${COMPOSE_PROJECT_NAME}
        target: /var/lib/cassandra
      - type: bind
        source: /home/bigcorp/logs/${COMPOSE_PROJECT_NAME}/cassandra
        target: /var/log/cassandra        
networks:
  ${COMPOSE_PROJECT_NAME}_bigcorp:
    external: true

and initially the network was just:

networks:
  bigcorp:
    driver: overlay

nsteinmetz, Sep 18 '18 11:09

Rolled back to Docker 18.03.1 and it works as expected. I hope this will be fixed in Docker 18.09 or later.

nsteinmetz, Sep 27 '18 08:09

Just curious here... did you check this on Docker 19.03?

mvandermade, Jan 23 '20 20:01

@mvandermade we upgraded to Docker 19.03 but kept managing the network externally. We did not try to go back to the initial setup. As I'm no longer working for this customer, I can't say more.

nsteinmetz, Jan 23 '20 21:01

I still have this problem in 19.03. I think I will migrate to your solution @mvandermade; @nsteinmetz I may try the latest 20.00 version.

mik3fly, Mar 09 '21 22:03