[BUG] sporadic failure in the setup of container networking: overlay network not found during container initialization

Open ysautter opened this issue 7 months ago • 10 comments

Description

When two networks are defined, one of which is an overlay network (the host is initialized as a swarm manager), and they are assigned to a service in the Docker Compose file, startup is sporadically aborted with the following error message: Error response from daemon: failed to set up container networking: could not find a network matching network mode <overlay-network-name>: network <overlay-network-name> not found

Expected behaviour: the service and network definitions are created every time without error.

Steps To Reproduce

Using the following minimal working example, the error can be reproduced every once in a while. (Note: I don't know whether driver_opts is necessary, but it is what we used in our production environment, where we noticed the error.)

services:
  nginx:
    image: nginx:latest
    networks:
      - net
      - second-net

networks:
  net:
    driver: overlay
    attachable: true
    name: net
    external: false
    driver_opts:
      encrypted: "true"
  second-net:
    name: second-net

Executing docker compose up will result in the following error once in a while:

[+] Running 2/3
 ✔ Network net                                         Created                                      0.0s
 ✔ Network second-net                                  Created                                      0.1s
 ⠸ Container debug-nginx-1                             Starting                                     0.3s
Error response from daemon: failed to set up container networking: could not find a network matching network mode net: network net not found

Because the error appears only sporadically, I wrote a simple script that performs the same actions in a loop:

# Enter your advertise-addr here
ADVERTISE_ADDR="x.x.x.x"

docker swarm leave --force >/dev/null 2>&1
docker swarm init --advertise-addr "$ADVERTISE_ADDR" >/dev/null 2>&1

while true; do
  # Run "down" twice: the first pass can leave the overlay network behind as "still in use".
  docker compose down -v >/dev/null 2>&1 && docker compose down -v >/dev/null 2>&1
  output=$(docker compose up -d --force-recreate 2>&1)
  if error_output=$(echo "$output" | grep "Error"); then
    echo
    echo "$error_output"
    echo
  else
    echo
    echo "Everything OK"
    echo
  fi
done

This will result in an output which looks like this:


Everything OK


Error response from daemon: failed to set up container networking: could not find a network matching network mode net: network net not found


Everything OK


Everything OK


Everything OK


Error response from daemon: failed to set up container networking: could not find a network matching network mode net: network net not found


Everything OK


Error response from daemon: failed to set up container networking: could not find a network matching network mode net: network net not found


Everything OK
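
To put a number on "sporadically", a bounded variant of the loop could count failures over a fixed number of runs. This is only a sketch: count_failures and one_cycle are hypothetical helper names, and one_cycle assumes, like the script above, that a failed start can be detected by the word "Error" in the output.

```shell
#!/bin/sh
# count_failures RUNS COMMAND [ARGS...]
# Runs COMMAND the given number of times and prints "failures/runs".
count_failures() {
  local runs=$1 failures=0 i=0
  shift
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures/$runs"
}

# One iteration of the reproduction loop (hypothetical helper).
# Succeeds only if the output of "up" contains no "Error" line.
one_cycle() {
  docker compose down -v >/dev/null 2>&1
  docker compose down -v >/dev/null 2>&1
  ! docker compose up -d --force-recreate 2>&1 | grep -q "Error"
}
```

Running count_failures 50 one_cycle on a swarm manager would then print the failure count for 50 cycles, which may be easier to compare across Compose or engine versions than the scrolling output.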

Compose Version

Docker Compose version 2.36.0

Docker Environment

Client:
 Version:    28.1.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  0.23.0
    Path:     /usr/lib/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  2.36.0
    Path:     /usr/lib/docker/cli-plugins/docker-compose

Server:
 Containers: 6
  Running: 4
  Paused: 0
  Stopped: 2
 Images: 17
 Server Version: 28.1.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: wdkuxge4ny157zriyaq1r0c8i
  Is Manager: true
  ClusterID: vcsfsuzdr2xqe2w89p0skqty5
  Managers: 1
  Nodes: 1
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 192.168.178.126
  Manager Addresses:
   192.168.178.126:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 061792f0ecf3684fb30a3a0eb006799b8c6638a7.m
 runc version:
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.14.6-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 31.24GiB
 Name: YST
 ID: BETC:KIM3:OQXZ:CPL5:5KAO:FVML:5XOD:TDAA:KJLA:4MAV:DYE6:F5SL
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: ysautter
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false

Anything else?

The issue also occurs with Docker Compose version 2.35.1.

ysautter avatar May 22 '25 10:05 ysautter

Can you reproduce this when Docker swarm mode is disabled?

ndeloof avatar May 25 '25 20:05 ndeloof

No, because I cannot create an overlay network when the host is not a swarm manager. Running docker compose up while swarm is disabled results in the following error:

Network net Error failed to create network net: Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
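
A guard along these lines could check the swarm state before docker compose up is attempted. It is a sketch only: is_swarm_manager is a hypothetical helper name, and the DOCKER variable is an assumption added so the CLI can be swapped out for testing. The check relies on the Swarm.LocalNodeState and Swarm.ControlAvailable fields reported by docker info.

```shell
#!/bin/sh
# Command used to talk to the engine; overridable for testing (assumption).
DOCKER=${DOCKER:-docker}

# Succeeds only if this node is in an active swarm and acts as a manager.
is_swarm_manager() {
  local state
  state=$("$DOCKER" info --format '{{.Swarm.LocalNodeState}} {{.Swarm.ControlAvailable}}' 2>/dev/null)
  [ "$state" = "active true" ]
}
```

For example, is_swarm_manager || echo "overlay networks need a swarm manager" >&2 would report the problem before any network creation is attempted.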

ysautter avatar May 26 '25 08:05 ysautter

Noticed something weird:

$ docker compose -f overlay.yaml down -v
[+] Running 3/3
 ✔ Container truc-nginx-1  Removed                                                                                                                                                 0.2s 
 ✔ Network second-net      Removed                                                                                                                                                 0.2s 
 ! Network net             Resource is still in use                                                                                                                                0.0s 

# Let's try again
$ docker compose -f overlay.yaml down -v
[+] Running 1/1
 ✔ Network net  Removed                         

This demonstrates that, with an overlay network in swarm mode, there is a delay between container removal and the network being considered unused. I assume some asynchrony takes place while the swarm cluster manager replicates state between nodes.

Your issue is comparable: the overlay network is created and a container is attached to it within a very short interval, which randomly triggers the "network net not found" error.

Docker Compose can't manage such unpredictable behavior. The Docker engine's NetworkCreate should not require the client to wait a little before a container can actually attach to the network. Please open an issue on github.com/moby/moby.
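
Until the engine behaves differently, the teardown side of this delay could be papered over by repeating the down step until the network is really gone. The following is a sketch under assumptions: down_until_network_gone is a hypothetical helper name, the DOCKER variable exists only so the CLI can be stubbed, and it assumes docker network inspect exits non-zero once the network no longer exists.

```shell
#!/bin/sh
# Command used to talk to the engine; overridable for testing (assumption).
DOCKER=${DOCKER:-docker}

# down_until_network_gone NAME [MAX_ATTEMPTS] [PAUSE_SECONDS]
# Repeats "compose down -v" until NAME can no longer be inspected.
down_until_network_gone() {
  local name=$1 max_attempts=${2:-5} pause=${3:-1} attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    "$DOCKER" compose down -v >/dev/null 2>&1 || true
    # Inspect fails once the network has actually been removed.
    if ! "$DOCKER" network inspect "$name" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$pause"
  done
  return 1
}
```

With the example above, down_until_network_gone net would replace running docker compose down -v by hand a second time. It does not address the creation side of the race.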

ndeloof avatar May 26 '25 12:05 ndeloof

I suspect this is because of the way swarm-managed networks are deployed dynamically, but @robmry may be able to fill me in.

Non-swarm networks are created on the node where the command is executed. For swarm networks, creating a network only creates the "definition" of the network; it doesn't create the actual network on all nodes in the cluster. The actual network is created when a service "task" (a container backing a swarm service) is scheduled to be deployed on a specific node. By default, such networks cannot be used by non-swarm containers (this was an initial security constraint to allow only managed services to access the network, as swarm cluster nodes (workers) are designed with least privilege). The --attachable option was added to allow access to the network from non-swarm containers (e.g. for debugging purposes, to allow running a one-off container on a node that connects to the network), but that feature still depends on the network being rolled out by swarm.
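
For reference, the same setup without Compose might look roughly like the sketch below. The function names are hypothetical, and the DOCKER variable is an assumption added so the CLI can be stubbed; the options mirror the driver, attachable, and driver_opts settings from the Compose file in the report.

```shell
#!/bin/sh
# Command used to talk to the engine; overridable for testing (assumption).
DOCKER=${DOCKER:-docker}

# Creates only the swarm-level definition of an attachable, encrypted overlay network.
create_attachable_overlay() {
  "$DOCKER" network create --driver overlay --attachable --opt encrypted=true "$1"
}

# Starts a standalone (non-swarm) container attached to NETWORK.
# run_attached NETWORK IMAGE
run_attached() {
  "$DOCKER" run --detach --network "$1" "$2"
}
```

Calling create_attachable_overlay net immediately followed by run_attached net nginx:latest on a swarm manager is presumably the same create-then-attach sequence that Compose performs, so it may be a way to check whether the race can be hit without Compose.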

thaJeztah avatar May 26 '25 12:05 thaJeztah

@ndeloof, @thaJeztah ... that all sounds plausible - @corhere, I think you're looking at issues in this area at the moment?

robmry avatar May 27 '25 10:05 robmry

Regarding @ndeloof's observation that the overlay network is still reported as in use right after the container has been removed:

This is also why the script I wrote runs docker compose down -v twice.

Regarding @thaJeztah's explanation that the --attachable option was added mainly for debugging and still depends on the network being rolled out by swarm:

Unfortunately, I cannot meet my needs with Docker swarm services alone and need a more flexible solution for orchestrating distributed containers. The swarm overlay network seemed like the perfect fit, as the containers can communicate even when they are not running on the same host. Simply retrying to attach the container to the overlay network until the network has been created seems to work, but feels rather hacky. Although the documentation states that non-swarm containers cannot join an overlay network without the --attachable flag, I was not aware until now that the flag is meant mainly for debugging. On the other hand, I would argue that the swarm overlay network would be a powerful tool if it behaved more predictably for non-swarm containers.
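
A minimal sketch of that retry approach, assuming a generic helper (the name retry is hypothetical and not part of Docker or Compose):

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Something like retry 5 1 docker compose up -d would then repeat the start a few times with a one-second pause. It only hides the race rather than fixing it, which is why it feels hacky.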

ysautter avatar May 27 '25 11:05 ysautter

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 26 '25 00:10 github-actions[bot]

I seem to be having the same issue. Has anyone found any resolution or made progress?

checkbook-org avatar Nov 06 '25 15:11 checkbook-org

This issue has been automatically marked as not stale anymore due to the recent activity.

stale[bot] avatar Nov 06 '25 15:11 stale[bot]

I seem to have found a workaround. I had also initialized my swarm with --advertise-addr [manager-IP].

If I use docker swarm join --token [XXXXXX] [manager-IP]:2377 --advertise-addr [node-ip] --listen-addr [node-ip]

Things seem to be working now.

checkbook-org avatar Nov 07 '25 11:11 checkbook-org