[BUG] Networking seems broken in 2.40.3
Description
Hi,
I upgraded from 2.40.2 to 2.40.3:
APT-Log:
Upgrade: docker-compose-plugin:amd64 (2.40.2-1~ubuntu.24.04~noble, 2.40.3-1~ubuntu.24.04~noble)
Since the upgrade, networking seems to be completely broken. Services can't reach each other, or they reach the wrong service.
After downgrading from 2.40.3 to 2.40.2 everything is working again.
I looked through the existing issues and didn't find a matching one. Does anyone else have the same problem?
Steps To Reproduce
Sorry, I am a little bit unsure about this.
In fact we didn't change our compose setup at all; networking simply went haywire after the v2.40.3 release.
- We start our compose setup (normally working)
- Services that contact each other reach the wrong service or cannot reach the other service at all.
Compose Version
$ docker compose version
Docker Compose version v2.40.3
Docker Environment
$ docker info
Client: Docker Engine - Community
Version: 28.5.1
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.29.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.40.3
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 32
Running: 31
Paused: 0
Stopped: 1
Images: 78
Server Version: 28.5.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
CDI spec directories:
/etc/cdi
/var/run/cdi
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: b98a3aace656320842a23f4a392a33f46af97866
runc version: v1.3.0-0-g4ca628d1
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.14.0-114036-tuxedo
Operating System: Ubuntu 24.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 62.5GiB
Name: -----
ID: 83acb5da-2c02-4489-9042-4fecaefc1089
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
::1/128
127.0.0.0/8
Live Restore Enabled: false
Anything else?
No response
Without a reproducible example I can hardly help here. Can you inspect the containers and networks created by Compose 2.40.3 and compare them with an earlier version?
I am just trying to figure out more details and prepare an example. I'll come back to you.
FYI: Maybe this is related to #13346 because we also have a lot of extends and depends_on in our code.
For me, internal DNS resolution between containers (using container names) completely broke when upgrading to Docker 29 while using docker compose: Docker hostnames are no longer resolvable and containers can't find each other. After downgrading to 28.5.2 it works again. I am using quite vanilla containers and compose files.
I can confirm the same thing. Updating to Docker 29 broke internal DNS resolution, and containers can't seem to talk to each other anymore. Will try to downgrade.
Edit: can confirm that downgrading to 28.5.2 solves it.
I am not sure if it's related to compose, though, or to the Docker engine in general.
It'd be great to know more about this so we can investigate. An example compose file that reproduces the issue would be ideal.
Otherwise - inspect outputs for containers and networks would be good. Ideally, as @ndeloof says, working and broken versions that we can compare.
Or, if you can enable debugging in Docker, and send the logs from a broken service starting up - perhaps we can find clues in that.
@robmry I could do some tests in a few days or next week, although how would I get meaningful output in this instance? The containers all start up normally, they just can't communicate with each other because no Docker hostnames are reachable. If I bash into a running container and try to ping the hostname of another running container, the Docker DNS says that hostname does not exist (even though they do run on the same network and this has worked up to Docker 29), so I would assume the problem lies somewhere in the Docker DNS part. In my case it has nothing to do with extends or depends_on. I am using Docker on Debian 12.
@iquito can you try a plain docker run --network xx alpine ping <other> to check that a non-compose container can communicate over the network with your other service?
Then please attach the docker inspect output for such a working container and for the container from your compose stack that can't communicate.
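For example, a rough sketch of those checks (the network and container names below are placeholders, substitute whatever Compose created for your project):

# check that a plain (non-compose) container can reach a compose service over the project network
docker network ls | grep myproject_default
docker run --rm --network myproject_default alpine ping -c 3 service1

# collect inspect output to compare a working state against a broken one
docker inspect myproject-service1-1 > service1.inspect.json
docker network inspect myproject_default > network.inspect.json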
Thanks @iquito - a way to repro the issue would be ideal, there doesn't seem to be any issue with a simple compose file that just starts a couple of containers on a network. Otherwise, the inspect outputs and logs mentioned above might tell us something.
If it's happening on a swarm node (?), it could be https://github.com/moby/moby/issues/51491. (That'd make it separate from the original issue, reported against docker 28.5.1.)
We have Docker 29.0.0 / compose 2.40.3 and also see internal name resolution problems, but so far only on one of three servers, as far as I can tell. They all run Debian 12 with a closely analogous setup though, so differences might be hard to tease apart.
Processes report an errno = -11 when that happens. According to netdb.h that would be EAI_SYSTEM. Interestingly, it works for us immediately after process start, but errors develop after somewhere around 40-50 minutes of run time.
Restart fixes it, for a time.
Just observed it in action – internal DNS in the compose environment seems to fail pretty precisely 40 minutes after each container start, and only for some containers, and on some hosts. In this case it was a service built on server-side Dart.
@dilbernd can you please confirm this issue only applies to containers created by compose, and that you don't get any issue running a container with a plain docker run --network xx ... command?
More facts found in internal communication and experimentation:
- Other hosts are affected as well; that only one env was affected turned out to be a miscommunication.
- Downgrading docker-ce to 28.5.2 did not solve the issue for us. However, we only restarted the containers that had been started under 29.0.0 and did not recreate them (which we cannot easily do due to operational constraints), so a metadata issue may be involved. Compose is still at 2.40.3.
- The issues do not seem to be exclusively DNS related: we have a db <- service1 <- service2 structure. "Fixing" the issue for ~half an hour requires restarting the db and service2, not service1. The DB is only connected to from service1, and service1 is only connected to from service2. The DNS error was discovered in service2; service1 does not seem to have that problem, and the DB does not attempt to resolve clients (explicitly set in its config) but also seems to end up in a bad state.
- The issue has occurred between ~25 and ~50 minutes after container start so far.
@ndeloof I have another container on that network now doing docker run --rm -ti --network affected-network alpine watch ping -c 1 service1 – should be enough, right?
Thanks @dilbernd,
Processes report an errno = -11 when that happens.
Any idea what system call is returning that errno?
The Issues seem to not be exclusively DNS related: [...], and DB does not attempt to resolve clients (explicitly set in config), but also seems to come into a bad state.
What is that bad state / what's not working apart from DNS?
Have you managed to collect any container/network inspect outputs or daemon logs?
@robmry
Any idea what system call is returning that errno?
No, I can't properly strace in the prod env. My guess would be that it's calling through to gethostbyname in the libc, since it happens during name resolution, but admittedly that's a guess. I haven't looked at the implementation details there or in our code so far. We don't have a stack trace logged for that error, so it would take a while to say with certainty.
What is that bad state / what's not working apart from DNS?
Hard to tell; it doesn't seem to log anything around that time in service1 or db. It's just an observation by a colleague that restarting service2 does not fully resolve the error, but restarting only the db also makes it work again.
Oh yeah,
Have you managed to collect any container/network inspect outputs or daemon logs?
I can only provide heavily redacted output, are there any values you’re interested in in particular?
Ok, the errno 11 is from name resolution - I wondered if it was related to the other issues you mentioned. So there may not be a non-DNS issue, we don't know yet. (For example, maybe restarting the db container restores its DNS entry.)
I can only provide heavily redacted output, are there any values you’re interested in in particular?
Hard to say, we've not got much to go on yet. But from network/container inspect outputs, any differences between when it's working and after it's failed might be interesting. Daemon (or host) logs from the point where it fails might be useful.
Daemon logs (with debug enabled) from a failed DNS lookup could be good.
Does anything else happen on the system at the point where it fails ... maybe another container or service starting/stopping, a Docker daemon or firewall reload, anything like that?
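For reference, one way to capture that is to enable daemon debug logging and follow the logs while reproducing a failed lookup (a sketch assuming a systemd-based install; merge the setting by hand if /etc/docker/daemon.json already has other options):

# enable debug logging for dockerd
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker        # or send SIGHUP to dockerd to reload the config
# follow the daemon logs while triggering the failing DNS lookup
journalctl -u docker.service -f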
Ok, the errno 11 is from name resolution - I wondered if it was related to the other issues you mentioned. So there may not be a non-DNS issue, we don't know yet. (For example, maybe restarting the db container restores its DNS entry.)
Yeah, very possible. It's very unclear to us what causes this, since the only error that actually shows up in logs is in one specific type of service, the backend Dart HTTP server.
The thing is that service2 and the DB never talk to each other; service2 only talks to service1, which talks to the DB, and that one is a happy camper in the middle that does not require a restart to restore service, which makes it even more confusing.
Does anything else happen on the system at the point where it fails ... maybe another container or service starting/stopping, a Docker daemon or firewall reload, anything like that?
No, nothing. I'm looking into what I can make observable aside from that.
@ndeloof I have another container on that network now doing docker run --rm -ti --network affected-network alpine watch ping -c 1 service1 – should be enough, right?
This has now run for 2 hours without issue. I don't think that tells us much about docker vs compose though, since most compose services here also seem to use the network and DNS without issue.
It seems to take very particular behaviour to trigger.
The bigger indication (IMO) that it's compose is that on one machine we have already downgraded docker-ce, but not docker-compose-plugin, from the official apt repos, and the issue still reoccurs there.
Also experiencing this problem. Test that failed:
docker-compose.yml:
services:
  a:
    image: alpine:3.20
    command: ["sh", "-c", "sleep 1000000"]
  b:
    image: alpine:3.20
    command: ["sh", "-c", "sleep 1000000"]
docker compose up -d
docker compose exec a sh -c "apk add --no-cache bind-tools >/dev/null && getent hosts b || nslookup b"
Output:
docker compose exec a sh -c "apk add --no-cache bind-tools >/dev/null && getent hosts b || nslookup b"
;; Got SERVFAIL reply from 127.0.0.11
Server:   127.0.0.11
Address:  127.0.0.11#53

** server can't find b: SERVFAIL
Downgrading from 29.0.0 to 28.5.2 fixes internal DNS issues...
Thanks @blackadar ... I'm not able to reproduce the issue using that compose file - does it fail every time for you?
Could you enable debugging and send the logs?
We have downgraded our servers to staunch the bleeding and recreated the affected containers under the old version:
Client: Docker Engine - Community
Version: 28.5.2
[…]
Server: Docker Engine - Community
Engine:
Version: 28.5.2
[…]
containerd:
Version: v1.7.29
GitCommit: 442cb34bda9a6a0fed82a2ca7cade05c5c749582
runc:
Version: 1.3.3
GitCommit: v1.3.3-0-gd842d771
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Compose version:
Docker Compose version v2.40.2
However, we still observe the same problem(s), albeit a bit more rarely, since the new version was not involved with anything related to these containers, except for the docker networks, which could not be recreated in place. Maybe that narrows it down: might it be the network definitions and the metadata associated with them?
I'd be interested to know whether the people who reported it as fixed downed and re-upped their whole environments, including the docker networks.
We have also recently started seeing name resolution errors with errno = 24 (out of file descriptors); services that managed with the default (1024) before the upgrade have suddenly started seeing spikes that occasionally even 64K descriptors did not suffice for. We cannot be totally confident that it's not related to changes on our side, so I wanted to ask whether anyone else is seeing vastly higher fd consumption, or whether that is even potentially related to the issue under discussion here (it might just come up in a similar context by blind chance)?
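(For what it's worth, a rough way to compare fd usage per container, sketched below; service2 is just a placeholder for the container name:)

# resolve the container's main PID, then check the limits it sees and its current fd count
pid=$(docker inspect --format '{{.State.Pid}}' service2)
grep 'open files' /proc/$pid/limits   # soft and hard limits as seen by the process
sudo ls /proc/$pid/fd | wc -l         # number of descriptors currently open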
At this point we're wondering whether the issues could also be related to the new kernel release stemming from the recent kernel security advisories, which is actually what prompted the apt upgrade that preceded the problems in the first place.
Did the people who downgraded roll back all the recent upgrades, or only docker?
We have also recently started seeing name resolution errors with errno = 24 (out of file descriptors); services that managed with the default (1024) before the upgrade have suddenly started seeing spikes that occasionally even 64K descriptors did not suffice for. We cannot be totally confident that it’s not related to changes on our side
This is a good observation :)
From the looks of it, Docker 29.0.0 is when Docker Engine finally upgraded from the containerd 1.7.x series to 2.x (the release notes are a bit vague on the previous version, and the linked PR seems wrong as it's about CI usage and lists 2.x as the previous version, while the v28 series mentions 1.7.x).
containerd 2.0 carries a change that can be considered breaking: Docker Engine itself landed the equivalent change on its side back in the v25 release, removing LimitNOFILE=infinity from its systemd service file.
I've been waiting for this to land myself for a complete fix, but I am aware of it affecting large enterprise-scale deployments, such as when Amazon adopted the change early and reverted it due to the impact on their customers.
For a quick gist of it: prior to the change in both projects, they bumped not only the hard limit for file descriptors but also the soft limit, and infinity was too large on various Linux hosts due to a systemd change in late 2018 IIRC. That resulted in a soft limit of over 1 billion, which regressed quite a few services when containerized (causing OOM or significant processing delays with unnecessary CPU load).
Some services like Envoy relied upon the bug implicitly at the time, as they apparently needed more than a million FDs (which is also the default hard limit in Debian IIRC). Near the time of this change landing in Docker v25, Go also made a change to automatically raise the soft limit to the hard limit (although this had some conditions IIRC, so if Docker Compose relied upon that but it wasn't applicable, it may be stuck at 1024). The default hard limit is inherited from systemd, which should be about half a million, while the soft limit should be 1024 (for compatibility reasons).
If this is the cause, the affected application needs to raise the soft limit at runtime in its code (the proper fix), but you could also override the containerd systemd service file to have LimitNOFILE=infinity and restart that service as a quick test. For software that can raise the soft limit (as Go does implicitly for developers), if the default hard limit is not high enough you could bump that to infinity AFAIK and you'd be alright, but the soft limit should only be bumped per service (Nginx, for example, keeps the 1024 limit but raises it for child processes where appropriate).
If you would like further insight into the change at both projects, I am the PR author of both and pushed for the change.
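For anyone wanting to try the quick test described above, a minimal sketch (assuming a systemd host with the standard containerd.service unit; this restores the old unlimited fd limit for testing only):

# add a drop-in that overrides the containerd unit's fd limit
sudo mkdir -p /etc/systemd/system/containerd.service.d
printf '[Service]\nLimitNOFILE=infinity\n' | sudo tee /etc/systemd/system/containerd.service.d/limit-nofile.conf
sudo systemctl daemon-reload
sudo systemctl restart containerd
# containers created after the restart inherit the raised limit, so recreate the affected services
docker compose up -d --force-recreate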
@mschop Are you using docker compose "profiles"? https://docs.docker.com/compose/how-tos/profiles/
In my case that's what seems to break networking. i.e. Given the docker compose file
docker-compose.development.yaml
x-default-environment: &default-environment
  NODE_ENV: development
  TZ: "UTC"
  DB_HOST: db
  DB_USER: sa
  DB_NAME: elcc_development
  DB_PASS: &default-db-password DevPwd99!
  DB_PORT: &default-db-port 1433
  DB_TRUST_SERVER_CERTIFICATE: "true"
  DB_HEALTH_CHECK_INTERVAL_SECONDS: 5
  DB_HEALTH_CHECK_TIMEOUT_SECONDS: 10
  DB_HEALTH_CHECK_RETRIES: 3
  DB_HEALTH_CHECK_START_PERIOD_SECONDS: 5

services:
  api:
    build:
      context: ./api
      dockerfile: development.Dockerfile
    env_file:
      - ./api/.env.development
    environment:
      <<: *default-environment
      RELEASE_TAG: ${RELEASE_TAG:-development}
      GIT_COMMIT_HASH: ${GIT_COMMIT_HASH:-not-set}
    tty: true # allows attaching debugger, equivalent of docker exec -t
    init: true
    # stdin_open: true # equivalent of docker exec -i
    ports:
      - "3000:3000"
    volumes:
      - ./api:/usr/src/api
      - ./.gitignore:/usr/src/.gitignore
      - ./.prettierrc.yaml:/usr/src/.prettierrc.yaml
    depends_on:
      - db

  web:
    build:
      context: ./web
      dockerfile: development.Dockerfile
    environment:
      <<: *default-environment
      VITE_API_BASE_URL: "http://localhost:3000"
    ports:
      - "8080:8080"
    volumes:
      - ./web:/usr/src/web
      - ./.gitignore:/usr/src/.gitignore
      - ./.prettierrc.yaml:/usr/src/.prettierrc.yaml
    depends_on:
      - api

  test_api:
    build:
      context: ./api
      dockerfile: development.Dockerfile
    command: /bin/true
    env_file:
      - ./api/.env.development
    environment:
      <<: *default-environment
      NODE_ENV: test
      DB_NAME: elcc_test
      DB_HEALTH_CHECK_START_PERIOD_SECONDS: 0
    tty: true
    volumes:
      - ./api:/usr/src/api
    depends_on:
      - db

  test_web:
    build:
      context: ./web
      dockerfile: development.Dockerfile
    command: /bin/true
    environment:
      <<: *default-environment
      NODE_ENV: test
    tty: true
    volumes:
      - ./web:/usr/src/web

  db:
    image: mcr.microsoft.com/mssql/server:2019-CU28-ubuntu-20.04
    user: root
    environment:
      <<: *default-environment
      DB_HOST: "localhost"
      MSSQL_SA_PASSWORD: *default-db-password
      ACCEPT_EULA: "Y"
    ports:
      - "1433:1433"
    volumes:
      - db_data:/var/opt/mssql/data

  # For easily generating large PlantUML diagrams
  # Not relevant to production environment.
  # Accessible at http://localhost:9999
  plantuml:
    image: plantuml/plantuml-server:jetty
    ports:
      - 9999:8080
    environment:
      PLANTUML_LIMIT_SIZE: 8192
    profiles:
      - design

volumes:
  db_data:
docker compose up plantuml no longer works with the other services running. I can now only boot one profile at a time.
Previously, I could simply boot new services even if they had a different "profile".
docker compose up plantuml no longer works with the other services running. I can now only boot one profile at a time. Previously, I could simply boot new services even if they had a different "profile".
Just to verify as you haven't stated it, did you try with the profile for that service commented out so it's always up regardless? I assume it works then? What if you add another profile to a different service? Does it still work?
A simpler / smaller reproduction would be better to isolate that, if it's really the case. traefik/whoami is probably all you need on different ports with the multiple profiles.
docker-ce v29.0.1 seems to have fixed my networking issue, so it wasn't related to compose, at least in my case.
docker compose up plantuml no longer works with the other services running. I can now only boot one profile at a time. Previously, I could simply boot new services even if they had a different "profile".
Just to verify as you haven't stated it, did you try with the profile for that service commented out so it's always up regardless? I assume it works then? What if you add another profile to a different service? Does it still work?
A simpler / smaller reproduction would be better to isolate that, if it's really the case.
traefik/whoami is probably all you need on different ports with the multiple profiles.
@polarathene Sorry! That was sloppy of me. Here is a minimal example using traefik/whoami. I'm not sure it actually reproduces the same issue, but it definitely fails in the same way.
- Create the file docker-compose.yaml:
services:
  # Always running (no profile)
  service-a:
    image: traefik/whoami
    ports:
      - "8081:80"
    container_name: service-a

  # Profile: extras
  service-b:
    image: traefik/whoami
    ports:
      - "8082:80"
    container_name: service-b
    profiles:
      - extras
- Boot the container via docker compose up; it will build the first time.
- Boot the secondary container via docker compose up service-b. This will work the first time.
- Stop the containers via docker compose down or ctrl+c.
- Remove any trailing stuff via docker compose down -v just to make sure it's a clean setup.
- Boot the app again with docker compose up.
- Boot the second container with docker compose up service-b. This will now fail with the message:
Attaching to service-b
Error response from daemon: failed to set up container networking: network 86cb754f592cdb1e4cf71441a8d9e207f53e1aad20b188b998484ae5857eefa6 not found
It doesn't seem to happen 100% of the time, so maybe it's just that the network for the service-b profile isn't cleaned up by docker compose down since that's a different profile?
It appears as though doing a docker compose down service-b will clean up the conflicting network ... so maybe this is me just using docker compose profiles incorrectly?
In regards to
did you try with the profile for that service commented out so it's always up regardless? I assume it works then?
Yes, when I remove the profile entirely it works as normal.
@klondikemarlen We do not use profiles.
But: I was now able to find the root cause (for our problem).
We have a base service using a network alias:
site1_server.local:
  networks:
    default:
      aliases:
        - oss-0193244c-0285-75a8-96c5-eed41f6dd5db.local
and a second service extending the first one:
site2_server.local:
  extends:
    service: site1_server.local
  networks: !override
    default:
      aliases:
        - oss-01932034-41e5-75e1-9d5c-47cd2867ee5b.local
When running docker compose config, in docker compose v2.40.2 the service site2_server.local has the following aliases:
networks:
  default:
    aliases:
      - oss-01932034-41e5-75e1-9d5c-47cd2867ee5b.local
In version v2.40.3 docker compose config outputs the following:
networks:
  default:
    aliases:
      - oss-0193244c-0285-75a8-96c5-eed41f6dd5db.local
      - oss-01932034-41e5-75e1-9d5c-47cd2867ee5b.local
This explains our problem: other services trying to call oss-0193244c-0285-75a8-96c5-eed41f6dd5db.local are connecting to the wrong service.
👉 So the conclusion is that !override behaves differently since v2.40.3.
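(A quick way to pin this down, as a sketch: render the canonical config with each plugin version and diff the result.)

# with docker-compose-plugin 2.40.3 installed
docker compose config > config-2.40.3.yaml
# downgrade the plugin to 2.40.2, then
docker compose config > config-2.40.2.yaml
diff config-2.40.2.yaml config-2.40.3.yaml   # the extra inherited alias shows up under the service's networks section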
@klondikemarlen I think that's a potentially valid bug to raise (if it changed with 2.40.3 it may have been an intentional fix for something else, such that you can't have both use-cases satisfied with default logic).
The actual cause you want to report is docker compose down removing a network that still has containers configured for it (such as one that is stopped from CTRL + C). You can avoid that bug by ensuring the stopped container is recreated and assigned the newly created network with docker compose up --force-recreate ....
8. Boot the second container with docker compose up service-b. This will now fail with message:
Attaching to service-b
Error response from daemon: failed to set up container networking: network 86cb754f592cdb1e4cf71441a8d9e207f53e1aad20b188b998484ae5857eefa6 not found
It doesn't seem to happen 100% of the time, so maybe it's just that the network for the service-b profile isn't cleaned up by docker compose down since that's a different profile?
This isn't so much an issue about profiles. You'll get the same problem if you use docker compose up with each service individually, similar to how you did for service-b.
The reason this happens is that both containers are brought up on the same network, but when you bring one of them down that network gets removed. docker compose down will destroy the container and perform cleanup operations.
However, when you have scoped these operations, such as to individual services (or used profiles to filter them out of the default set), then from Docker Compose's point of view all containers associated with that network were brought down, so the network could be removed. It probably shouldn't have been, though, as the other service's container still exists (even if it's stopped, and visible in docker container ls -a).
Anyway, because of this, when you try to bring the stopped container back up it is still configured for the now-removed network and fails because that network doesn't exist anymore. You can avoid this with --force-recreate, which destroys the existing container and replaces it with a new one.
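In practice that recovery looks something like this (a sketch using the service name from the example above):

# the old service-b container still references the deleted network; recreate it against the new one
docker compose up -d --force-recreate service-b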
Worth noting: docker compose down service-a will still remove the network; even if there is no container associated with service-a, it still performs the cleanup task. When you start service-b you will notice a new network is created, but it fails without --force-recreate since the existing service-b container is still associated with the removed network (they share the same network name, but the actual network ID differs, which is what matters).
I learned about the importance of --force-recreate in other projects where some images produce containers that don't handle restarts well, because their entrypoints run container initialization that mutates internal state. When you CTRL + C a container it is not equivalent to docker compose down (which removes the container); you get the effect of restarting the container, so it keeps any internal changes (no volume required). It's similar to the importance of docker compose down -v for removing any unexpected volumes (like those from images built with the VOLUME instruction, which persist data across changes to a compose service via an implicit anonymous volume).
For added context: when you change a network setting such as its subnet in your compose config, --force-recreate alone won't be sufficient when the network hasn't been removed (in a typical deployment, rather than the one we've been discussing where it was), because --force-recreate only applies to containers, not networks. In that scenario you do need docker compose down or similar to remove the existing network, so that it's recreated on the next up with your settings in place.
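A sketch of that last scenario (names illustrative; note this removes the project's containers and networks):

# --force-recreate replaces containers but not networks, so a changed subnet needs the network removed first
docker compose down      # removes the project's containers and its networks
docker compose up -d     # recreates the network with the new settings and attaches fresh containers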