
Swarm repeatedly retries image pull on disk-full nodes, consuming excessive bandwidth

Open Qwarctick opened this issue 9 months ago • 0 comments

Description

When using Docker Swarm, if a node fails to pull an image due to insufficient disk space, Swarm continuously retries the download in a tight loop. In our case, this behavior resulted in over 2TB of bandwidth consumed in less than 24 hours, with no container ever successfully starting.

This has a major impact on network usage and registry load, especially when image sizes exceed 1GB and failures happen on multiple nodes.

Jun 12 09:00:06 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:06.367754020Z" level=info msg="Attempting next endpoint for pull after error: failed to register layer: write /usr/share/kibana/node_modules/@elastic/request-converter>
Jun 12 09:00:06 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:06.369244637Z" level=error msg="pulling image failed" error="failed to register layer: write /usr/share/kibana/node_modules/@elastic/request-converter/dist/schema.json>
Jun 12 09:00:06 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:06.369388754Z" level=error msg="fatal task error" error="No such image: custom_registry.xyz/kibana-oss:latest@sha256:9cae160e0ee294a2a89252096f06acc1ac>
Jun 12 09:00:06 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:06.441383713Z" level=info msg="Layer sha256:1392f8e7c35ede39c33ac7c848af7d868f9fa9429f8b3f114e33fe83ab3c2a1f cleaned up"
Jun 12 09:00:06 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:06.732342895Z" level=warning msg="failed to deactivate service binding for container xyz_kibana.1.62nb9v0qktpxk3o61dxjxgugt" error="No such container: xyzonpr>
Jun 12 09:00:10 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:10.351369598Z" level=info msg="Attempting next endpoint for pull after error: failed to register layer: write /usr/share/elasticsearch/modules/x-pack-ml/platform/linux>
Jun 12 09:00:10 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:10.352582566Z" level=error msg="pulling image failed" error="failed to register layer: write /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/lib/libmk>
Jun 12 09:00:10 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:10.352738390Z" level=error msg="fatal task error" error="No such image: custom_registry.xyz/elasticsearch-oss:latest@sha256:e40d111fcf76f521e79cc3c1978>
Jun 12 09:00:10 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:10.354305414Z" level=info msg="Layer sha256:2e3739e412ea47ca6e8cea1fb174edb6cc9454d57a5c05745fd64111f5481007 cleaned up"
Jun 12 09:00:10 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:10.354333402Z" level=info msg="Layer sha256:c4793536211cd89f2fa419f22299b223208e4ee90776aa425caf3d45ef6d63bb cleaned up"
Jun 12 09:00:10 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:10.355872925Z" level=info msg="Layer sha256:7cc14c341faea04af2e1c7b3179c9e782266c27bfe4ad108b0726454ee934e99 cleaned up"
Jun 12 09:00:10 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:10.365809894Z" level=info msg="Layer sha256:5a67d7de8020dc614f50a93f2123af6a724f36ecfde6ae798404dede73fc2de1 cleaned up"
Jun 12 09:00:10 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:10.754443579Z" level=warning msg="failed to deactivate service binding for container xyz_elasticsearch.1.rf781l00imknngy9ne6k9xhmy" error="No such container: >
Jun 12 09:00:29 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:29.351123696Z" level=info msg="Attempting next endpoint for pull after error: failed to register layer: write /usr/share/elasticsearch/jdk/lib/modules: no space left o>
Jun 12 09:00:29 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:29.352362899Z" level=error msg="pulling image failed" error="failed to register layer: write /usr/share/elasticsearch/jdk/lib/modules: no space left on device" module=>
Jun 12 09:00:29 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:29.352507974Z" level=error msg="fatal task error" error="No such image: custom_registry.xyz/elasticsearch-oss:latest@sha256:e40d111fcf76f521e79cc3c1978>
Jun 12 09:00:29 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:29.352870283Z" level=info msg="Layer sha256:2e3739e412ea47ca6e8cea1fb174edb6cc9454d57a5c05745fd64111f5481007 cleaned up"
Jun 12 09:00:29 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:29.352892650Z" level=info msg="Layer sha256:c4793536211cd89f2fa419f22299b223208e4ee90776aa425caf3d45ef6d63bb cleaned up"
Jun 12 09:00:29 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:29.353942789Z" level=info msg="Layer sha256:7cc14c341faea04af2e1c7b3179c9e782266c27bfe4ad108b0726454ee934e99 cleaned up"
Jun 12 09:00:29 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:29.361776163Z" level=info msg="Layer sha256:5a67d7de8020dc614f50a93f2123af6a724f36ecfde6ae798404dede73fc2de1 cleaned up"
Jun 12 09:00:29 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:29.782420934Z" level=warning msg="failed to deactivate service binding for container xyz_elasticsearch.1.mgy3bmoxqxkxd6ozihgqqt5ck" error="No such container: >
Jun 12 09:00:34 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:34.680761055Z" level=info msg="Attempting next endpoint for pull after error: failed to register layer: write /usr/share/kibana/node_modules/@kbn/screenshotting-plugin>
Jun 12 09:00:34 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:34.682130838Z" level=error msg="pulling image failed" error="failed to register layer: write /usr/share/kibana/node_modules/@kbn/screenshotting-plugin/chromium/headles>
Jun 12 09:00:34 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:34.682269499Z" level=error msg="fatal task error" error="No such image: custom_registry.xyz/kibana-oss:latest@sha256:9cae160e0ee294a2a89252096f06acc1ac>
Jun 12 09:00:34 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:34.736480992Z" level=info msg="Layer sha256:1392f8e7c35ede39c33ac7c848af7d868f9fa9429f8b3f114e33fe83ab3c2a1f cleaned up"
Jun 12 09:00:35 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:35.090726402Z" level=warning msg="failed to deactivate service binding for container xyz_kibana.1.ypr5ejllpvwocap5warlgj5yh" error="No such container: xyzonpr>
Jun 12 09:00:49 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:49.313699548Z" level=error msg="Download failed after 1 attempts: write /var/lib/docker/tmp/GetImageBlob4091921970: no space left on device"
Jun 12 09:00:49 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:49.429887166Z" level=info msg="Attempting next endpoint for pull after error: write /var/lib/docker/tmp/GetImageBlob4091921970: no space left on device"
Jun 12 09:00:49 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:49.431869113Z" level=error msg="pulling image failed" error="write /var/lib/docker/tmp/GetImageBlob4091921970: no space left on device" module=node/agent/taskmanager n>
Jun 12 09:00:49 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:49.432603843Z" level=error msg="fatal task error" error="No such image: custom_registry.xyz/elasticsearch-oss:latest@sha256:e40d111fcf76f521e79cc3c1978>
Jun 12 09:00:49 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:49.804403917Z" level=info msg="Attempting next endpoint for pull after error: failed to register layer: write /usr/share/kibana/node_modules/@kbn/core/packages/http/br>
Jun 12 09:00:49 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:49.806400283Z" level=error msg="pulling image failed" error="failed to register layer: write /usr/share/kibana/node_modules/@kbn/core/packages/http/browser-internal/sr>
Jun 12 09:00:49 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:49.806817043Z" level=error msg="fatal task error" error="No such image: custom_registry.xyz/kibana-oss:latest@sha256:9cae160e0ee294a2a89252096f06acc1ac>
Jun 12 09:00:49 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:49.858391823Z" level=info msg="Layer sha256:1392f8e7c35ede39c33ac7c848af7d868f9fa9429f8b3f114e33fe83ab3c2a1f cleaned up"
Jun 12 09:00:49 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:49.874899512Z" level=warning msg="failed to deactivate service binding for container xyz_elasticsearch.1.nz6b5it3285ugjg3ktkgs5hww" error="No such container: >
Jun 12 09:00:50 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:00:50.250965496Z" level=warning msg="failed to deactivate service binding for container xyz_kibana.1.q31urphimw5adzriafpfc3kr3" error="No such container: xyzonpr>
Jun 12 09:01:09 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:09.739629787Z" level=info msg="Attempting next endpoint for pull after error: failed to register layer: write /usr/share/kibana/node_modules/@kbn/esql-ast/src/antlr/es>
Jun 12 09:01:09 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:09.743707971Z" level=error msg="pulling image failed" error="failed to register layer: write /usr/share/kibana/node_modules/@kbn/esql-ast/src/antlr/esql_lexer.js: no s>
Jun 12 09:01:09 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:09.744048446Z" level=error msg="fatal task error" error="No such image: custom_registry.xyz/kibana-oss:latest@sha256:9cae160e0ee294a2a89252096f06acc1ac>
Jun 12 09:01:09 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:09.796583454Z" level=info msg="Layer sha256:1392f8e7c35ede39c33ac7c848af7d868f9fa9429f8b3f114e33fe83ab3c2a1f cleaned up"
Jun 12 09:01:10 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:10.182461632Z" level=warning msg="failed to deactivate service binding for container xyz_kibana.1.66qibvnc32c1j4ixyp4cvfx5t" error="No such container: xyzonpr>
Jun 12 09:01:14 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:14.099756725Z" level=info msg="Attempting next endpoint for pull after error: failed to register layer: write /usr/share/elasticsearch/modules/x-pack-core/unboundid-ld>
Jun 12 09:01:14 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:14.100953593Z" level=error msg="pulling image failed" error="failed to register layer: write /usr/share/elasticsearch/modules/x-pack-core/unboundid-ldapsdk-6.0.3.jar: >
Jun 12 09:01:14 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:14.101127972Z" level=error msg="fatal task error" error="No such image: custom_registry.xyz/elasticsearch-oss:latest@sha256:e40d111fcf76f521e79cc3c1978>
Jun 12 09:01:14 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:14.103125817Z" level=info msg="Layer sha256:2e3739e412ea47ca6e8cea1fb174edb6cc9454d57a5c05745fd64111f5481007 cleaned up"
Jun 12 09:01:14 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:14.103227288Z" level=info msg="Layer sha256:c4793536211cd89f2fa419f22299b223208e4ee90776aa425caf3d45ef6d63bb cleaned up"
Jun 12 09:01:14 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:14.104358173Z" level=info msg="Layer sha256:7cc14c341faea04af2e1c7b3179c9e782266c27bfe4ad108b0726454ee934e99 cleaned up"
Jun 12 09:01:14 xyz-dev-vmt dockerd[658124]: time="2025-06-12T09:01:14.113088597Z" level=info msg="Layer sha256:5a67d7de8020dc614f50a93f2123af6a724f36ecfde6ae798404dede73fc2de1 cleaned up"

Expected Behavior

If a node cannot pull an image (e.g., because its disk is full), Swarm should:

  • Retry with exponential backoff or delay
  • Optionally stop retrying after N attempts
  • Possibly mark the node as temporarily unschedulable for that service

Steps to Reproduce

  • Create a Swarm cluster with at least one node having very low disk space
  • Deploy a service with an image that is not yet present on the node
  • Observe that the node fails to pull the image
  • Swarm retries the pull repeatedly, consuming bandwidth

Workarounds Tried

  • Manually labeling nodes to avoid scheduling
  • Preloading images manually
  • Using a restart policy (not effective, since the pull fails before the container ever starts)

Environment

Output of docker version:

Client: Docker Engine - Community
 Version:           28.2.2
 API version:       1.50
 Go version:        go1.24.3
 Git commit:        e6534b4
 Built:             Fri May 30 12:07:27 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.2.2
  API version:      1.50 (minimum version 1.24)
  Go version:       go1.24.3
  Git commit:       45873be
  Built:            Fri May 30 12:07:27 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
Output of docker info:

Client: Docker Engine - Community
 Version:    28.2.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.24.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.36.2
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 4
  Running: 4
  Paused: 0
  Stopped: 0
 Images: 22
 Server Version: 28.2.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Swarm: active
  NodeID: mk7xr7dh6mdbah6uxs6fwmbqn
  Is Manager: true
  ClusterID: r62w7eooox7fnkal1jq8r046a
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.10.XX.XX
  Manager Addresses:
   10.10.XX.XX:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.5-0-g59923ef
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-60-generic
 Operating System: Ubuntu 24.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.25GiB
 Name: xyz-dev-vmt
 ID: 6910a535-2bb9-4efd-bf9a-1eaea68b43fa
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false

Would it be possible to:

  • Introduce configurable pull retry limits or backoff strategies?
  • Provide node-level detection of repeated pull failures and mark nodes as degraded?
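A pre-pull disk check could be one building block for such node-level detection. The sketch below (Linux-only, using stdlib syscall.Statfs) shows the idea; the 1 GiB threshold and the freeBytes helper are assumptions for illustration, not anything swarmkit currently does:

```go
package main

import (
	"fmt"
	"syscall"
)

// freeBytes reports the space available to unprivileged users on the
// filesystem containing path. Linux-only sketch; a real fix would live
// in the swarmkit agent, not in a standalone program.
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

func main() {
	// Hypothetical threshold: refuse the pull (and report the node as
	// degraded) when less than the expected image size is free.
	const imageSize = 1 << 30 // 1 GiB, illustrative only
	free, err := freeBytes("/var/lib/docker")
	if err != nil {
		free, err = freeBytes("/") // fall back if the Docker root is absent
	}
	if err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	if free < imageSize {
		fmt.Println("would skip pull: insufficient disk space")
	} else {
		fmt.Printf("ok to pull: %d bytes free\n", free)
	}
}
```

Checking before downloading would avoid the worst case seen here, where each attempt streamed most of a >1GB image only to fail at layer registration.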

Thank you!

Qwarctick · Jun 12 '25 10:06