
[BUG] ALL containers which use NFS volumes do not start after reboot

Open the-hotmann opened this issue 1 year ago • 26 comments

Description

This is the ported Bug-Report from: https://github.com/docker/compose/issues/11354

I am currently facing an issue with my Intel NUC running Debian SID that has persisted for about a year. Despite trying various solutions, I have been unable to resolve it satisfactorily.

My configuration is as follows:

NUC ==(NFS - docker-compose volume)==> SYNO

I run numerous containers within my docker-compose stack, all of which use the restart: unless-stopped policy. However, after a system reboot, all containers with NFS volumes mapped fail to start automatically; they remain stopped until started manually. Interestingly, a manual start or restart at any other time works seamlessly, and everything functions as expected.

I expect all containers to start automatically after a system reboot, and suspect there may be an underlying, hidden race condition that eludes my detection.

It also seems to be the very same issue as the ones mentioned here:

  1. LINK1
  2. LINK2
  3. LINK3
  4. LINK4

Reproduce

  1. use docker-compose (or the docker CLI, as demonstrated in the old bug report https://github.com/docker/compose/issues/11354#issuecomment-1902235822)
  2. configure any container
  3. configure an NFS volume like this:

volumes:
  share:
    name: share
    driver_opts:
      type: "nfs"
      o: "addr=192.168.178.2,nfsvers=4"
      device: ":/volume1/NFS_SHARE/"
  4. use the named NFS volume in the configured container (see the sketch after this list)
  5. do the restart test: docker restart container_name, then check with docker ps
  6. do the reboot test: reboot, then check with docker ps
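
For step 4, a minimal sketch of a service using that named volume could look like the following (the service name, image, and container path /share are illustrative assumptions, not taken from the original report):

services:
  app:
    image: alpine                # placeholder image; any container will do
    command: tail -f /dev/null   # keep the container running
    restart: unless-stopped      # the restart policy used in this report
    volumes:
      - share:/share             # the named NFS volume from step 3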

Expected behavior

There should be no race condition, and all containers (including the ones with NFS-based volumes mounted) should restart automatically.

docker version

Client: Docker Engine - Community
 Version:           25.0.0
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        e758fe5
 Built:             Thu Jan 18 17:09:59 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.0
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       615dfdf
  Built:            Thu Jan 18 17:09:59 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.27
  GitCommit:        a1496014c916f9e62104b33d1bb5bd03b0858e59
 runc:
  Version:          1.1.11
  GitCommit:        v1.1.11-0-g4bccb38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client: Docker Engine - Community
 Version:    25.0.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.12.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.24.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 5
  Running: 5
  Paused: 0
  Stopped: 0
 Images: 5
 Server Version: 25.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: a1496014c916f9e62104b33d1bb5bd03b0858e59
 runc version: v1.1.11-0-g4bccb38
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.6.11-amd64
 Operating System: Debian GNU/Linux trixie/sid
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 30.88GiB
 Name: h0tmann
 ID: 670157fc-30a9-4aa4-807c-4fa28aec7ec7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

  1. LINK1
  2. LINK2
  3. LINK3
  4. LINK4

the-hotmann avatar Jan 21 '24 18:01 the-hotmann

Are these NFS devices available before the docker service is started? I wonder if the systemd unit needs a custom "After" condition added 🤔

thaJeztah avatar Jan 21 '24 21:01 thaJeztah

Yes, the SYNO runs 24/7 and is not rebooted at all.

That's what I read somewhere else as well. It waits for the network, but somehow there is a race condition when it comes to:

  • Network itself
  • NFS mount actually connected

But in this regard I am not an expert, so do not quote me :)

the-hotmann avatar Jan 21 '24 21:01 the-hotmann

Hm, right, so looking at the default systemd unit for the docker service; https://github.com/moby/moby/blob/5a3a101af2ff6fae24605107b1fbcf53fbb5c38e/contrib/init/systemd/docker.service#L4-L5

It currently waits for:

After=network-online.target docker.socket firewalld.service containerd.service time-set.target
Wants=network-online.target containerd.service

The containerd service already has a local-fs.target dependency to make sure local filesystems are mounted; https://github.com/containerd/containerd/blob/b66830ff6f8375ce8c7a583eaa03549eaa6707c4/containerd.service#L18

Which means that (because the docker service has After=containerd.service), local filesystems at least should be mounted.

I think what's needed in your setup is to have the remote-fs.target;

remote-fs.target

Similar to local-fs.target, but for remote mount points.

systemd automatically adds dependencies of type After= for this target unit to all SysV init script service units with an LSB header referring to the "$remote_fs" facility.

Given that remote filesystems are not something that's used by default by the Docker Engine, I don't think we should add this to the default systemd unit; doing so would likely delay startup of the service, which would be a regression for setups that don't use remote filesystems but are running on a system that does have them (but perhaps this can be discussed).

To add that target, you can use systemctl edit docker.service. This will create an override file that allows you to extend or override properties of the default systemd unit (Flatcar has a great page describing this in more depth);

sudo systemctl edit docker.service

That command will create a systemd "override" (or "drop-in") file and open it in your default editor. You can add your overrides in the file and save it. By default, the After= you specify in your override file is appended to the existing list in the After= of the default systemd unit (which should not be edited directly).

### Editing /etc/systemd/system/docker.service.d/override.conf
### Anything between here and the comment below will become the new contents of the file
[Unit]
After=remote-fs.target


### Lines below this comment will be discarded

### /lib/systemd/system/docker.service
# [Unit]
# ...

After you have edited and saved, you need to reload systemd to make it re-read the configuration;

sudo systemctl daemon-reload

You can check the new settings using systemctl show, which should now show the remote-fs.target included;

sudo systemctl show docker.service | grep ^After
After=containerd.service systemd-journald.socket docker.socket sysinit.target time-set.target network-online.target remote-fs.target system.slice firewalld.service basic.target

thaJeztah avatar Jan 22 '24 09:01 thaJeztah

Thanks for the detailed explanation.

I have followed your commands. Here is the check-command:

$ systemctl show docker.service | grep ^After
After=network-online.target basic.target sysinit.target firewalld.service docker.socket system.slice time-set.target containerd.service remote-fs.target systemd-journald.socket

I also reloaded the systemd daemon (a reboot would do this automatically anyway, but I did it manually regardless) and rebooted the server.

Again, all containers with NFS volumes are down. They do not start up again.

Given that remote filesystems are not something that's used by default by the Docker Engine, I don't think we should add this to the default systemd unit; doing so would likely delay startup of the service, which would be a regression for setups that don't use remote filesystems but are running on a system that does have them (but perhaps this can be discussed).

Is there a possibility to add this only when NFS (or any remote FS) is used anywhere in any Docker container?

But as mentioned above, this apparently did not fix the issue. Thanks for your help :)

the-hotmann avatar Jan 23 '24 08:01 the-hotmann

But as mentioned above, this apparently did not fix the issue.

😢 that's a shame; thanks for trying! I was hoping this would make sure that those remote filesystem mounts were up-and-running.

Possibly it requires a stronger dependency defined; more than After 🤔

Reading the documentation for After https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#Before=

Note that those settings are independent of and orthogonal to the requirement dependencies as configured by Requires=, Wants=, Requisite=, or BindsTo=.

It is a common pattern to include a unit name in both the After= and Wants= options, in which case the unit listed will be started before the unit that is configured with these options.

Perhaps that second line applies here; it might be worth trying whether adding remote-fs.target to Wants helps 🤔

thaJeztah avatar Jan 23 '24 09:01 thaJeztah

I did the following:

  1. systemctl edit docker.service
  2. added remote-fs.target also to Wants:

### Editing /etc/systemd/system/docker.service.d/override.conf
### Anything between here and the comment below will become the contents of the drop-in file

[Unit]
After=remote-fs.target
Wants=remote-fs.target

### Edits below this comment will be discarded

  3. systemctl daemon-reload
  4. systemctl show docker.service | grep ^After:

After=docker.socket network-online.target system.slice sysinit.target containerd.service systemd-journald.socket remote-fs.target basic.target firewalld.service time-set.target

  5. systemctl show docker.service | grep ^Wants:

Wants=network-online.target remote-fs.target containerd.service

  6. reboot

Still, the containers with NFS volumes do not start automatically.

the-hotmann avatar Jan 23 '24 11:01 the-hotmann

@thaJeztah is there any news, or is there a specific user to tag on this one?

Thanks in advance! :)

the-hotmann avatar Jan 28 '24 02:01 the-hotmann

Is there any related error message in the dockerd log?

sudo journalctl -e --no-pager  -u docker -g ' error while mounting volume '
# or if the above doesn't yield anything useful, try searching for the NFS server address you use
sudo journalctl -e --no-pager  -u docker -g '192.168.178.2'

vvoland avatar Jan 29 '24 09:01 vvoland

@vvoland thanks - I will reply once I am at home and have executed the commands.

the-hotmann avatar Jan 29 '24 11:01 the-hotmann

@vvoland thanks, the search for the IP itself returned something:

Jan 25 18:41:19 hostname dockerd[705]: time="2024-01-25T18:41:19.637798586+01:00" level=error msg="failed to start container" container=822210342a705a345accd6bfa16b69507b832bf01aec77ea4439f4b6d375c390 error="error while mounting volume '/var/lib/docker/volumes/share/_data': failed to mount local volume: mount :/volume1/NFS_SHARE/:/var/lib/docker/volumes/share/_data, data: addr=192.168.178.2,nfsvers=4,hard,timeo=600,retrans=3: network is unreachable"

Basically "network is unreachable" - and yet this is a standard Debian installation, following the official docs: https://docs.docker.com/engine/install/debian/

And this is my systemd unit file:

# /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target docker.socket firewalld.service containerd.service time-set.target
Wants=network-online.target containerd.service
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutStartSec=0
RestartSec=2
Restart=always

# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3

# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity

# Comment TasksMax if your systemd version does not support it.
# Only systemd 226 and above support this option.
TasksMax=infinity

# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes

# kill only the docker process, not all processes in the cgroup
KillMode=process
OOMScoreAdjust=-500

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/docker.service.d/override.conf
[Unit]
After=remote-fs.target
Wants=remote-fs.target

Hope this helps debugging it :)

the-hotmann avatar Jan 30 '24 01:01 the-hotmann

Just wanted to add:

the very same thing also happens when using SMB/CIFS mounts/volumes. I assume this applies to all network-based mounts/volumes.

I also confirmed the very same behaviour on another server, just to make sure it is not caused by any special config on my end.

the-hotmann avatar Jan 30 '24 09:01 the-hotmann

Hm... so I just realised that my suggestion of using systemd for this would work if the host had NFS mounts for these filesystems; but if the host does not have those, systemd would not be aware of them, so it won't take them into account.

Does this work if your host has a mount from these server(s)? (also see https://geraldonit.com/2023/02/25/auto-mount-nfs-share-using-systemd/)
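
For reference, a minimal sketch of such a host-level mount as a systemd mount unit; the server address and export path are taken from the volume definition in this thread, while the mount point /mnt/nfs_share is an assumption (note that the unit file name must match the Where= path):

# /etc/systemd/system/mnt-nfs_share.mount
[Unit]
Description=NFS share from the NAS (illustrative)

[Mount]
What=192.168.178.2:/volume1/NFS_SHARE
Where=/mnt/nfs_share
Type=nfs
Options=nfsvers=4

[Install]
WantedBy=remote-fs.target

Enabling it with sudo systemctl enable --now mnt-nfs_share.mount should make remote-fs.target pull in (and wait for) the mount at boot.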

In that case, it's also worth considering setting up the NFS mount on the host and, instead of this NFS volume definition for the container;

    driver_opts:
      type: "nfs"
      o: "addr=192.168.178.2,nfsvers=4"
      device: ":/volume1/NFS_SHARE/"

using a "bind" mount that points at the NFS-mounted path from the host. This could be a regular bind-mount, or a volume with the relevant mount-options set (https://github.com/moby/moby/issues/19990#issuecomment-248955005), something like;

    driver_opts:
      type: "none"
      o: "bind"
      device: "/path/to/nfs-mount/on-host/"

thaJeztah avatar Jan 30 '24 10:01 thaJeztah

Does this work if your host has a mount from these server(s)?

Sorry, I don't understand this question.

But here is something I have tried before, and it worked:

  1. setting up an NFS mount at /mnt/NFS_MOUNT/ (via /etc/fstab; an illustrative entry is sketched below)
  2. mapping it into the container just like any other folder.
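
For illustration, a minimal sketch of both steps, assuming the export from the volume definition earlier in this thread and an arbitrary container path /data (both assumptions):

# /etc/fstab - _netdev marks the mount as requiring the network
192.168.178.2:/volume1/NFS_SHARE  /mnt/NFS_MOUNT  nfs4  _netdev  0  0

# compose file: a plain bind mount of the NFS-mounted host path
services:
  app:
    volumes:
      - /mnt/NFS_MOUNT:/data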

This works, but it is not what I desire, since I want the mount and the whole connection to also be transferable via docker compose etc.

I feel like these volumes in Docker have a general/structural problem of not waiting for the mount to be active. By the way, can any of you confirm AND replicate this bug on your side?

the-hotmann avatar Jan 30 '24 17:01 the-hotmann

Same problem here. In my case, Proxmox with a Debian VM (Docker/Portainer) connecting to a Synology NAS over NFSv4. After a reboot, 8 of 30 containers fail to start, and unsurprisingly they're all the ones with NFS mounts. The containers spin right up when I click Start in Portainer, though.

So far I've tried different restart: settings and depends_on:, neither of which works. I really don't want to touch the host; I'd much rather get it working in Docker alone.

Still scouring the internet for a solution :)

mblanco4x4 avatar Feb 04 '24 03:02 mblanco4x4

@thaJeztah @vvoland is there any news on this, or is there something I can do to help?

the-hotmann avatar Feb 12 '24 22:02 the-hotmann

This problem still exists to this day.

the-hotmann avatar Aug 17 '24 13:08 the-hotmann

We're already depending on the network-online.target systemd target for the daemon to start.

If that doesn't work out of the box on your system, you might need to adjust your network-online conditions to suit your configuration: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

Note that:

network-online.target will ensure that all configured network devices are up and have an IP address assigned before the service is started. ... The right "wait" service must be enabled too (NetworkManager-wait-online.service if NetworkManager is used to configure the network, systemd-networkd-wait-online.service if systemd-networkd is used, etc.)
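
For example, a sketch of checking which network manager is in use and enabling the matching "wait online" service (which of the two units applies depends on your setup):

# check which network manager is active
systemctl is-active NetworkManager systemd-networkd

# if NetworkManager manages the network:
sudo systemctl enable NetworkManager-wait-online.service

# if systemd-networkd manages the network:
sudo systemctl enable systemd-networkd-wait-online.service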

vvoland avatar Aug 20 '24 09:08 vvoland