NFS mounts in /etc/fstab and cloud-init may cause boot hang
This bug was originally filed in Launchpad as LP: #1913354
Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = None
date_created = 2021-01-26T23:35:15.552578+00:00
date_fix_committed = None
date_fix_released = None
id = 1913354
importance = medium
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1913354
milestone = None
owner = hggdh2
owner_name = C de-Avillez
private = False
status = triaged
submitter = hggdh2
submitter_name = C de-Avillez
tags = []
duplicates = []
Launchpad user C de-Avillez(hggdh2) wrote on 2021-01-26T23:35:15.552578+00:00
Azure, RHEL 7.8, 7.9 and OEL 7.8, 7.9.
On OEL 7.8 cloud-init is cloud-init-18.5-6.el7.x86_64
On both OEL and RHEL 7.* (certainly 7.8 and 7.9), if we have an NFS mount in /etc/fstab (unknown if this applies to NFSv4), then boot may not complete. The end result is a hang, and the system is inaccessible via SSH or serial console login.
All points to a deadlock between the starting of the rpc.statd and rpc.statd-notify services and the cloud-init.service.
This happens because rpc.statd and rpc.statd-notify have the following dependencies declared:
rpc-statd.service:
[Unit]
Description=NFS status monitor for NFSv2/3 locking.
DefaultDependencies=no
Conflicts=umount.target
Requires=nss-lookup.target rpcbind.socket
Wants=network-online.target # <---
After=network-online.target nss-lookup.target rpcbind.socket # <---

PartOf=nfs-utils.service

Wants=nfs-config.service
After=nfs-config.service

[Service]
Environment=RPC_STATD_NO_NOTIFY=1
EnvironmentFile=-/run/sysconfig/nfs-utils
Type=forking
PIDFile=/var/run/rpc.statd.pid
ExecStart=/usr/sbin/rpc.statd $STATDARGS

rpc-statd-notify.service:
[Unit]
Description=Notify NFS peers of a restart
DefaultDependencies=no
Wants=network-online.target # <---
After=local-fs.target network-online.target nss-lookup.target # <---

# Do not start up in HA environments
ConditionPathExists=!/var/lib/nfs/statd/sm.ha

# if we run an nfs server, it needs to be running before we
# tell clients that it has restarted.
After=nfs-server.service

PartOf=nfs-utils.service

Wants=nfs-config.service
After=nfs-config.service

[Service]
EnvironmentFile=-/run/sysconfig/nfs-utils
Type=forking
ExecStart=-/usr/sbin/sm-notify $SMNOTIFYARGS

while cloud-init.service is:
[Unit]
Description=Initial cloud-init job (metadata service crawler)
Wants=cloud-init-local.service
Wants=sshd-keygen.service
Wants=sshd.service
After=cloud-init-local.service
After=NetworkManager.service network.service
Before=network-online.target # <---
Before=sshd-keygen.service
Before=sshd.service
Before=systemd-user-sessions.service
ConditionPathExists=!/etc/cloud/cloud-init.disabled
ConditionKernelCommandLine=!cloud-init=disabled

[Service]
Type=oneshot
ExecStart=/usr/bin/cloud-init init
RemainAfterExit=yes
TimeoutSec=0

# Output needs to appear in instance console output
StandardOutput=journal+console

[Install]
WantedBy=cloud-init.target

So cloud-init is to be started before network-online.target, while rpc-statd* are to be started after network-online.target.
CX has demonstrated this to my satisfaction.
I see a few possible paths here:
- CX has to change the (rpc-statd|rpc-statd-notify).service so that they now state:
  Before=network-online.target
  #Wants=network-online.target
  #After=network-online.target
- CX has to change cloud-init.service so that it now states:
  Wants=network-online.target
  After=network-online.target
  #Before=network-online.target
- CX removes the NFS mount from /etc/fstab, and adds it as a systemd .mount unit
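For the third path, a minimal sketch of what such a .mount unit could look like. The server name, export path, and mount point below are hypothetical placeholders, not values from this report, and whether this avoids the hang has not been confirmed here.

```ini
# /etc/systemd/system/mnt-data.mount
# Hypothetical example. The unit file name must match the mount point:
# /mnt/data -> mnt-data.mount
[Unit]
Description=NFS mount for /mnt/data (hypothetical example)
Wants=network-online.target
After=network-online.target

[Mount]
# Placeholder server and export; replace with the real ones.
What=nfs.example.com:/export/data
Where=/mnt/data
Type=nfs

[Install]
WantedBy=multi-user.target
```

The corresponding /etc/fstab line would be removed, followed by `systemctl daemon-reload` and `systemctl enable mnt-data.mount`.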
CX opted for change #1 above, and now sees no boot issues.
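A sketch of how change #1 might be applied to rpc-statd.service, assuming it is done as a full copy of the unit under /etc/systemd/system (which shadows the packaged file). A full copy is assumed because, as far as I know, a drop-in can only add dependencies such as After= and Wants=, not remove them. Keeping nss-lookup.target and rpcbind.socket in After= is my reading of the change; the report only says the network-online.target ordering was commented out.

```ini
# /etc/systemd/system/rpc-statd.service
# Assumed location: a full copy of the packaged unit with the ordering changed.
[Unit]
Description=NFS status monitor for NFSv2/3 locking.
DefaultDependencies=no
Conflicts=umount.target
Requires=nss-lookup.target rpcbind.socket
# Changed: order before network-online.target instead of after it.
Before=network-online.target
#Wants=network-online.target
# Assumption: the remaining After= entries are kept as they were.
After=nss-lookup.target rpcbind.socket

PartOf=nfs-utils.service

Wants=nfs-config.service
After=nfs-config.service

[Service]
Environment=RPC_STATD_NO_NOTIFY=1
EnvironmentFile=-/run/sysconfig/nfs-utils
Type=forking
PIDFile=/var/run/rpc.statd.pid
ExecStart=/usr/sbin/rpc.statd $STATDARGS
```

rpc-statd-notify.service would need the same treatment, followed by `systemctl daemon-reload`.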
There is a Red Hat bug about this: https://bugzilla.redhat.com/show_bug.cgi?id=1858930, but it was closed WONTFIX because... support for RHEL7 ended :-(. I also searched Bugzilla and Launchpad for related bugs on RHEL(7|8), but did not find any.
Launchpad user Dan Watkins(oddbloke) wrote on 2021-01-29T16:53:00.464810+00:00
Thanks for using cloud-init and taking the time to file a bug report!
"All points to a deadlock between the starting of the rpc.statd and rpc.statd-notify services and the cloud-init.service."
I don't see any evidence presented of a deadlock. The NFS units presumably should run after networking is available (that's what the N stands for, after all) and cloud-init.service is the first opportunity to run user's configuration, so having it run at a predictable point in boot before "most" things is desirable.
I don't doubt that you're hitting an issue, but we don't have enough information about it. Can you explain in a little more detail what the exact issue you're seeing is? If possible, please also include the output of cloud-init collect-logs from an affected instance, then move this back to New.
Thanks again!
Dan
Launchpad user C de-Avillez(hggdh2) wrote on 2021-01-29T21:10:19.946775+00:00
Hi Dan, thank you for looking into this.
The issue seems to be driven by cloud-init.service starting before the network is actually available (in the cloud-init.service definition, we have "Before=network-online.target"). But the rpc-statd* service definitions are set to start only after the network is fully available (Wants=network-online.target, After=network-online.target).
So... c-i starts, and drives the NFS mounts. But the dependent services (again, specifically the rpc-statd*.service) will not start until we reach the required target.
I will check with CXs what we can post in a public bug (or I may move this bug to private) due to PII restrictions.
Launchpad user Rakesh Ginjupalli(linuxelf001) wrote on 2021-02-02T03:55:50.275225+00:00
Can rpc-statd service change to run after network.target instead of network-online.target?
Launchpad user C de-Avillez(hggdh2) wrote on 2021-02-15T20:29:38.962144+00:00
Moving bug to Private, in preparation for logs upload.
Launchpad user C de-Avillez(hggdh2) wrote on 2021-02-22T21:19:55.134180+00:00
Here we have the cloud-init logs, provided by CX, and -- I think -- with PII excised. CX asks this bug to be kept private until the c-i logs are deleted.
Launchpad user C de-Avillez(hggdh2) wrote on 2021-02-23T16:58:50.208370+00:00
updated logs.
Launchpad user Richard Harding(rharding) wrote on 2021-03-02T17:21:07.903999+00:00
Spoke with Anh today who has a couple of other ideas with mount options and will reply.
Launchpad user C de-Avillez(hggdh2) wrote on 2021-04-20T14:37:14.104885+00:00
Had a chat with Anh. Will let both customer and my colleagues know the current status.
A summary:
- no solution from RH for this, since RHEL 7 is already EOL-ed;
- NFSv4 does not have this problem, since it does not have the same dependencies on rpc-statd*;
- a patch is being discussed between us and c-i upstream;
- if (and when) the patch is committed, it will still take around 12 months for RHEL to incorporate it into RHEL 8.
Launchpad user C de-Avillez(hggdh2) wrote on 2022-04-12T16:12:41.664533+00:00
Deleted attachment and moved the bug to PUBLIC. This has completely stalled since my last comment.
Launchpad user James Falcon(falcojr) wrote on 2022-04-18T13:21:06.660626+00:00
"a patch is being discussed between us and c-i upstream"
Do you happen to know the details of what was discussed? Unfortunately the upstream folks initially involved are no longer involved with the project.