This bug was originally filed in Launchpad as LP: #1913354

Launchpad details

affected_projects = []
assignee = None
assignee_name = None
date_closed = None
date_created = 2021-01-26T23:35:15.552578+00:00
date_fix_committed = None
date_fix_released = None
id = 1913354
importance = medium
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1913354
milestone = None
owner = hggdh2
owner_name = C de-Avillez
private = False
status = triaged
submitter = hggdh2
submitter_name = C de-Avillez
tags = []
duplicates = []

Launchpad user C de-Avillez(hggdh2) wrote on 2021-01-26T23:35:15.552578+00:00

Azure, RHEL 7.8, 7.9 and OEL 7.8, 7.9.

On OEL 7.8, the installed package is cloud-init-18.5-6.el7.x86_64.

On both OEL and RHEL 7.* (certainly 7.8 and 7.9), if there is an NFS mount in /etc/fstab (it is unknown whether this also applies to NFSv4), boot may not complete. The end result is a hang, and the system is inaccessible via SSH or serial console login.

All points to a deadlock between the starting of the rpc.statd and rpc.statd-notify services and the cloud-init.service.

This happens because rpc.statd and rpc.statd-notify have the following dependencies declared:

rpc-statd.service:

    [Unit]
    Description=NFS status monitor for NFSv2/3 locking.
    DefaultDependencies=no
    Conflicts=umount.target
    Requires=nss-lookup.target rpcbind.socket
    Wants=network-online.target    # <---
    After=network-online.target nss-lookup.target rpcbind.socket    # <---

    PartOf=nfs-utils.service

    Wants=nfs-config.service
    After=nfs-config.service

    [Service]
    Environment=RPC_STATD_NO_NOTIFY=1
    EnvironmentFile=-/run/sysconfig/nfs-utils
    Type=forking
    PIDFile=/var/run/rpc.statd.pid
    ExecStart=/usr/sbin/rpc.statd $STATDARGS

rpc-statd-notify.service:

    [Unit]
    Description=Notify NFS peers of a restart
    DefaultDependencies=no
    Wants=network-online.target    # <---
    After=local-fs.target network-online.target nss-lookup.target    # <---

    # Do not start up in HA environments
    ConditionPathExists=!/var/lib/nfs/statd/sm.ha

    # if we run an nfs server, it needs to be running before we
    # tell clients that it has restarted.
    After=nfs-server.service

    PartOf=nfs-utils.service

    Wants=nfs-config.service
    After=nfs-config.service

    [Service]
    EnvironmentFile=-/run/sysconfig/nfs-utils
    Type=forking
    ExecStart=-/usr/sbin/sm-notify $SMNOTIFYARGS

while cloud-init.service is:

    [Unit]
    Description=Initial cloud-init job (metadata service crawler)
    Wants=cloud-init-local.service
    Wants=sshd-keygen.service
    Wants=sshd.service
    After=cloud-init-local.service
    After=NetworkManager.service network.service
    Before=network-online.target    # <---
    Before=sshd-keygen.service
    Before=sshd.service
    Before=systemd-user-sessions.service
    ConditionPathExists=!/etc/cloud/cloud-init.disabled
    ConditionKernelCommandLine=!cloud-init=disabled

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/cloud-init init
    RemainAfterExit=yes
    TimeoutSec=0

    # Output needs to appear in instance console output
    StandardOutput=journal+console

    [Install]
    WantedBy=cloud-init.target

So cloud-init is to be started before network-online.target, while rpc-statd* are to be started after network-online.target.

CX has demonstrated this to my satisfaction.

I see a few possible paths here:

1. CX changes the (rpc-statd|rpc-statd-notify).service units so that they state:

    Before=network-online.target
    #Wants=network-online.target
    #After=network-online.target

2. CX changes cloud-init.service so that it states:

    Wants=network-online.target
    After=network-online.target
    #Before=network-online.target

3. CX removes the NFS mount from /etc/fstab and adds it as a systemd .mount unit.

CX opted for change #1 above, and now sees no boot issues.

There is a Red Hat bug about this: https://bugzilla.redhat.com/show_bug.cgi?id=1858930, but it was closed WONTFIX because... support for RHEL 7 ended :-(. I also searched Bugzilla and Launchpad for related bugs on RHEL (7|8), but did not find any.

May 12 '23 11:05 ubuntu-server-builder

Launchpad user Dan Watkins(oddbloke) wrote on 2021-01-29T16:53:00.464810+00:00

Thanks for using cloud-init and taking the time to file a bug report!

"All points to a deadlock between the starting of the rpc.statd and rpc.statd-notify services and the cloud-init.service."

I don't see any evidence presented of a deadlock. The NFS units presumably should run after networking is available (that's what the N stands for, after all) and cloud-init.service is the first opportunity to run user's configuration, so having it run at a predictable point in boot before "most" things is desirable.

I don't doubt that you're hitting an issue, but we don't have enough information about it. Can you explain in a little more detail what the exact issue you're seeing is? If possible, please also include the output of cloud-init collect-logs from an affected instance, then move this back to New.

Thanks again!

Dan


Launchpad user C de-Avillez(hggdh2) wrote on 2021-01-29T21:10:19.946775+00:00

Hi Dan, thank you for looking into this.

The issue seems to be driven by cloud-init.service starting before the network is actually available (the cloud-init.service definition has "Before=network-online.target"). But the rpc-statd* service definitions are set to start only after the network is fully available (Wants=network-online.target, After=network-online.target).

So... c-i starts, and drives the NFS mounts. But the dependent services (again, specifically the rpc-statd*.service) will not start until we reach the required target.
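The chain described above can be modeled as a small wait-for graph. This is only an illustrative sketch: the unit names come from the unit files quoted in this report, the `cloud-init.service` to `rpc-statd.service` edge encodes the assumption stated above (c-i drives the NFS mount, which in turn needs rpc-statd), and the function name `find_wait_cycle` is made up for the example.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle in a wait-for graph as a list of nodes, or None."""
    path: List[str] = []  # nodes on the current depth-first path
    done = set()          # nodes fully explored without finding a cycle

    def visit(node: str) -> Optional[List[str]]:
        if node in path:
            # We came back to a node we are still waiting on: a cycle.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        for dep in waits_on.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        done.add(node)
        return None

    for start in waits_on:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# "A waits on B" means A cannot be reached or finish until B has.
reported = {
    # cloud-init.service declares Before=network-online.target
    "network-online.target": ["cloud-init.service"],
    # rpc-statd.service declares After=network-online.target
    "rpc-statd.service": ["network-online.target"],
    # Assumption from this report: cloud-init drives the NFS mount,
    # and that mount cannot complete until rpc-statd has started.
    "cloud-init.service": ["rpc-statd.service"],
}

# Proposed change #1: rpc-statd drops After=network-online.target and
# declares Before=network-online.target instead.
with_change_1 = {
    "network-online.target": ["cloud-init.service", "rpc-statd.service"],
    "rpc-statd.service": [],
    "cloud-init.service": ["rpc-statd.service"],
}

print(find_wait_cycle(reported))
print(find_wait_cycle(with_change_1))
```

Under these assumptions the first graph contains a cycle through all three units, and the second one does not, which is consistent with CX no longer seeing the hang after applying change #1.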

I will check with CXs what we can post in a public bug (or I may move this bug to private) due to PII restrictions.


Launchpad user Rakesh Ginjupalli(linuxelf001) wrote on 2021-02-02T03:55:50.275225+00:00

Can rpc-statd service change to run after network.target instead of network-online.target?


Launchpad user C de-Avillez(hggdh2) wrote on 2021-02-15T20:29:38.962144+00:00

Moving bug to Private, in preparation for logs upload.


Launchpad user C de-Avillez(hggdh2) wrote on 2021-02-22T21:19:55.134180+00:00

Here are the cloud-init logs, provided by CX, and -- I think -- with PII excised. CX asks that this bug be kept private until the c-i logs are deleted.


Launchpad user C de-Avillez(hggdh2) wrote on 2021-02-23T16:58:50.208370+00:00

updated logs.


Launchpad user Richard Harding(rharding) wrote on 2021-03-02T17:21:07.903999+00:00

Spoke with Anh today who has a couple of other ideas with mount options and will reply.


Launchpad user C de-Avillez(hggdh2) wrote on 2021-04-20T14:37:14.104885+00:00

Had a chat with Anh. Will let both customer and my colleagues know the current status.

A summary:

no solution from RH for this, since RHEL 7 is already EOL;
NFSv4 does not have this problem, since it does not have the same dependencies on rpc-statd*;
a patch is being discussed between us and c-i upstream;
if (and when) the patch is committed, it will still take around 12 months for it to be incorporated into RHEL 8.
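Since the summary says NFSv4 mounts do not depend on rpc-statd*, it may help to check which /etc/fstab entries are explicitly NFSv2/3. The following is a rough sketch under stated assumptions: the function name `relies_on_statd` and the sample server and paths are hypothetical, only the `vers=`/`nfsvers=` options and the `nfs`/`nfs4` filesystem types are inspected, and an entry with no explicit version is reported as undetermined because the version is then negotiated at mount time.

```python
from typing import Optional


def relies_on_statd(fstab_line: str) -> Optional[bool]:
    """Guess whether an fstab entry is an NFSv2/3 mount.

    True  -> type nfs with an explicit non-v4 version option
    False -> not NFS at all, or explicitly NFSv4
    None  -> type nfs with no explicit version (cannot tell from the line)
    """
    line = fstab_line.strip()
    if not line or line.startswith("#"):
        return False
    fields = line.split()
    if len(fields) < 3:
        return False  # malformed entry, nothing to classify
    fstype = fields[2]
    options = fields[3].split(",") if len(fields) > 3 else []
    if fstype != "nfs":
        return False  # covers nfs4 and every non-NFS filesystem
    for opt in options:
        key, _, value = opt.partition("=")
        if key in ("vers", "nfsvers"):
            return not value.startswith("4")
    return None


# Hypothetical entries, for illustration only.
for entry in (
    "nfs.example.com:/export /mnt/data nfs vers=3,defaults 0 0",
    "nfs.example.com:/export /mnt/data nfs4 defaults 0 0",
    "nfs.example.com:/export /mnt/data nfs defaults 0 0",
):
    print(relies_on_statd(entry), entry)
```

Whether a flagged entry actually triggers the hang on a given system was not established in this thread, so the result is only a hint for where to look.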


Launchpad user C de-Avillez(hggdh2) wrote on 2022-04-12T16:12:41.664533+00:00

Deleted attachment and moved the bug to PUBLIC. This has completely stalled since my last comment.


Launchpad user James Falcon(falcojr) wrote on 2022-04-18T13:21:06.660626+00:00

"a patch is being discussed between us and c-i upstream"

Do you happen to know the details of what was discussed? Unfortunately the upstream folks initially involved are no longer involved with the project.


NFS mounts in /etc/fstab and cloud-init may cause boot hang
