
NFS mounts in /etc/fstab and cloud-init may cause boot hang

Open ubuntu-server-builder opened this issue 2 years ago • 10 comments

This bug was originally filed in Launchpad as LP: #1913354

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = None
date_created = 2021-01-26T23:35:15.552578+00:00
date_fix_committed = None
date_fix_released = None
id = 1913354
importance = medium
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1913354
milestone = None
owner = hggdh2
owner_name = C de-Avillez
private = False
status = triaged
submitter = hggdh2
submitter_name = C de-Avillez
tags = []
duplicates = []

Launchpad user C de-Avillez(hggdh2) wrote on 2021-01-26T23:35:15.552578+00:00

Azure, RHEL 7.8, 7.9 and OEL 7.8, 7.9.

On OEL 7.8 cloud-init is cloud-init-18.5-6.el7.x86_64

On both OEL and RHEL 7.* (certainly 7.8 and 7.9), if there is an NFS mount in /etc/fstab (it is unknown whether this also applies to NFSv4), boot may not complete. The end result is a hang, and the system is inaccessible over SSH or serial console login.

All points to a deadlock between the starting of the rpc.statd and rpc.statd-notify services and the cloud-init.service.

This happens because rpc.statd and rpc.statd-notify have the following dependencies declared:

rpc-statd.service:

    [Unit]
    Description=NFS status monitor for NFSv2/3 locking.
    DefaultDependencies=no
    Conflicts=umount.target
    Requires=nss-lookup.target rpcbind.socket
    Wants=network-online.target # <---
    After=network-online.target nss-lookup.target rpcbind.socket # <---

    PartOf=nfs-utils.service

    Wants=nfs-config.service
    After=nfs-config.service

    [Service]
    Environment=RPC_STATD_NO_NOTIFY=1
    EnvironmentFile=-/run/sysconfig/nfs-utils
    Type=forking
    PIDFile=/var/run/rpc.statd.pid
    ExecStart=/usr/sbin/rpc.statd $STATDARGS

rpc-statd-notify.service:

    [Unit]
    Description=Notify NFS peers of a restart
    DefaultDependencies=no
    Wants=network-online.target # <---
    After=local-fs.target network-online.target nss-lookup.target # <---

    # Do not start up in HA environments
    ConditionPathExists=!/var/lib/nfs/statd/sm.ha

    # if we run an nfs server, it needs to be running before we
    # tell clients that it has restarted.
    After=nfs-server.service

    PartOf=nfs-utils.service

    Wants=nfs-config.service
    After=nfs-config.service

    [Service]
    EnvironmentFile=-/run/sysconfig/nfs-utils
    Type=forking
    ExecStart=-/usr/sbin/sm-notify $SMNOTIFYARGS

while cloud-init.service is:

    [Unit]
    Description=Initial cloud-init job (metadata service crawler)
    Wants=cloud-init-local.service
    Wants=sshd-keygen.service
    Wants=sshd.service
    After=cloud-init-local.service
    After=NetworkManager.service network.service
    Before=network-online.target # <---
    Before=sshd-keygen.service
    Before=sshd.service
    Before=systemd-user-sessions.service
    ConditionPathExists=!/etc/cloud/cloud-init.disabled
    ConditionKernelCommandLine=!cloud-init=disabled

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/cloud-init init
    RemainAfterExit=yes
    TimeoutSec=0

    # Output needs to appear in instance console output
    StandardOutput=journal+console

    [Install]
    WantedBy=cloud-init.target

So cloud-init is to be started before network-online.target, while rpc-statd* are to be started after network-online.target.
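The ordering just described can be checked with a small sketch. It is only a model: it assumes the After=/Before= lines quoted above are the only ordering constraints that matter, and the `ORDERING` table and `boot_order` helper are names made up for illustration, not anything provided by cloud-init or systemd.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Ordering constraints copied from the unit files quoted above.
# Key = unit, value = set of units that must come before it.
ORDERING = {
    # cloud-init.service: Before=network-online.target
    "network-online.target": {"cloud-init.service"},
    # rpc-statd.service: After=network-online.target
    "rpc-statd.service": {"network-online.target"},
    # rpc-statd-notify.service: After=network-online.target
    "rpc-statd-notify.service": {"network-online.target"},
}


def boot_order(ordering):
    """Return one order that satisfies every constraint (predecessors first)."""
    return list(TopologicalSorter(ordering).static_order())


if __name__ == "__main__":
    print(boot_order(ORDERING))
```

Taken by themselves these declarations are consistent: cloud-init.service, then network-online.target, then the rpc-statd units. Under this model a hang would have to come from something that is not expressed in the unit files.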

CX has demonstrated this to my satisfaction.

I see a few possible paths here:

  1. CX has to change the (rpc-statd|rpc-statd-notify).service so that they now state:

    Before=network-online.target
    #Wants=network-online.target
    #After=network-online.target

  2. CX has to change cloud-init.service so that it now states:

    Wants=network-online.target
    After=network-online.target
    #Before=network-online.target

  3. CX removes the NFS mount from /etc/fstab, and adds it as a systemd .mount unit
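For the third option above (moving the NFS mount out of /etc/fstab into a systemd .mount unit), here is a rough sketch of what such a unit could look like, written as a small Python generator. Everything specific in it is an assumption for illustration: the server `nfs.example.com:/export/data`, the mount point `/mnt/data`, the `vers=3` option, and the `WantedBy=multi-user.target` choice are hypothetical, and `mount_unit_name` is a simplified stand-in for `systemd-escape --path --suffix=mount`, not a full reimplementation. Nothing in this thread confirms that this variant avoids the hang.

```python
def mount_unit_name(where):
    """Derive the unit file name from the mount point (simplified escaping)."""
    parts = [p for p in where.split("/") if p]
    if not parts:
        return "-.mount"  # the root file system
    path = "/".join(parts)
    out = []
    for i, ch in enumerate(path):
        if ch == "/":
            out.append("-")
        elif ch.isascii() and (ch.isalnum() or ch in ":_" or (ch == "." and i > 0)):
            out.append(ch)
        else:
            # Everything else, including a literal "-", is written as \xXX.
            out.extend("\\x%02x" % b for b in ch.encode("utf-8"))
    return "".join(out) + ".mount"


def mount_unit(what, where, fstype="nfs", options="defaults"):
    """Return the text of a .mount unit for one former /etc/fstab entry."""
    return "\n".join([
        "[Unit]",
        f"Description=Mount {what} on {where}",
        "Wants=network-online.target",
        "After=network-online.target",
        "",
        "[Mount]",
        f"What={what}",
        f"Where={where}",
        f"Type={fstype}",
        f"Options={options}",
        "",
        "[Install]",
        "WantedBy=multi-user.target",  # assumption; adjust to the local setup
        "",
    ])


if __name__ == "__main__":
    # Hypothetical server and mount point, for illustration only.
    print("# /etc/systemd/system/" + mount_unit_name("/mnt/data"))
    print(mount_unit("nfs.example.com:/export/data", "/mnt/data", options="vers=3"))
```

The file name has to match the mount point, which is why the name is derived from `Where=` rather than chosen freely; a mount point containing a hyphen, such as `/mnt/nfs-data`, ends up with an escaped name.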

CX opted for change #1 above, and now sees no boot issues.

There is a Red Hat bug about this: https://bugzilla.redhat.com/show_bug.cgi?id=1858930, but it was closed WONTFIX because... support for RHEL 7 ended :-(. I also searched Bugzilla and Launchpad for related bugs on RHEL (7|8), but did not find any.

ubuntu-server-builder avatar May 12 '23 11:05 ubuntu-server-builder

Launchpad user Dan Watkins(oddbloke) wrote on 2021-01-29T16:53:00.464810+00:00

Thanks for using cloud-init and taking the time to file a bug report!

"All points to a deadlock between the starting of the rpc.statd and rpc.statd-notify services and the cloud-init.service."

I don't see any evidence presented of a deadlock. The NFS units presumably should run after networking is available (that's what the N stands for, after all) and cloud-init.service is the first opportunity to run user's configuration, so having it run at a predictable point in boot before "most" things is desirable.

I don't doubt that you're hitting an issue, but we don't have enough information about it. Can you explain in a little more detail what the exact issue you're seeing is? If possible, please also include the output of cloud-init collect-logs from an affected instance, then move this back to New.

Thanks again!

Dan


Launchpad user C de-Avillez(hggdh2) wrote on 2021-01-29T21:10:19.946775+00:00

Hi Dan, thank you for looking into this.

The issue seems to be driven by cloud-init.service starting before the network is actually available (the cloud-init.service definition has "Before=network-online.target"). But the rpc-statd* service definitions are set to start only after the network is fully available (Wants=network-online.target, After=network-online.target).

So... c-i starts, and drives the NFS mounts. But the dependent services (again, specifically the rpc-statd*.service) will not start until we reach the required target.
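The argument in this comment can be sketched as a graph check. This is a model under stated assumptions, not evidence: the `DECLARED` edges come from the unit files quoted earlier in this bug, while the extra edge in `WITH_NFS_MOUNT` encodes the claim made here (cloud-init drives the NFS mount, and the mount cannot finish until rpc-statd.service is up). The names `find_cycle`, `DECLARED` and `WITH_NFS_MOUNT` are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+


def find_cycle(waits_on):
    """Return the cycle reported by graphlib, or None if an order exists."""
    try:
        TopologicalSorter(waits_on).prepare()
    except CycleError as err:
        # graphlib puts the list of nodes forming the cycle in args[1].
        return err.args[1]
    return None


# Key = unit, value = set of things it has to wait for.
DECLARED = {
    # cloud-init.service: Before=network-online.target
    "network-online.target": {"cloud-init.service"},
    # rpc-statd.service: After=network-online.target
    "rpc-statd.service": {"network-online.target"},
}

# Assumption from the comment above: cloud-init.service cannot finish
# until the NFS mount it triggered has rpc-statd.service available.
WITH_NFS_MOUNT = {**DECLARED, "cloud-init.service": {"rpc-statd.service"}}

if __name__ == "__main__":
    print("declared only:", find_cycle(DECLARED))
    print("with NFS mount:", find_cycle(WITH_NFS_MOUNT))
```

In this model the declared ordering alone has no cycle, and the cycle only appears once the runtime wait is added. That would be consistent with systemd not reporting an ordering cycle while the boot still hangs, if the assumed edge really holds on the affected systems.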

I will check with CXs what we can post in a public bug (or I may move this bug to private) due to PII restrictions.


Launchpad user Rakesh Ginjupalli(linuxelf001) wrote on 2021-02-02T03:55:50.275225+00:00

Can the rpc-statd service be changed to run after network.target instead of network-online.target?


Launchpad user C de-Avillez(hggdh2) wrote on 2021-02-15T20:29:38.962144+00:00

Moving bug to Private, in preparation for logs upload.


Launchpad user C de-Avillez(hggdh2) wrote on 2021-02-22T21:19:55.134180+00:00

Here are the cloud-init logs, provided by CX and, I think, with PII excised. CX asks that this bug be kept private until the c-i logs are deleted.


Launchpad user C de-Avillez(hggdh2) wrote on 2021-02-23T16:58:50.208370+00:00

Updated logs.


Launchpad user Richard Harding(rharding) wrote on 2021-03-02T17:21:07.903999+00:00

Spoke with Anh today who has a couple of other ideas with mount options and will reply.


Launchpad user C de-Avillez(hggdh2) wrote on 2021-04-20T14:37:14.104885+00:00

Had a chat with Anh. Will let both customer and my colleagues know the current status.

A summary:

  • no solution from RH for this, since RHEL 7 is already EOL;
  • NFSv4 does not have this problem, since it does not have the same dependencies on rpc-statd*;
  • a patch is being discussed between us and c-i upstream;
  • if (and when) the patch is committed, it will still take around 12 months for Red Hat to incorporate it into RHEL 8.


Launchpad user C de-Avillez(hggdh2) wrote on 2022-04-12T16:12:41.664533+00:00

Deleted attachment and moved the bug to PUBLIC. This has completely stalled since my last comment.


Launchpad user James Falcon(falcojr) wrote on 2022-04-18T13:21:06.660626+00:00

"a patch is being discussed between us and c-i upstream"

Do you happen to know the details of what was discussed? Unfortunately the upstream folks initially involved are no longer involved with the project.
