Fedora Lima image gets stuck in reboot loop if mirror is unavailable
Description
@jandubois suggested I raise an issue here as well with regards to https://github.com/runfinch/finch/issues/1632
It appears that if the Fedora mirrors are unavailable or inaccessible, dnf needs-restarting returns a non-zero exit code, which the cloud-init scripts treat the same as needing a reboot.
This results in no visible logging (that I can see) and the VM getting stuck in a reboot loop, chewing up significant host resources and being very difficult to debug.
From the original issue, this was my analysis:
When running using a Corporate Proxy that has its own SSL certificates, bringing up a Finch VM is problematic. Right now, upon first boot, we observe that finch vm init just hangs for about 15 minutes and then crashes.
Upon pulling all of this configuration and code to pieces, I found that the problem lies within the ISO that is downloaded. The script causing us problems is the following:
#!/bin/sh
# SPDX-FileCopyrightText: Copyright The Lima Authors
# SPDX-License-Identifier: Apache-2.0
set -eux
# Check if cloud-init forgot to reboot_if_required
# (only implemented for apt at the moment, not dnf)
if command -v dnf >/dev/null 2>&1; then
	# dnf-utils needs to be installed, for needs-restarting
	if dnf -h needs-restarting >/dev/null 2>&1; then
		# needs-restarting returns "false" if needed (!)
		if ! dnf needs-restarting -r >/dev/null 2>&1; then
			systemctl reboot
		fi
	fi
fi
Specifically, take note of the if ! dnf needs-restarting -r >/dev/null 2>&1; then systemctl reboot. Whilst it is true that dnf needs-restarting will return a non-zero exit code if we need to reboot, it also returns a non-zero exit code if it failed to complete.
It turns out that dnf needs-restarting dials out to the Fedora repository mirrors... under a corporate proxy that operates on L3/L4 (e.g. as part of a ZTNA), this won't work. You'll just get the following output (which is somewhat unhelpfully suppressed and sent to /dev/null here):
└─[127] <> docker run --rm -it fedora
[root@97829c00283b /]# dnf needs-restarting
Updating and loading repositories:
Fedora 42 - aarch64 - Updates ???% [ <=> ] | 0.0 B/s | 0.0 B | 00m01s
>>> Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f42&arch=aarch64 [SSL certificate problem: u
>>> Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f42&arch=aarch64 [SSL certificate problem: u
>>> Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f42&arch=aarch64 [SSL certificate problem: u
>>> Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f42&arch=aarch64 [SSL certificate problem: u
>>> Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f42&arch=aarch64 [SSL certificate problem: u
...
This then exits with a non-zero exit code.
This means if you have no side-loaded CA certificates, finch vm init will get stuck in a loop of repeatedly restarting the VM every 5 seconds or so, while providing no output of what the issue is, since everything is sent to /dev/null.
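One way to keep the failure visible without logging the normal output would be to capture the command's output and print it only when the command fails. The following is a minimal sketch of that idea, not code from Lima or Finch; run_quiet is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: run_quiet is a hypothetical helper, not part of Lima.

# Run a command with stdout and stderr captured. Print the captured
# output to stderr only if the command fails, then return the
# command's own exit code so the caller can still inspect it.
run_quiet() {
	output=$("$@" 2>&1) && return 0
	rc=$?
	echo "command failed (exit ${rc}): $*" >&2
	echo "${output}" >&2
	return "${rc}"
}
```

With a helper like this, the curl errors shown above would end up in the boot log instead of /dev/null. It does not by itself stop the reboot loop, since the exit code is still non-zero in both the "reboot needed" and the "mirror unreachable" case.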
Thanks for the detailed bug report! We could check connectivity explicitly first, before trying other dnf commands, and improve the logging of errors without spamming the log with expected output.
dnf check-update (returns error code 100 if there are updates, error code 1 on errors)
dnf needs-restarting (returns error code 1 if there are updates, pending reboot)
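The two exit-code conventions above could be combined roughly as follows. This is only a sketch under the assumption that the codes are exactly as listed (with 0 meaning no updates and no error); the function names are hypothetical, and it is not the code from any Lima PR.

```shell
#!/bin/sh
# Sketch only: these function names are hypothetical, not part of Lima.

# Map the exit code of "dnf check-update" to a word, using the codes
# listed above: 100 = updates available, 1 = error. Exit code 0 is
# treated as "nothing to update"; any other value is treated as an error.
classify_check_update() {
	case "$1" in
	0) echo "up-to-date" ;;
	100) echo "updates" ;;
	*) echo "error" ;;
	esac
}

# Decide whether a reboot is justified.
#   $1: exit code of "dnf check-update"
#   $2: exit code of "dnf needs-restarting -r"
should_reboot() {
	# If the repositories could not be reached, the needs-restarting
	# result cannot be trusted, so never reboot in that case.
	[ "$(classify_check_update "$1")" != "error" ] || return 1
	# needs-restarting signals "reboot needed" with a non-zero exit code.
	[ "$2" -ne 0 ]
}
```

A caller would capture each exit code with something like rc=0; dnf check-update || rc=$?, which also keeps set -e from aborting the script, and then call systemctl reboot only if should_reboot succeeds.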
I think this is reasonable.
Another option might be to check what the exit codes for dnf needs-restarting are. I haven't checked, but I'd hope they use distinct documented exit codes for each scenario. In that case it'd be worth checking for a specific value instead of the non-zero check currently in place.
Also, we should only do the reboot when UpgradePackages=true (since it works in co-operation with cloud-init)
Another option might be to check what the exit codes for dnf needs-restarting are. I haven't checked, but I'd hope
The code is horrible, unfortunately. It doesn't use distinct exit codes, but raises a generic exception when the check finds that a reboot is needed, which sets exit code 1...
if need_reboot:
    print(_('Core libraries or services have been updated '
            'since boot-up:'))
    for name in sorted(need_reboot):
        print(' * %s' % name)
    print()
    print(_('Reboot is required to fully utilize these updates.'))
    print(_('More information:'),
          'https://access.redhat.com/solutions/27943')
    raise dnf.exceptions.Error()  # Sets exit code 1
else:
    print(_('No core libraries or services have been updated '
            'since boot-up.'))
    print(_('Reboot should not be necessary.'))
    return None
https://github.com/rpm-software-management/dnf-plugins-core/blob/master/plugins/needs_restarting.py
The root cause of the need for this boot script is that reboot_if_required is silently ignored by cloud-init...
- https://github.com/lima-vm/lima/pull/2119#issuecomment-1882649733
The PR #4442 has two mitigations:
- Only reboot when actually using upgradePackages setting
- Check connectivity with repo, before checking the packages
Can you verify if it works for you?
Looks to be working, thanks