Fedora Lima image gets stuck in reboot loop if mirror is unavailable

Open ascopes opened this issue 1 month ago • 6 comments

Description

@jandubois suggested I raise an issue here as well with regards to https://github.com/runfinch/finch/issues/1632

It appears that if the Fedora mirrors are unavailable or inaccessible, dnf needs-restarting returns a non-zero exit code, which the cloud-init scripts treat the same as needing a reboot.

This results in no visible logging (that I can see) and the VM getting stuck in a reboot loop, chewing up significant host resources and being very difficult to debug.

From the original issue, this was my analysis:


When running using a Corporate Proxy that has its own SSL certificates, bringing up a Finch VM is problematic. Right now, upon first boot, we observe that finch vm init just hangs for about 15 minutes and then crashes.

Upon pulling all of this configuration and code to pieces, I found that the problem lies within the ISO that is downloaded. The script causing us problems is the following:

#!/bin/sh

# SPDX-FileCopyrightText: Copyright The Lima Authors
# SPDX-License-Identifier: Apache-2.0

set -eux

# Check if cloud-init forgot to reboot_if_required
# (only implemented for apt at the moment, not dnf)

if command -v dnf >/dev/null 2>&1; then
	# dnf-utils needs to be installed, for needs-restarting
	if dnf -h needs-restarting >/dev/null 2>&1; then
		# needs-restarting returns "false" if needed (!)
		if ! dnf needs-restarting -r >/dev/null 2>&1; then
			systemctl reboot
		fi
	fi
fi

Specifically, take note of the if ! dnf needs-restarting -r >/dev/null 2>&1; then systemctl reboot. Whilst it is true that dnf needs-restarting will return a non-zero exit code if we need to reboot, it also returns a non-zero exit code if it failed to complete.

It turns out that dnf needs-restarting dials out to the Fedora repository mirrors... under a corporate proxy that operates on L3/L4 (e.g. as part of a ZTNA), this won't work. You'll just get the following output (which is somewhat unhelpfully suppressed and sent to /dev/null here):

└─[127] <> docker run --rm -it fedora                               
[root@97829c00283b /]# dnf needs-restarting
Updating and loading repositories:
 Fedora 42 - aarch64 - Updates                                                                                             ???% [  <=>             ] |   0.0   B/s |   0.0   B |  00m01s
>>> Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f42&arch=aarch64 [SSL certificate problem: u
>>> Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f42&arch=aarch64 [SSL certificate problem: u
>>> Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f42&arch=aarch64 [SSL certificate problem: u
>>> Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f42&arch=aarch64 [SSL certificate problem: u
>>> Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.fedoraproject.org/metalink?repo=updates-released-f42&arch=aarch64 [SSL certificate problem: u
...

This then exits with a non-zero exit code.

This means if you have no side-loaded CA certificates, finch vm init will get stuck in a loop of repeatedly restarting the VM every 5 seconds or so, while providing no output of what the issue is, since everything is sent to /dev/null.
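To illustrate the logging point, here is a minimal sketch of a wrapper that stays quiet when a command succeeds but replays its output when it fails, instead of discarding everything to /dev/null. The name run_logged is hypothetical and is not part of Lima or Finch.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lima): run a command quietly.
# On success nothing is printed. On failure the captured output is
# replayed to stderr and the original exit status is returned.
run_logged() {
	out=$("$@" 2>&1) && return 0
	status=$?
	printf 'command failed (exit %s): %s\n%s\n' "$status" "$*" "$out" >&2
	return "$status"
}
```

Something like run_logged dnf needs-restarting -r would then leave the curl errors shown above in the boot log rather than hiding them.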

ascopes avatar Dec 08 '25 08:12 ascopes

Thanks for the detailed bug report! We can maybe check connectivity explicitly first, before trying other dnf commands. And improve the logging of errors, without spamming with expected output.

dnf check-update (returns error code 100 if there are updates, error code 1 on errors)

dnf needs-restarting (returns error code 1 if there are updates, pending reboot)
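As a rough sketch of how those two exit codes could be combined, assuming the values listed above hold: treat dnf check-update as the connectivity probe, and only trust the result of dnf needs-restarting -r when the probe did not fail. The function names repo_reachable and should_reboot are hypothetical and only illustrate the decision logic; this is not the actual fix.

```shell
#!/bin/sh
# Sketch only; function names are hypothetical.
# $1 is the exit status of `dnf check-update`:
#   0 = no updates, 100 = updates available, anything else = error.
repo_reachable() {
	case "$1" in
	0 | 100) return 0 ;; # metadata was fetched, so the repo is reachable
	*) return 1 ;;       # treat every other status as a failure
	esac
}

# $1 is the exit status of `dnf check-update`,
# $2 is the exit status of `dnf needs-restarting -r`.
should_reboot() {
	repo_reachable "$1" || return 1 # never reboot when the probe failed
	[ "$2" -ne 0 ]                   # non-zero here means a reboot is pending
}
```

The boot script would capture both statuses first (for example check=0; dnf check-update >/dev/null 2>&1 || check=$?) and call systemctl reboot only when should_reboot "$check" "$restart" succeeds.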

afbjorklund avatar Dec 08 '25 09:12 afbjorklund

I think this is reasonable.

Another option might be to check what the exit codes for dnf needs-restarting are. I haven't checked, but I'd hope they use distinct documented exit codes for each scenario. In that case it'd be worth checking for a specific value instead of the non-zero check currently in place.

ascopes avatar Dec 08 '25 09:12 ascopes

Also, we should only do the reboot when UpgradePackages=true (since it works in co-operation with cloud-init)

afbjorklund avatar Dec 08 '25 11:12 afbjorklund

Another option might be to check what the exit codes for dnf needs-restarting are. I haven't checked, but I'd hope

The code is horrible, unfortunately. It doesn't use exit codes, but throws exceptions if the check is successful...

            if need_reboot:
                print(_('Core libraries or services have been updated '
                        'since boot-up:'))
                for name in sorted(need_reboot):
                    print('  * %s' % name)
                print()
                print(_('Reboot is required to fully utilize these updates.'))
                print(_('More information:'),
                      'https://access.redhat.com/solutions/27943')
                raise dnf.exceptions.Error()  # Sets exit code 1
            else:
                print(_('No core libraries or services have been updated '
                        'since boot-up.'))
                print(_('Reboot should not be necessary.'))
                return None

https://github.com/rpm-software-management/dnf-plugins-core/blob/master/plugins/needs_restarting.py

The root cause for the need for this boot script, is that reboot_if_required is silently ignored by cloud-init...

  • https://github.com/lima-vm/lima/pull/2119#issuecomment-1882649733

afbjorklund avatar Dec 08 '25 11:12 afbjorklund

The PR #4442 has two mitigations:

  1. Only reboot when actually using upgradePackages setting

  2. Check connectivity with repo, before checking the packages

Can you verify if it works for you?

afbjorklund avatar Dec 10 '25 06:12 afbjorklund

Looks to be working, thanks

ascopes avatar Dec 10 '25 08:12 ascopes