Dracut immediately stops emergency.service when emergency.target is invoked by gpt-auto-root timeout

Open RomanValov opened this issue 11 months ago • 5 comments

Describe the bug

I was debugging odd systemd-gpt-auto-root behavior that prevents dracut hostonly images from booting, and ran into this bug when the gpt-auto-root device timed out. systemd is supposed to enter emergency mode in that case. According to the log messages, emergency.target and emergency.service are actually started but then immediately stopped, and no emergency shell is opened.

Distribution used / Dracut version

Fedora 41 (dracut 103-3.fc41), Ubuntu 24.10 (103-1ubuntu3), Debian testing (105-2)

Init system

systemd

To Reproduce

  1. Install Fedora Linux using an LVM+encryption setup
  2. Generate a dracut image using --hostonly --hostonly-cmdline
  3. Optionally use systemd-cryptenroll to enroll a TPM2-based key to avoid typing the LUKS key every time
  4. Configure a bootloader entry that uses the generated hostonly image
  5. Ensure there are no root= or systemd.gpt_auto=0 parameters passed directly to the kernel by the bootloader. An empty cmdline/options should actually work fine with a hostonly generated image
  6. Reboot
  7. Wait until the gpt-auto-root device times out
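
Steps 2 and 3 above could be scripted roughly as follows. This is a minimal sketch, not the exact commands used for this report: the image path, the `LUKS_DEV` variable and the `run` helper are assumptions made for illustration, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run sketch of steps 2 and 3. Nothing is executed unless DRY_RUN=0.

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Assumed values, not taken from the report: adjust for the machine under test.
KVER="${KVER:-$(uname -r)}"
IMG="${IMG:-/boot/initramfs-hostonly-test.img}"   # hypothetical image path
LUKS_DEV="${LUKS_DEV:-}"                          # LUKS block device; empty skips step 3

build_hostonly_image() {
    # Step 2: host-only image with the host-only kernel command line stored in it.
    run dracut --force --hostonly --hostonly-cmdline "$IMG" "$KVER"

    # Step 3 (optional): enroll a TPM2-bound key so the passphrase is not needed.
    if [ -n "$LUKS_DEV" ]; then
        run systemd-cryptenroll --tpm2-device=auto "$LUKS_DEV"
    fi
}

build_hostonly_image
```

The bootloader entry from steps 4 and 5 is left out on purpose, since it depends on the bootloader in use.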

Expected behavior

System enters emergency mode and emergency shell is opened.

Observed behavior

System enters emergency mode but immediately exits it. No shell is opened. Boot process is stuck.

Additional context

I have tested with different distributions and the issue is more or less the same. Initially there was a noticeable difference: Ubuntu goes into a loop, waiting for gpt-auto-root (and failing) over and over again, whereas Fedora simply hangs after the first timeout. However, when I tried to collect logs from the serial console and added console=tty0 console=ttyS0 to the kernel command line, Fedora started behaving like Ubuntu (the loop runs on tty0, but output on ttyS0 ends at the first timeout). These details are just FYI and out of scope for this issue.

Logs

fedora.console.log

I would appreciate any hints on how to debug the issue further, in particular how to find out why emergency.target and emergency.service are stopped.

RomanValov avatar Dec 15 '24 06:12 RomanValov

Why boot is stuck

As described in the related issue (https://github.com/dracut-ng/dracut-ng/issues/1062), two systemd generators produce conflicting systemd configuration:

From systemd-gpt-auto-generator:

dev-gpt\x2dauto\x2droot.device:
	RequiredBy: initrd-root-device.target
		OnFailure: emergency.target
		WantedBy: initrd.target

emergency.target:
	Requires: emergency.service
	After: emergency.service

emergency.service:
	ExecStart: /bin/dracut-emergency
	ExecStopPost: -/usr/bin/systemctl --no-block default

From dracut-rootfs-generator:

dev-{root}.device:
	WantedBy: initrd.target
	RequiredBy: sysroot.mount
		RequiredBy: initrd-root-fs.target
			RequiredBy: initrd-parse-etc.target
			WantedBy: initrd.target
			WantedBy: initrd-switch-root.target


initrd-parse-etc.service:
	After: initrd-root-fs.target
	ExecStart: systemctl --no-block start initrd-cleanup.service

initrd-cleanup.service:
	After: initrd-root-fs.target
	After: initrd.target
	ExecStart: `

The full configuration dump can be found in the archive: systemd.zip (see systemd-analyze.dump.log)

Step-by-step boot flow:

  1. dev-{root}.device generated by dracut is found
  2. sysroot.mount is then triggered and initrd-root-fs.target is reached
  3. initrd-parse-etc.target and initrd-parse-etc.service start
  4. initrd-parse-etc.service asynchronously starts initrd-cleanup.service
  5. At this point initrd-cleanup.service is waiting for initrd.target
  6. dev-gpt\x2dauto\x2droot.device keeps waiting until the timeout occurs
  7. initrd-root-device.target and initrd.target fail because of that dependency
  8. The failure causes emergency.target and emergency.service to be started
  9. At the same time, initrd-cleanup.service is unblocked by the initrd.target failure
  10. It runs systemctl --no-block isolate initrd-switch-root.target to continue booting into the system
  11. The isolate to initrd-switch-root.target causes all unneeded units (including emergency.service) to be stopped
  12. emergency.service runs its ExecStopPost stanza, which is: /usr/bin/systemctl --no-block default
  13. This actually isolates to initrd.target, which now stops initrd-switch-root.target and restarts the loop

In short, there are two isolate operations, one driving towards system boot and the other towards an initqueue restart, and each tries to stop the other.

RomanValov avatar Dec 24 '24 12:12 RomanValov

Although the issue occurs because of conflicting systemd configuration, I consider that only a condition for reproducing the real problem: the inability to spawn the emergency shell.

In particular, initrd-switch-root.target is designed to wait for the emergency shell to complete, and the isolate should not proceed until the user exits the shell:

initrd-switch-root.target:
	After: emergency.target
	After: emergency.service

It looks like this was broken by commit 4c2d98c75b0dd3dad45430becb78c9d40bc6be1b, where the type of emergency.service was changed from Type=oneshot to Type=idle. The type was changed on purpose. When I reverted it to Type=oneshot, the boot flow correctly opened the shell, but when the shell exits the boot flow gets stuck. Here is a log of the boot process in that case: fedora.oneshot.log

However, I was able to work around the hang by adding ordering stanzas to initrd-cleanup.service:

initrd-cleanup.service:
	After: emergency.target
	After: emergency.service

Now the boot flow continues after the emergency shell exits. I still see error messages, however. Here is a log of the boot process: fedora.cleanup.log
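
The ordering workaround for initrd-cleanup.service could also be carried as a drop-in instead of editing the unit file in place. The sketch below only stages such a drop-in in a local directory. The `OVERLAY` directory and the `10-after-emergency.conf` file name are hypothetical, and it assumes that systemd inside the initrd honors drop-ins under /etc/systemd/system as it does on a regular system.

```shell
#!/bin/sh
# Stage a drop-in that orders initrd-cleanup.service after the emergency units.
# OVERLAY and the drop-in file name are hypothetical names for illustration.
OVERLAY="${OVERLAY:-./initrd-overlay}"
DROPIN_DIR="$OVERLAY/etc/systemd/system/initrd-cleanup.service.d"

mkdir -p "$DROPIN_DIR"

# After= in a drop-in is added to the ordering already declared by the unit.
cat > "$DROPIN_DIR/10-after-emergency.conf" <<'EOF'
[Unit]
After=emergency.service emergency.target
EOF
```

Getting the staged tree into the image is a separate step. One option would be dracut's `--include` option, for example `--include ./initrd-overlay /`, but that has not been tested for this report.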

Here is a diff between vanilla and patched initrds:

--- ./initfs.vanilla/usr/lib/systemd/system/emergency.service	2024-12-24 18:12:13.588221788 +0600
+++ ./initfs.patched/usr/lib/systemd/system/emergency.service	2024-12-24 18:23:14.002746317 +0600
@@ -16,7 +16,7 @@ Environment=NEWROOT=/sysroot
 WorkingDirectory=/
 ExecStart=/bin/dracut-emergency
 ExecStopPost=-/usr/bin/systemctl --fail --no-block default
-Type=idle
+Type=oneshot
 StandardInput=tty-force
 StandardOutput=inherit
 StandardError=inherit
--- ./initfs.vanilla/usr/lib/systemd/system/initrd-cleanup.service	2024-12-24 18:12:13.586221803 +0600
+++ ./initfs.patched/usr/lib/systemd/system/initrd-cleanup.service	2024-12-24 18:23:14.000746333 +0600
@@ -13,7 +13,7 @@ DefaultDependencies=no
 AssertPathExists=/etc/initrd-release
 OnFailure=emergency.target
 OnFailureJobMode=replace-irreversibly
-After=initrd-root-fs.target initrd-fs.target initrd.target
+After=initrd-root-fs.target initrd-fs.target initrd.target emergency.service emergency.target
 
 [Service]
 Type=oneshot
--- ./initfs.vanilla/usr/lib/systemd/system/rescue.service	2024-12-24 18:12:13.588221788 +0600
+++ ./initfs.patched/usr/lib/systemd/system/rescue.service	2024-12-24 18:23:14.002746317 +0600
@@ -16,7 +16,7 @@ Environment=NEWROOT=/sysroot
 WorkingDirectory=/
 ExecStart=/bin/dracut-emergency
 ExecStopPost=-/usr/bin/systemctl --fail --no-block default
-Type=idle
+Type=oneshot
 StandardInput=tty-force
 StandardOutput=inherit
 StandardError=inherit
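
To double-check that an unpacked initrd tree really carries the ordering from the diff above, a small helper like the following could be used. It is a sketch only: `has_after` is a hypothetical helper name, it reads just the main unit file under usr/lib/systemd/system and ignores drop-ins, and the directory names are the ones from the diff.

```shell
#!/bin/sh
# has_after TREE UNIT DEP
# Succeeds if UNIT inside the unpacked initrd TREE lists DEP in an After= line.
# Only the main unit file is inspected; drop-in directories are not considered.
has_after() {
    tree=$1
    unit=$2
    dep=$3
    file="$tree/usr/lib/systemd/system/$unit"
    [ -f "$file" ] || return 2
    # Take the values of all After= lines, put one unit name per line,
    # then look for an exact match.
    sed -n 's/^After=//p' "$file" | tr -s ' \t' '\n\n' | grep -Fxq -- "$dep"
}

# Report on the trees from the diff, if they exist in the current directory.
for tree in ./initfs.vanilla ./initfs.patched; do
    if [ -d "$tree" ]; then
        if has_after "$tree" initrd-cleanup.service emergency.service; then
            echo "$tree: initrd-cleanup.service is ordered after emergency.service"
        else
            echo "$tree: ordering not found"
        fi
    fi
done
```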

RomanValov avatar Dec 24 '24 12:12 RomanValov

It looks like this was broken by commit 4c2d98c

CC @mwilck to help with the discussion. Thanks Martin !

LaszloGombos avatar Dec 26 '24 16:12 LaszloGombos

It looks like this was broken by commit 4c2d98c, where the type of emergency.service was changed from Type=oneshot to Type=idle.

Have you looked at the time stamp of that commit? I find it rather unlikely that this 7-year-old commit has caused a regression now.

This was 7 years ago, so maybe systemd's behavior has changed and the "transaction is destructive" error doesn't occur any more. But I won't ack your change unless I see proof that this is the case. While my past use case is not uncommon, it isn't easily recreated on purpose. I guess it can be achieved with crypto if you just don't enter the pass phrase for the root FS, let it time out, and then activate it manually from the emergency shell.

I am wondering what you're trying to achieve. What's the purpose of trying to activate dev-gpt\x2dauto\x2droot.device and failing to do so, when the actual root device is already mounted and the initrd is about to switch root?

mwilck avatar Jan 08 '25 19:01 mwilck

That said, your second hunk (the one that orders initrd-cleanup.service after emergency.service) makes a lot of sense to me.

mwilck avatar Jan 08 '25 19:01 mwilck