dracut-ng
Dracut immediately stops emergency.service when emergency.target is invoked by gpt-auto-root timeout
Describe the bug
I was debugging weird systemd-gpt-auto-root behavior where it prevents dracut hostonly images from booting, and ran into this bug when the gpt-auto-root device timed out. systemd is supposed to enter emergency mode in that case. According to the log messages, emergency.target and emergency.service are actually started but then immediately stopped, and no emergency shell is opened.
Distribution used / Dracut version
Fedora 41 (dracut 103-3.fc41), Ubuntu 24.10 (103-1ubuntu3), Debian testing (105-2)
Init system
systemd
To Reproduce
- Install Fedora Linux using an LVM+encryption setup
- Generate a dracut image using --hostonly --hostonly-cmdline
- Optionally use systemd-cryptenroll to enroll a TPM2-based key to avoid typing the LUKS key every time
- Configure a bootloader entry that uses the generated hostonly image
- Ensure there are no root= or systemd.gpt_auto=0 parameters passed directly to the kernel by the bootloader. An empty cmdline/options should actually work fine with a hostonly generated image
- Reboot
- Wait until the gpt-auto-root device times out
Expected behavior
System enters emergency mode and emergency shell is opened.
Observed behavior
System enters emergency mode but immediately exits it. No shell is opened. Boot process is stuck.
Additional context
I have tested with different distributions and the issue is more or less the same. Initially there was a noticeable difference: Ubuntu goes into a loop, waiting for gpt-auto-root (and failing) over and over again, whereas Fedora just gets stuck after the first timeout. However, when I attempted to collect logs from the serial console and added console=tty0 console=ttyS0 to the kernel command line, Fedora started behaving similarly to Ubuntu (the loop runs on tty0, but output on ttyS0 ends with the first timeout). Anyway, these details are just FYI and out of scope for this issue.
Logs
I would appreciate any hints on how to debug the issue further, in particular on how to find the reason why emergency.target/emergency.service are stopped.
Why boot is stuck
As described in the related issue (https://github.com/dracut-ng/dracut-ng/issues/1062), there are two systemd generators which produce conflicting systemd configuration:
From systemd-gpt-auto-generator:
dev-gpt\x2dauto\x2droot.device:
RequiredBy: initrd-root-device.target
OnFailure: emergency.target
WantedBy: initrd.target
emergency.target:
Requires: emergency.service
After: emergency.service
emergency.service:
ExecStart: /bin/dracut-emergency
ExecStopPost: -/usr/bin/systemctl --no-block default
From dracut-rootfs-generator:
dev-{root}.device:
WantedBy: initrd.target
RequiredBy: sysroot.mount
RequiredBy: initrd-root-fs.target:
RequiredBy: initrd-parse-etc.target
WantedBy: initrd.target
WantedBy: initrd-switch-root.target
initrd-parse-etc.service:
After: initrd-root-fs.target
ExecStart: systemctl --no-block start initrd-cleanup.service
initrd-cleanup.service:
After: initrd-root-fs.target
After: initrd.target
ExecStart: `
The full configuration dump can be found in the archive: systemd.zip (check systemd-analyze.dump.log)
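To compare what each generator contributes, it can help to pull the dependency lines for a single unit out of that dump. A minimal sketch, assuming the dump follows the usual layout of a `-> Unit NAME:` header followed by indented `Key: value` lines; the function name `unit_deps` is hypothetical:

```shell
#!/bin/sh
# unit_deps DUMP_FILE UNIT KEY
# Print the values of KEY (e.g. After, RequiredBy, OnFailure) listed for UNIT
# in a saved `systemd-analyze dump` output.
# Assumption: each unit starts with a line "-> Unit NAME:" and its properties
# follow as indented "Key: value" lines.
# UNIT and KEY are passed through the environment rather than `awk -v`, because
# -v would interpret the "\x2d" escapes in names like dev-gpt\x2dauto\x2droot.device.
unit_deps() {
    DUMP_UNIT="$2" DUMP_KEY="$3" awk '
        /^-> Unit / { in_unit = ($0 == ("-> Unit " ENVIRON["DUMP_UNIT"] ":")); next }
        in_unit && $1 == (ENVIRON["DUMP_KEY"] ":") { print $2 }
    ' "$1"
}
```

For example, `unit_deps systemd-analyze.dump.log initrd-switch-root.target After` should list the units that initrd-switch-root.target is ordered after, one per line.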
Step-by-step boot flow:
- dev-{root}.device generated by dracut is found
- sysroot.mount is then triggered and initrd-root-fs.target is reached
- initrd-parse-etc.target and initrd-parse-etc.service are started
- initrd-parse-etc.service asynchronously starts initrd-cleanup.service
- At this point initrd-cleanup.service is waiting for initrd.target
- dev-gpt\x2dauto\x2droot.device keeps waiting until the timeout occurs
- initrd-root-device.target and initrd.target fail due to the dependency
- The failure causes emergency.target and emergency.service to be started
- At the same time initrd-cleanup.service is unblocked by the initrd.target failure
- It runs systemctl --no-block isolate initrd-switch-root.target to continue booting into the system
- The isolate to initrd-switch-root.target causes all unneeded units (including emergency.service) to be stopped
- emergency.service runs its ExecStopPost stanza, which is: /usr/bin/systemctl --no-block default
- That actually does an isolate to initrd.target, which now stops initrd-switch-root.target and restarts the loop
In short, there are two isolate operations: one drives towards system boot and the other towards an initqueue restart, and they keep trying to stop each other.
Although the issue is triggered by conflicting systemd configuration, I consider that conflict to be only a precondition for reproducing the actual problem: the inability to spawn an emergency shell.
In particular, initrd-switch-root.target is designed to wait for the emergency shell to complete, and the isolate should not proceed until the user exits the shell:
initrd-switch-root.target:
After: emergency.target
After: emergency.service
It looks like it was broken by commit 4c2d98c75b0dd3dad45430becb78c9d40bc6be1b, where the type of emergency.service was changed from Type=oneshot to Type=idle. The type was changed on purpose. When I reverted it to Type=oneshot, the boot flow correctly opened the shell, but when the shell exits the boot flow gets stuck. Here is a log of the boot process in that case: fedora.oneshot.log
However, I was able to work around the hang by adding ordering stanzas to initrd-cleanup.service:
initrd-cleanup.service:
After: emergency.target
After: emergency.service
Now the boot flow continues after the emergency shell exits. I still see error messages, however. Here is a log of the boot process: fedora.cleanup.log
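The same ordering could also be expressed as a drop-in rather than by editing the unit file in place. A minimal sketch, assuming a drop-in is acceptable for testing; the function name `write_cleanup_dropin` and the file name `10-after-emergency.conf` are hypothetical, and getting the file into the initrd image is not covered here:

```shell
#!/bin/sh
# write_cleanup_dropin ROOT_DIR
# Write a drop-in under ROOT_DIR that orders initrd-cleanup.service after the
# emergency units, mirroring the After= stanzas listed above.
# Assumption: ROOT_DIR is a staging directory (or an unpacked initrd tree);
# nothing is installed into the running system.
write_cleanup_dropin() {
    dropin_dir="$1/etc/systemd/system/initrd-cleanup.service.d"
    mkdir -p "$dropin_dir" || return 1
    # After= in a drop-in is added to the orderings the unit already declares.
    cat > "$dropin_dir/10-after-emergency.conf" <<'EOF'
[Unit]
After=emergency.service emergency.target
EOF
}
```

Calling `write_cleanup_dropin /tmp/initfs.patched` would create the drop-in under that tree without touching the vanilla unit file, which keeps the change easy to diff and to remove.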
Here is a diff between vanilla and patched initrds:
--- ./initfs.vanilla/usr/lib/systemd/system/emergency.service 2024-12-24 18:12:13.588221788 +0600
+++ ./initfs.patched/usr/lib/systemd/system/emergency.service 2024-12-24 18:23:14.002746317 +0600
@@ -16,7 +16,7 @@ Environment=NEWROOT=/sysroot
WorkingDirectory=/
ExecStart=/bin/dracut-emergency
ExecStopPost=-/usr/bin/systemctl --fail --no-block default
-Type=idle
+Type=oneshot
StandardInput=tty-force
StandardOutput=inherit
StandardError=inherit
--- ./initfs.vanilla/usr/lib/systemd/system/initrd-cleanup.service 2024-12-24 18:12:13.586221803 +0600
+++ ./initfs.patched/usr/lib/systemd/system/initrd-cleanup.service 2024-12-24 18:23:14.000746333 +0600
@@ -13,7 +13,7 @@ DefaultDependencies=no
AssertPathExists=/etc/initrd-release
OnFailure=emergency.target
OnFailureJobMode=replace-irreversibly
-After=initrd-root-fs.target initrd-fs.target initrd.target
+After=initrd-root-fs.target initrd-fs.target initrd.target emergency.service emergency.target
[Service]
Type=oneshot
--- ./initfs.vanilla/usr/lib/systemd/system/rescue.service 2024-12-24 18:12:13.588221788 +0600
+++ ./initfs.patched/usr/lib/systemd/system/rescue.service 2024-12-24 18:23:14.002746317 +0600
@@ -16,7 +16,7 @@ Environment=NEWROOT=/sysroot
WorkingDirectory=/
ExecStart=/bin/dracut-emergency
ExecStopPost=-/usr/bin/systemctl --fail --no-block default
-Type=idle
+Type=oneshot
StandardInput=tty-force
StandardOutput=inherit
StandardError=inherit
It looks it was broken with the commit 4c2d98c
CC @mwilck to help with the discussion. Thanks Martin !
It looks it was broken with the commit 4c2d98c where type of emergency.service were changed from Type=oneshot to Type=idle.
Have you looked at the time stamp of that commit? I find it rather unlikely that this 7-year old commit has caused a regression now.
This was 7 years ago, so maybe systemd's behavior has changed and the "transaction is destructive" error doesn't occur any more. But I won't ack your change unless I see proof that this is the case. While my past use case is not uncommon, it isn't easily recreated on purpose. I guess it can be achieved with crypto if you just don't enter the pass phrase for the root FS, let it time out, and then activate it manually from the emergency shell.
I am wondering what you're trying to achieve. What's the purpose of trying to activate dev-gpt\x2dauto\x2droot.device and failing to do so, when the actual root device is already mounted and the initrd is about to switch root?
That said, your second hunk (the one that orders initrd-cleanup.service after emergency.service) makes a lot of sense to me.