nut icon indicating copy to clipboard operation
nut copied to clipboard

nut-driver-enumerator generates systemd service that pulls deprecated systemd-udev-settle

Open yuv420p10le opened this issue 1 year ago • 3 comments

Using Arch Linux, followed their wiki's guide for setting NUT: https://wiki.archlinux.org/title/Network_UPS_Tools

Eaton 5E 2000i USB. I applied the usbhid quirks kernel parameter.

After enabling the NUT enumerator to automatically generate, enable and start the UPS systemd service, my boot times have significantly slowed down:

❯  ~ systemd-analyze critical-chain     
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @17.814s
└─multi-user.target @17.814s
  └─nut.target @15.512s
    └─nut-monitor.service @15.442s +69ms
      └─nut-server.service @15.346s +92ms
        └─nut-driver.target @15.344s
          └─[email protected] @11.243s +4.100s
            └─basic.target @11.241s
              └─dbus-broker.service @11.216s +23ms
                └─dbus.socket @11.201s
                  └─sysinit.target @11.194s
                    └─systemd-udev-settle.service @364ms +10.829s
                      └─systemd-udev-trigger.service @309ms +53ms
                        └─systemd-udevd-kernel.socket @294ms
                          └─system.slice @257ms
                            └─-.slice @257ms

My journal shows:

Sep 24 09:25:19 yuv420p10le systemd[1]: Starting Wait for udev To Complete Device Initialization...
Sep 24 09:25:19 yuv420p10le udevadm[633]: systemd-udev-settle.service is deprecated. Please fix [email protected] not to pull it in.

This in the enumerator script seems to be causing that: https://github.com/networkupstools/nut/blob/7dbf605e82ccc89a1cf0aebf2e174dcadf5b0a93/scripts/upsdrvsvcctl/nut-driver-enumerator.sh.in#L112

yuv420p10le avatar Sep 24 '24 07:09 yuv420p10le

Thanks! Looking at it, specifically at https://www.freedesktop.org/software/systemd/man/latest/systemd-udev-settle.service.html - I am not sure if there is an easy way forward. IIRC, the idea behind that dependency was to have NUT (USB-capable) driver programs start only after the OS has some certainty about available hardware (and supposedly assigns udev-based permissions), and this unit seems to be just about that.

Can you try commenting away that dependency in your deployment and see if it works faster? I suppose in the field, if something is not yet discovered/assigned properly, a driver start-up attempt would fail and another (resuscitation) would succeed when everything settles on its own.

In the graph above, systemd-udev-settle.service is not close to nut-driver@* so maybe something else asks for it too?

jimklimov avatar Sep 24 '24 07:09 jimklimov

Can you try commenting away that dependency in your deployment and see if it works faster? I suppose in the field, if something is not yet discovered/assigned properly, a driver start-up attempt would fail and another (resuscitation) would succeed when everything settles on its own.

Boot is indeed almost instant when I get rid of the system-udev-settle dependency. However I get this error in the journal from nut-driver:

libusb1: Could not open any HID devices: insufficient permissions on everything

Starting the service by hand afterwards does work. Perhaps there might be an issue with the Eaton 5E specifically which causes it to be loaded slowly?

In the graph above, systemd-udev-settle.service is not close to nut-driver@* so maybe something else asks for it too?

Doesn't seem to be the case; I grepped /etc/ for udev-settle and the file generated by nut-driver-enumerator is the only one containing it

Regardless, I'm not too upset about the slower boot time. I don't reboot that often (unless troubleshooting anything else) and I'd rather have my components not get fried in case of some power outage :) The use of the deprecated system-udev-settle service might be worth looking at, however. I'm not versatile enough with Linux yet to suggest a solution unfortunately

yuv420p10le avatar Sep 24 '24 08:09 yuv420p10le

One more idea to try (probably not a good one for "production", rather to see if there are visible differences):

Keep the systemd-udev-settle.service in nut-driver@* as generated, but change the nut.target from [Install] // WantedBy=multi-user.target to add a [Unit] // Wants=multi-user.target (may be an additional line).

And systemctl daemon-reload ; systemctl disable nut.target ; systemctl enable nut.target to recreate the symlinks.

The idea with this experiment is to delay NUT start-up until after the interactive system milestone is achieved (assuming it starts well at all). It has been a while, I am not sure which exact criteria systemd has to start units and targets beside making them wanted/required by common system units, so as to have a reason for NUT to start at all. I mean, manual systemctl enable probably does achieve that, but for systemd-presets configuration IIRC an explicit chain was needed.

Building on this experiment, if we stop at a half-way point, only commenting away the WantedBy but not adding the Wants line, maybe it would achieve the best of both worlds at least with the explicit systemctl enable nut.target - NUT target would start After=local-fs.target is achieved, essentially, and cause the start-up of other NUT units, among which the USB-capable driver would start after udev is ready and settled, and would not be a blocker for the interactive system to start.

Overall, the goal is to have system intelligent power protection starting as early and running as long as possible.

However I get this error in the journal from nut-driver: libusb1: Could not open any HID devices: insufficient permissions on everything Starting the service by hand afterwards does work.

More udev- and device-centric thoughts below. A specific odd item for me here is that I would expect the driver unit to Restart=always indefinitely while it is enabled, trying each RestartSec=15s, so even if it failed to start initially - it should have picked up a bit later on.

I suppose your "starting by hand" took place after more than that time since the initial failure to start (needing time for login, diagnosis and all that)?

Is there anything interesting in journalctl -lu [email protected], or generally in journalctl -xl (system-wide log, must be requested as root) to see the progression of events and decisions systemd made about (re-)starting (or not) certain units - e.g. failed other prerequisites and dependencies? Theoretising here, if perhaps the NUT driver (via libusb) poking into devices could upset initialization of udev, and so a restart attempt was deemed not feasible, or something (I guess should have propagated to try and restart all failed deps, but I digress).

Also, systemctl show [email protected] (when the OS booted and failed to start the driver) would expose dozens or hundreds of lines from what the systemd daemon thinks about the unit. I have no particular idea what to look for there, but maybe something would seem relevant to the failure.

Perhaps there might be an issue with the Eaton 5E specifically which causes it to be loaded slowly?

Could be; however the documentation linked above suggests that "udev settling" waits for all hardware detection, not just the one item you might be interested in, and timing depends on the performance of the weakest link in the chain. Part of why it is not recommended for use, I guess.

You can add debugging verbosity to the driver unit, see https://github.com/networkupstools/nut/wiki/Changing-NUT-daemon-debug-verbosity - to check which devices it sees during reboot, if the Eaton USB IDs (likely 0x0463:*) are there. This way, maybe we can discern if it is a problem of device not yet being seen by the OS (e.g. controller booting after USB signal, dunno), vs. udev not yet having assigned permissions needed for the unprivileged NUT driver process to access the device node.

Alternatively, a user=root in ups.conf section for the device could effectively bypass concerns about those permissions, at the risk/cost of running a privileged daemon.

jimklimov avatar Sep 24 '24 11:09 jimklimov

@yuv420p10le : Cheers, were you able to follow up on any of the experiments proposed above?

jimklimov avatar Jan 03 '25 10:01 jimklimov

Thanks! Looking at it, specifically at https://www.freedesktop.org/software/systemd/man/latest/systemd-udev-settle.service.html - I am not sure if there is an easy way forward. IIRC, the idea behind that dependency was to have NUT (USB-capable) driver programs start only after the OS has some certainty about available hardware (and supposedly assigns udev-based permissions), and this unit seems to be just about that.

raspberry pi debian bookworm says: deprecated

arch-linux man page says: Using this service is not recommended. There can be no guarantee that hardware is fully discovered at any specific time, because the kernel does hardware detection asynchronously, ...

so, this unit is more of a 'wish it was that easy' leaving this in might be more misleading than helpful on debian and arch-linux

danieltrentman avatar Jan 30 '25 01:01 danieltrentman

Why calling systemd-udev-settle.service as dependency which blocks whole system boot and not using 'udevadm settle' command instead:

[Unit]
Description=Network UPS Tools - device driver for NUT device 'nutdev1'
Wants=systemd-udev.service
After=systemd-udev.service
[Service]
ExecStartPre=udevadm settle

axet avatar Apr 09 '25 12:04 axet

Looking at it, as one potential issue there was a typo: there is no systemd-udev.service but there's a systemd-udevd.service.

From posts above, where systemd-udev-settle.service @364ms +10.829s seems to take almost 11 seconds and does a useful job - with the service out of the loop, "However I get this error in the journal from nut-driver: libusb1: Could not open any HID devices: insufficient permissions on everything; Starting the service by hand afterwards does work".

So the main problem is it apparently blocking the boot due to Before=sysinit.target requested in systemd-udev-settle.service.

As for such ExecStartPre: maybe makes sense, as each copy of udevadm settle should attach to the bus and wait for end of outstanding device-processing (or a timeout)... but mostly it seems wasteful: on systems with many NUT drivers there would be as many extra processes running for no good reason and just consuming RAM and some CPU cycles. On small/embedded systems which monitor a lot of UPSes over some USB hub this can be a considerable overhead.

Maybe a sort of win-win solution could be to define and ship a clone of /lib/systemd/system/systemd-udev-settle.service tailored for NUT use - so there would be at most one "overhead" process for specifically NUT programs to wait for, not blocking other services of the system via sysinit.target. And maybe with a different "Type" of life-cycle, so subsequent device re-plugging (or USB bus reset) can be handled by a new wait for udev operations to settle before NUT drivers start.

One more thing that looks concerning in the top post, maybe I misunderstood something - but it seems (per timestamps) that the system only becomes interactive after nut.target completes initialization?

graphical.target @17.814s
└─multi-user.target @17.814s
  └─nut.target @15.512s
...

It may well be that I misunderstood something, but the unit file deliberately only says WantedBy=multi-user.target (e.g. there is no Before=multi-user.target) which I assume to mean that attempts to activate multi-user.target would try once to activate nut.target and then forget about it. Not Require that it even succeeds. Not wait to proceed After it says it is ready. In turn, nut.target pulls whatever is enabled of the drivers, data server, monitoring client...

So the interactive aspect of the system I think should be starting in parallel to and independently from NUT service start-up (notably, drivers can take a while to establish dialog with the hardware). Of course, at some point these unit dependency graphs intersect at a common ancestor or several, but the time specifically NUT takes to start up is intended not to be part of the time that your ability to log into the system is delayed for (modulo contention on one CPU, but still - there's context-switching for multi-tasking OSes).

jimklimov avatar Apr 09 '25 14:04 jimklimov

You can name it as nut-settle.service

axet avatar Apr 09 '25 15:04 axet

Would changes in #2891 seem agreeable? Would you fancy testing it before this becomes part of imminent NUT v2.8.3 release, please? :)

You can follow https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests with, roughly:

:; git clone https://github.com/jimklimov/nut -b issue-2638 nut-issue-2638
:; cd nut-issue-2638
:; ./ci_build.sh inplace && sudo make install-as-root

jimklimov avatar Apr 09 '25 16:04 jimklimov

It never been a good idea to run commands as "make install-as-root". I knew it. And Debian Trixie shows a lot of systemd errors after that commands. Next time please add make deb so I can be safely removed later.

Total boot time become worse from about 50 seconds (with systemd-udev-settle) to 2 minutes:

axet@axet-desktop:~$ systemd-analyze 
Startup finished in 10.075s (firmware) + 554ms (loader) + 8.077s (kernel) + 2min 4.555s (userspace) = 2min 23.263s 
graphical.target reached after 2min 4.554s in userspace.
axet@axet-desktop:~$ systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @2min 4.554s
└─power-profiles-daemon.service @2min 4.530s +24ms
  └─multi-user.target @2min 4.526s
    └─munin-rrdcached.service @2min 4.471s +53ms
      └─munin-node.service @2min 3.053s +1.417s
        └─network-online.target @2min 2.985s
          └─NetworkManager-wait-online.service @10.966s +3.235s
            └─NetworkManager.service @10.630s +327ms
              └─dbus.service @10.487s +103ms
                └─basic.target @10.442s
                  └─systemd-pcrphase-sysinit.service @10.390s +46ms
                    └─sysinit.target @10.345s
                      └─apparmor.service @2.709s +7.633s
                        └─local-fs.target @2.708s
                          └─var-tmp.mount @2.703s +4ms
                            └─swap.target @1.141s
                              └─dev-zram0.swap @946ms +193ms
                                └─[email protected] @907ms +37ms
                                  └─dev-zram0.device @880ms
axet@axet-desktop:~$ 

axet avatar Apr 09 '25 16:04 axet

Would be nice to have "make uninstall-as-root" command, which wipes all changes it made, files it create, services it enabled.

EDIT: second run of 'evil' install-as-root command failed, uninstall-as-root is required:

make  install-data-hook
make[6]: Entering directory '/media/axet/128GB/local/nut-issue-2638/scripts/python'
ln: failed to create symbolic link 'fr/fr_FR.UTF-8.po': File exists
make[6]: *** [Makefile:1361: install-data-hook-app-locale-symlinks] Error 1
make[6]: Leaving directory '/media/axet/128GB/local/nut-issue-2638/scripts/python'
make[5]: *** [Makefile:1276: install-data-am] Error 2
make[5]: Leaving directory '/media/axet/128GB/local/nut-issue-2638/scripts/python'
make[4]: *** [Makefile:1222: install-am] Error 2
make[4]: Leaving directory '/media/axet/128GB/local/nut-issue-2638/scripts/python'
make[3]: *** [Makefile:1059: install-recursive] Error 1
make[3]: Leaving directory '/media/axet/128GB/local/nut-issue-2638/scripts/python'
make[2]: *** [Makefile:631: install-recursive] Error 1
make[2]: Leaving directory '/media/axet/128GB/local/nut-issue-2638/scripts'
make[1]: *** [Makefile:1031: install-recursive] Error 1
make[1]: Leaving directory '/media/axet/128GB/local/nut-issue-2638'
make: *** [Makefile:2190: install-as-root] Error 2

axet avatar Apr 09 '25 16:04 axet

After "install-as-root" my usb keyboard and mouse shutdowns after 1-2 minutes and ignores first stroke I have to type first letter twice and click mouse to enable it. dmesg full of audit errors:

https://paste.debian.net/1368511/

axet avatar Apr 09 '25 17:04 axet

Hm, that install flaw was not expected. Probably you would have got it as part of sudo make install too, if there's an issue with symlink creation (so sudo make uninstall for FS changes). Regarding services, for now see the Makefile recipe (it pre-disables units) or that wiki article the recipe grew from.

As for slower startup, network-online happened at 2min into the boot and was a dependency for munin and others up to graphical. So I can't immediately guess how NUT is at fault with this (it may want network too for some NUT units, but is not a dependency for it). So I'd wage there's something external - with cable, access point, etc.

jimklimov avatar Apr 09 '25 20:04 jimklimov

Didn't fiddle with Debian 13 yet, nor with audit, so would need help making some sense of this :)

Is that from SELinux? As there were no manifests contributed for that, I assume its advanced security is unsupported until proven otherwise. There was a discussion touching on it a month ago, but so far that's it.

In any case, can't say the log excerpt is "full of failures", the only ones with res=failed concern the driver unit:

pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=nut-driver@nutdev1 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'

There's really not much info here for me to go guessing the problem from. Is that some audit failure (OS security is concerned)? Or the driver unit startup failure? What does systemctl status nut-driver@nutdev1 say?

jimklimov avatar Apr 09 '25 22:04 jimklimov

As for packaging (make deb etc.) it was on the plate for quite a while, to import recipes from forked project at 42ity NUT. Still backlogged as more useful tasks steal priority of the few hands that actually do something and post PRs.

Main problem is that packaging becomes a cat-and-mouse chasing game when different spellings of dependencies become concerned or their version constraints, as you wedge one recipe to support 20 years of the distro. Or even 5. And its forks (Debian/Ubuntu/400 others in just one ecosystem). Or end up shipping a hundred "reference" recipes.

Certainly not impossible, something that we do want to do (probably via OBS), and something we have completed for a few less dynamic OSes, but I see how it can snowball out of scope. Did packaging for too many distros back in the day, got scary trench tales to tell :D

jimklimov avatar Apr 09 '25 22:04 jimklimov

Hm, it seems the iteration of the PR branch you've tried lacked actual delivery of nut-udev-settle.service file during make install (should be fixed in the updated branch pushed now).

Maybe that did upset the driver unit which couldn't start the service it Wants, although I could not reproduce the behavior locally - mine just started. Also as the systemd-udev-settle is no longer called by NUT, its absence in the call stack may have impacted your startup somehow?.. Check the driver logs/journal to see if there were permission errors, and if they did or did not dissipate over time?

jimklimov avatar Apr 09 '25 22:04 jimklimov

I had to reinstall whole system to fix all issues. I tried 10 reboots - always got 'network-online' delay. It is intall-as-root related. Could be some 'sudo' or 'audit' related timeout hostname resolving after touching systemd. USB mouse and keyboard got power off and stays that state until I click (mouse or keyboard). I could not figure it out it all, never had anything like that before. But easy to reproduce using freshly installed Debian Trixie + install-as-root. I'm not going to try it again tho. If deb is hard to make because of all distro difference, install-as-root even more dangerous should not be recommended.

Some unexpected things still happening. When I manually disable systemd-udev-settle service udisk2 stops to auto mount USB devices at desktop startup. I keep digging into it with fresh debian installation.

axet avatar Apr 10 '25 03:04 axet

Well... I am surprised it misbehaved like this and caused complications, and would like to tender my apologies for those.

However, the recipe in Makefile is a way to automate instructions from https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests which were intensively tried by community for the past year or two - just to avoid error-prone copy-pasting from Wiki, and newly to create/chmod/chown certain paths created by NUT (especially in case the custom build is not overlaying a packaged installation which prepared those already). It does not create the user account though, that's up to end-user's root.

Notably, the Wiki instructions were intensively tried before Debian 13 was a thing, I guess, so maybe there's some change of behavior (bordering on regression then) in the distro.

It would be really interesting to unravel this mystery, so if you could stage another experiment to dissect where it goes sour, it would be great. Something along the lines of:

  • Install Debian 13, make absolutely sure it does not misbehave (time to boot, delay for network-online, kbd/mouse, USB storage...) on its own;
  • Not sure if you installed packaged NUT and built your own as "in-place" replacement, or if you built from scratch (create user/group accounts, create/chmod/chown statepath/pidpath/etc. dirs as in install and security docs, maybe use Debian-style configuration similar to https://github.com/networkupstools/nut/wiki/Building-NUT-on-Debian,-Raspbian-and-Ubuntu monstrosity)
  • Build NUT, sudo make install to deliver just the inert files
    • create/chmod/chown statepath/pidpath/etc. dirs if applicable
    • do not create NUT configs yet
  • Perhaps move away /usr/lib/systemd/system-preset/nut-systemd.preset and reboot at this point, to check if OS behavior changes (new udev rules and tmpfiles configs are delivered and would probably be applied by the reboot, but I can't think of any particular other impact)
  • (Re-)Enable NUT services as in Wiki, adding the new one from this PR; this part:
   sudo systemctl daemon-reload && \
   sudo systemd-tmpfiles --create && \
   sudo systemctl disable nut.target nut-udev-settle.service nut-driver.target nut-monitor nut-server nut-driver-enumerator.path nut-driver-enumerator.service && \
   sudo systemctl enable  nut.target nut-udev-settle.service nut-driver.target nut-monitor nut-server nut-driver-enumerator.path nut-driver-enumerator.service && \
   { sudo systemctl restart udev || true ; } && \
   sudo ./scripts/upsdrvsvcctl/nut-driver-enumerator.sh --reconfigure && \
   sudo systemctl restart nut-driver-enumerator.service nut-monitor nut-server
  • Reboot to check for significant misbehavior after this
    • NUT services would probably quickly fail to start due to lack of configs
  • Populate your NUT configuration files as applicable
    • nut-driver-enumerator should auto-create the [email protected] instance as you edit ups.conf
  • Reboot to check for significant misbehavior after this
    • NUT services should start if configs are correct

jimklimov avatar Apr 10 '25 11:04 jimklimov

Rebased the PR to include some other recent recipe fixes from master, previous iteration was failing on CI without them (got something merged, added that master here, it did not always build well, was fixed there but not here, you know the drill) :\

jimklimov avatar Apr 10 '25 11:04 jimklimov

@axet : during addition of Debian 13 to the build farm per #2893, I found it newly needs apt install systemd-dev for our configure script to auto-locate the directories for unit file installation. So that might also impact things.

jimklimov avatar Apr 10 '25 16:04 jimklimov