nut icon indicating copy to clipboard operation
nut copied to clipboard

nut-driver-enumerator.sh modifies /etc (and maybe runs too late)

Open bigon opened this issue 3 years ago • 4 comments

Hello,

I know I'm probably late for the party here (I didn't really review the PR for this in time), but I've done the work to update NUT to 2.8.0 in debian (thanks for the work for this new release 🎉) and I see that the enumerator is modifying/adding files in /etc, and I'm wondering why.

Usually, a systemd generator adds files in /run/systemd/<something> at each boot to achieve the same goal.

Doing this on a tmpfs (/run) allows to have a read-only /etc/ partition and makes cleanup easier if the user uninstall the package and probably doesn't require all the checksum logic at all

Also generators also usually live in /lib/systemd/system-generators (and don't require a .service file) otherwise it might run to late during the boot if it's started by a .service

An idea why the current solution was chosen?

My proposal here would be:

  1. Remove nut-driver-enumerator.service and nut-driver-enumerator.path
  2. Adjust nut-driver-enumerator.sh to create the driver .services files in /run/systemd/generator or /run/systemd/generator-late (check which one) and remove the checksum logic
  3. Move nut-driver-enumerator.sh to /lib/systemd/system-generators/

Also, the usage of nut-driver-enumerator.path seems racy to me, what happens if the user saves the file (/etc/nut/ups.conf) in the middle of an edit and that the file is not consistent or not formatted properly. If you remove this, the user will have to run systemctl daemon-reload after the file is edited, that should reduce the risk

bigon avatar Jul 13 '22 07:07 bigon

Cheers, here's to why timely reviews are good :)

For a historic note, the NDE appeared as part of my work on a NUT fork for the 42ITy project (used further in Eaton IPM ecosystem) where NUT has to monitor and manage hundreds of devices from a single installation. This made the original single-unit service approach (for all of NUT or "just" all of its drivers) impractical - whenever anything failed and restarted, they all did, and it could take many minutes to initialize the whole wad. Likewise for addition of newly discovered devices (or removal of those discarded by the DCIM admins). So we went for templated service unit instances, and from that - to a generalized lifecycle approach to maintain the unit definitions vs. changes in ups.conf section headcount and contents (and/or the global options).

In a way this was deemed useful also for smaller-scale users, to have their instantiated services per-driver defined, started, restarted etc. even if they have just one UPS or ePDU to monitor (since different protocols/media have different service dependencies - e.g. networked or not). I believe by default this is all optional - users would manage things "manually" unless they choose to automate, but gotta revise the statement to be sure. My systems do use NDE to simplify rebuilds and redeployments, so I may be foggy about default new-end-user setups.

The script may run once (manually or as a single-shot service upon boot and/or detected path change) or continuously as a daemon (more load, not needed on systems that define an UPS once and do not touch the config for years). At least the latter use-case, and possibly other use-cases too, does jump through hoops to check if the config file was changed after the moment when we started to inspect it - by the time we finished; the processing would run again if the user (or 42ITy discovery component) edits the main ups.conf file while we are applying it.

For a practical fix, good ideas that retain the existing functionality (generally for both systemd and SMF, eventually might be more such frameworks) and add/fix some for new use-cases are welcome :)

In particular, at the time this was coded, /etc was deemed writable (it stores "this" system's configuration data after all, and we do edit the /etc/nut/ups.conf to define or modify driver sections, right?) and persistent (so on reboot we do not have to fully revise and redefine all units when no driver sections had changed - and this is something we can check more "cheaply" than a full inspection, to the point of trusting FS timestamps of extracted sections vs. main config file). I suppose it can be at least an option for the script (and services wrapping it) - where it would store the files it writes, both for instantiated units (/etc or /run) and for the excerpted config sections (someplace in /var if not in /etc?).

  • Non-persistent excerpts would probably require the full parse at boot to decide what unit instances we need to define (or re-define if those are persistent).
  • Non-persistent unit definitions are a lag for monitoring during boot (when does NDE first run at all, and how long it takes to parse the config)
  • Parsing hundreds of sections and making unit instances from that can take very non-trivial time, something worth avoiding for every boot of a larger-scale monitoring system.

Overall, the experience with systemd was fun but disappointing, we found too many ways to get it stuck and puzzled (including by calls to daemon-reload), and 42ITy included many belts-and-suspenders around it to get something really reliably working :\

jimklimov avatar Jul 13 '22 18:07 jimklimov

FWIW, there was a PR started at #682 to optimize some scripted logic in NDE, which was not finished to my satisfaction (IIRC I had some more code locally for testing, but was switched to another task so was not certain if/how-well it worked). Not sure if it addressed any of the concerns you raised, but at least is worth mentioning (and I do hope to finish that PR someday) :)

jimklimov avatar Jul 13 '22 22:07 jimklimov

Thanks for your reply

systemd generator are described here (full doc)

Generator are running really early during the boot, earlier than .services started by the multi-user target

Creating the units is just a matter of reading the configuration file and creating symlinks on a tmpfs filesystem, would that take that long? Even with a lot of devices?

bigon avatar Jul 19 '22 18:07 bigon

Well, it is not just symlinks - at least not to one file (might make sense to deliver a dozen files for different driver types) since various media needs different prerequisite service units. Currently this is handled by NDE generating systemd drop-in files that layer over a "standard" [email protected] unit definition, and similar trickery done for SMF. Either way, to decide which sort of unit is needed, the config needs to be parsed and current script is not too fast about that for huge setups, more so on embedded systems. I think the stalled PR that I hope to come back to, did have some changes not merged/tested yet that speed it up for some systems (e.g. "dumb POSIX" vs "bash" run-time choices give a 30x factor of speed change). Doing so for maybe even some minutes once is bearable; doing it on every boot - of a DCIM monitoring station no less while its monitored herd is recovering from power loss - may be unwise.

Depending on hardware and other setup, there will be dependency trees we can not cater for with whatever we script out of the box, so some users will need persistent setups (maybe mixing their drop-ins via /etc/... with something that a generator provides in /run/... might work).

Like recently there was a discussion of a dummy or clone driver that represents another device from the same box. There will be a "real" device driver needed by upsd to start, and a replicator driver which needs that real driver and upsd to serve its data points, to start by itself. And then the same upsd needs to represent it too. At least, current version of upsd has an option to start (not bail out) without all driver sections defined, so a unit reload would slurp them into the running daemon; I think it should have been able to start with a defined but currently unavailable driver in earlier versions too. This is just an example of something people would in reality want to set up, and our mechanisms should not preclude or override that.

For a majority of cases, I suppose that NDE being optional and probably executed once (as a script, not service) during initial setup of an UPS and leaving the drop-in files in /etc that linger for years until the UPS is replaces, much of this discussion may be moot - and special deployments would somehow amend or override their packaging like 42ity did.

jimklimov avatar Jul 23 '22 09:07 jimklimov