telegraf icon indicating copy to clipboard operation
telegraf copied to clipboard

feat(inputs.systemd_units): Allow to query unloaded/disabled units

Open srebhan opened this issue 1 year ago • 7 comments

Summary

Unloaded or disabled units are neither listed with systemctl list-units not systemctl show as systemctl only will work on loaded units (see https://github.com/systemd/systemd/issues/5063) especially with wildcards!

The only way to make this work is to determine the unit-files matching the patterns and then use the exact unit-name to infer further information as this will enforce loading the unit by systemd. Unfortunately, this is not particularly easy using the systemctl command, therefore, this PR basically rewrites the inputs.systemd_units plugin to be based on DBUS communication.

Important note: Currently, multi-instance services (i.e. services containing an @) will not work with this implementation. I'm working on a workaround...

Checklist

  • [x] No AI generated code was used in this PR

Related issues

resolves #14763

srebhan avatar Feb 13 '24 23:02 srebhan

Output with this build looks so good!

[[inputs.systemd_units]]
  pattern = 'telegra* crond.service' 
  subcommand = "show" 

Output for inactive (dead) and disabled service "telegraf" (using wildcards for search pattern) and a running active enabled service "crond":

> systemd_units,active=inactive,host=localhost.localdomain,load=loaded,name=telegraf.service,sub=dead,uf_preset=disabled,uf_state=disabled active_code=2i,load_code=0i,mem_current=9223372036854775807i,pid=0i,restarts=0i,status_errno=0i,sub_code=1i 1707924229000000000
> systemd_units,active=active,host=localhost.localdomain,load=loaded,name=crond.service,sub=running,uf_preset=enabled,uf_state=enabled active_code=0i,load_code=0i,mem_current=1167360i,pid=906i,restarts=0i,status_errno=0i,sub_code=0i 1707924229000000000

1tft avatar Feb 14 '24 15:02 1tft

@1tft and @knollet I would ask you to also check multi-instance units (the ones with an @)! Same for dynamically starting/stopping, enabling/disabling or creating new units while Telegraf is running...

srebhan avatar Feb 14 '24 16:02 srebhan

My tests so far: Monitored at least 2 services same time, using wildcards in pattern. Started, removed, disabled and created new services (e.g. smartd). Killed service too. One time I used also pattern = "*" systemd_units output created always valid metrics. Checked most of fields regarding plausability.

Telegraf log was clean (normal logging mode - no debug on).

systemctl --version systemd 239 (239-68.el8)

Only issue/anomaly: A loaded but inactive service produces as mem_current value a max 64-bit signed integer. Example:

]# systemctl status smartd.service 
● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
   Loaded: loaded (/usr/lib/systemd/system/smartd.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:smartd(8)
           man:smartd.conf(5)

Metrics: systemd_units,active=inactive,host=localhost.localdomain,load=loaded,name=smartd.service,sub=dead,uf_preset=enabled,uf_state=disabled status_errno=0i,mem_current=9223372036854775807i,load_code=0i,sub_code=1i,pid=0i,active_code=2i,restarts=0i 1707942750000000000

I hope to find time for testing @ services tomorrow.

1tft avatar Feb 14 '24 20:02 1tft

Tested again. Now normal "static" services (services without [install] part) and multi instances services.

Telegraf log was clean (normal logging mode - no debug on).

systemctl --version systemd 239 (239-68.el8)

telegraf --version Telegraf 1.30.0-125c0bec (git: pull/14814@125c0bec)

Like yesterday and uncritical: A loaded but inactive service produces as mem_current value with max 64-bit signed integer value.

Maybe changelog and systemd_units README.md should mention that pattern search now relies on unit file name. That means following change regarding seach pattern wildcard use:

When we have a service "lvm2-pvscan@252:2.service": systemctl status lvm2-pvscan@2*

● lvm2-pvscan@252:2.service - LVM event activation on device 252:2
   Loaded: loaded (/usr/lib/systemd/system/[email protected]; static; vendor preset: disabled)
   Active: active (exited) since Thu 2024-02-15 21:00:57 CET; 15min ago
     Docs: man:pvscan(8)
  Process: 1617 ExecStop=/usr/sbin/lvm pvscan --cache 252:2 (code=exited, status=0/SUCCESS)
  Process: 1627 ExecStart=/usr/sbin/lvm pvscan --cache --activate ay 252:2 (code=exited, status=0/SUCCESS)
 Main PID: 1627 (code=exited, status=0/SUCCESS)

The old Telegraf v1.29.4 produces results with following config

[[inputs.systemd_units]]
  pattern = 'lvm2-pvscan@2*
  # subcommand = "show"
  interval = "30s"

systemd_units,active=active,host=localhost.localdomain,load=loaded,name=lvm2-pvscan@252:2.service,sub=exited load_code=0i,active_code=0i,sub_code=4i 1708028550000000000 systemd_units,active=active,host=localhost.localdomain,load=loaded,[email protected],sub=running active_code=0i,sub_code=0i,load_code=0i 1708028550000000000

Now Telegraf v1.30.0 does not produce any output with following config

[[inputs.systemd_units]]
  pattern = 'lvm2-pvscan@2*
  subcommand = "show"
  interval = "30s"

OR


[[inputs.systemd_units]]
  pattern = 'lvm2-pvscan@2*
  # subcommand = "show"
  interval = "30s"

_

User now have to use:

[[inputs.systemd_units]]
  pattern = 'lvm2-pvscan@*
  subcommand = "show"
  interval = "30s"

OR


[[inputs.systemd_units]]
  pattern = 'lvm2-pvscan@*
  # subcommand = "show"
  interval = "30s"

1tft avatar Feb 15 '24 20:02 1tft

@1tft thanks for the finding. I will try to fix the changed behavior...

srebhan avatar Feb 16 '24 10:02 srebhan

Sadly found a serious issue regarding new release telegraf --version Telegraf 1.30.0-7f8e03bf (git: pull/14814@7f8e03bf)

For static service (service without [Install] section) like ● testservice.service - TestService Loaded: loaded (/usr/lib/systemd/system/testservice.service; static; vendor preset: disabled) Active: inactive (dead)

telegraf produces 2 lines of metrics

systemd_units,active=inactive,host=localhost.localdomain,load=stub,name=testservice.service,sub=dead sub_code=1i,load_code=1i,active_code=2i 1708267290000000000
systemd_units,active=inactive,host=localhost.localdomain,load=stub,name=testservice.service,sub=dead load_code=1i,sub_code=1i,active_code=2i 1708267290000000000

After starting testservice.service telegraf produces 2 lines again, and does not recognized that service has been started. Also systemctl daemon-reload was no solution. Installed telegraf-1.30.0~125c0bec-0.x86_64.rpm again, produced immediately correct result. Installed telegraf-1.30.0~7f8e0bec-0.x86_64.rpm again, still produced false results.

Also reproduced issue when removing [Install] section from existing standard service like smartd. Than it was "static" and telegraf produced two false (indentical) metrics for every interval.

My testservice: cat /usr/lib/systemd/system/testservice.service

[Unit]
Description=TestService

[Service]
Type=forking

ExecStart=/bin/bash -c "sleep 2; $(sleep 60) &"

TimeoutStartSec=15
TimeoutStopSec=2

RemainAfterExit=true

Restart=on-failure
RestartSec=60
StartLimitInterval=300
StartLimitBurst=2

1tft avatar Feb 18 '24 14:02 1tft

Thank you very much @1tft for the thorough testing! That's much appreciated and invaluable! Hope I got all the corner cases right this time...

srebhan avatar Feb 19 '24 10:02 srebhan

Thank you for new build! Looks very good, no double lines and service status is recognized correctly again by telegraf.

Regarding mem_current. For static and not running services no mem_current is printed out.

● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
   Loaded: loaded (/usr/lib/systemd/system/smartd.service; static; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:smartd(8)
           man:smartd.conf(5)

systemd_units,active=inactive,host=localhost.localdomain,load=stub,name=smartd.service,sub=dead load_code=1i,active_code=2i,sub_code=1i 1708415340000000000

But for other inactive services mem_current is still printed out:

● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
   Loaded: loaded (/usr/lib/systemd/system/smartd.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:smartd(8)
           man:smartd.conf(5)

systemd_units,active=inactive,host=localhost.localdomain,load=loaded,name=smartd.service,sub=dead,uf_preset=enabled,uf_state=disabled pid=0i,load_code=0i,status_errno=0i,restarts=0i,mem_current=9223372036854775807i,active_code=2i,sub_code=1i 1708414470000000000

 lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
   Loaded: loaded (/usr/lib/systemd/system/lvm2-monitor.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2024-02-20 08:45:59 CET; 19s ago
     Docs: man:dmeventd(8)
           man:lvcreate(8)
           man:lvchange(8)
           man:vgchange(8)
  Process: 1614 ExecStop=/usr/sbin/lvm vgchange --monitor n (code=exited, status=0/SUCCESS)
 Main PID: 701 (code=exited, status=0/SUCCESS)
 
systemd_units,active=inactive,host=localhost.localdomain,load=loaded,name=lvm2-monitor.service,sub=dead,uf_preset=enabled,uf_state=enabled load_code=0i,restarts=0i,mem_current=9223372036854775807i,pid=0i,active_code=2i,sub_code=1i,status_errno=0i 1708415220000000000

1tft avatar Feb 20 '24 07:02 1tft

@1tft thanks for testing this again!

Hmmm I think all of those are a lie if I look at the number... ;-) The issue is that for the disabled units, systemd provides those numbers on request, while we can't get those for the "static" ones. Do you want me to set those fields to zero? I mean in InfluxDB you would get a nil value for that field and can fill it with whatever you need in your calculation...

@powersj any preferences?

srebhan avatar Feb 20 '24 10:02 srebhan

Regarding mem_current: According to dbus-specification there is a dbus-type t which is UNsigned 64bit. https://dbus.freedesktop.org/doc/dbus-specification.html#basic-types

gdbus introspect --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1/unit/ypbind_2eservice (a random service that doesn't run on my system) return lots of stuff including this: readonly t MemoryCurrent = 18446744073709551615;

So I think: telegraf should return an unsigned mem_current if possible.

If can't be returned unsigned, it should be -1 in case of a non-running service or be redacted. (I think I would check if there are marker values for when systemctl show returns [not set] and redact those. The alternative -- having a list of valid data fields for every service status combination -- seems less reasonable to me)

For clarity, I would say:

  • Use the type (unsigned) you get from upstream (dbus) (for me, as a C programmer, LLONG_MAX is really weird)
  • Redact mem_current if it is 2^64. (do so for every other field you gather).
  • Don't have this list:
{
    "running": ["mem_current", ...],
    "exited": [...],
    ...
}

knollet avatar Feb 20 '24 11:02 knollet

@knollet not sure what you mean... We currently return what we get from DBUS without any modification. Regarding redacting/removing fields, I'm not sure if we should do this...

srebhan avatar Feb 20 '24 13:02 srebhan

Actually you don't. I don't know if you programmed anything or if the go dbus library is buggy, but you changed numbers.

Do a gdbus introspect --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop /systemd1/unit/ypbind_2eservice | grep "MemoryCurrent" with a non-running service, and you will see: readonly t MemoryCurrent = 18446744073709551615;

You return for a non-running service mem_current=9223372036854775807i (see above) which is 2^63-1. (signed int max, 0x7FFFFFFFFFFFFFFF) DBus returns 18446744073709551615 which is actually 0xFFFFFFFFFFFFFFFF (2^64-1) typed t (defined in this document: https://dbus.freedesktop.org/doc/dbus-specification.html#basic-types ) which is UNsigned int, not signed.

You are converting it to a signed int for some reason and losing half the number doing it.

If you convert 2^64-1 to signed int, though, a special value which should indicate that the value is actually missing, I would either return unsigned int max, or -1, which would both be the same bitwise: 64 1 bits. The value you currently return is not, it's 0 zero bit and 63 one bits, which is weird.

knollet avatar Feb 20 '24 14:02 knollet

@knollet are you sure you enabled influx_uint_support = true?

srebhan avatar Feb 20 '24 16:02 srebhan

@1tft it may well be that this is the culprit.

knollet avatar Feb 20 '24 16:02 knollet

@knollet Setting influx_uint_support = true at output section prints mem_current=18446744073709551615u instead of 9223372036854775807i

@srebhan So we are OK with everything regarding mem_current handling (current behaviour, 0 or supressing complete field). I think 0 would be better than this very high value 9223372036854775807 / 18446744073709551615, only because for diagram scaling reasons. For InfluxDB its not relevant, because 9223372036854775807 is max... Supressing only mem_current is also not a good solution.

I think Readme.md should mention this special situation, than everybody can decide to build a custom processor for this special value and we do not have to invest any more time on this edge case.

1tft avatar Feb 20 '24 20:02 1tft

One more tweak... I now set the "invalid"/"unset" memory fields to zero. I also noticed that before (with the systemd command) the memory values were provided as signed-integers which is plain wrong as the dbus type is unsigned integer. So I fixed this as the feature was not yet released...

srebhan avatar Feb 23 '24 11:02 srebhan

@1tft could you please give this another (hopefully last) round of testing!?

srebhan avatar Feb 26 '24 11:02 srebhan

Already tested a little bit on weekend and will do it today again.

1tft avatar Feb 26 '24 13:02 1tft

Have done some more tests and found no real issue.

Only one remark. For a disabled and static service (service without [Install] section inside unit file) tags "uf_preset" and "uf_state" and fields "pid" and "status_errno" are not printed out.

Example

● smartd.service - Self Monitoring and Reporting Technology (SMART) Daemon
   Loaded: loaded (/usr/lib/systemd/system/smartd.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:smartd(8)
           man:smartd.conf(5)

systemd_units,active=inactive,host=localhost.localdomain,load=stub,name=smartd.service,sub=dead swap_current=0i,swap_peak=0i,load_code=1i,sub_code=1i,mem_current=0i,active_code=2i,mem_peak=0i,mem_avail=0i 1708977990000000000

1tft avatar Feb 26 '24 20:02 1tft

I think the pid and status_errno cannot be filled, so I would really love to keep those out. For the tags, I have to check if I can read those somehow... I will try to come back to you tomorrow @1tft. Does that work for you?

srebhan avatar Feb 26 '24 20:02 srebhan

Yes, I can test at evening again. I think nobody really misses these fields/tags for a disabled static service but because it is different to the "normal" output I want to mention it.

1tft avatar Feb 27 '24 07:02 1tft

@1tft next trial. I also replaced the subcommand by a details setting because the former does not make sense anymore in the current implementation. Additionally, I renamed uf_state and uf_preset by state and preset to match the systemctl output.

Would appreciate if you are able to test this once again...

srebhan avatar Feb 27 '24 14:02 srebhan

Thank you again! Looks very good. Made some tests, all successful. All systemd_units plugin information is consistent and hamonizes with systemctl status. Will test a little bit more tomorrow forenoon but dont think to find any issues...

1tft avatar Feb 27 '24 19:02 1tft

@srebhan,

You have effectively re-written the plugin. While this gets it in a "cleaner" state for maintaining it, I am concerned about impact to existing users. The breaking-changes you have made I think are all ok, as they technically have not been released beyond nightly builds.

However, could we break this up into a test PR and then a follow-up PR with the larger changes? That way we can ensure existing users are not going to run into any issues?

powersj avatar Feb 27 '24 20:02 powersj

FYI: Tested a little bit more today and found nothing to complain/mention.

1tft avatar Feb 28 '24 21:02 1tft

@srebhan Right after I approved I saw this in the lint-linux failure: 🔴 Git repo dirty. Please run "make docs" and push the updated README. Failing CI. Is this erroneous or does it need to be run?

mstrandboge avatar Mar 04 '24 21:03 mstrandboge