Systemd service metrics missing when loaded but disabled

Open hueyg opened this issue 6 years ago • 40 comments

Host operating system: output of uname -a

3.10.0-862.11.6.el7.x86_64 Red Hat Enterprise Linux Server release 7.5 (Maipo)

node_exporter version: output of node_exporter --version

node_exporter, version 0.16.0 (branch: HEAD, revision: d42bd70f4363dced6b77d8fc311ea57b63387e4f) build user: root@a67a9bc13a69 build date: 20180515-15:52:42 go version: go1.9.6

node_exporter command line flags

ExecStart=/home/prometheus/node_exporter/node_exporter --collector.systemd

Are you running node_exporter in Docker?

No

What did you do that produced an error?

Have a custom systemd service defined in /etc/systemd/system for the Keepalived daemon. Running the following query returns the expected results with all five defined states: node_systemd_unit_state{instance="x.x.x.x:9100",name="keepalived.service"}

node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="activating"} 0
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="active"} 1
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="deactivating"} 0
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="failed"} 0
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="inactive"} 0

Once I issue a sudo systemctl stop keepalived.service and run the query again, Prometheus returns nothing. It is as if the service was never defined. If I run the query without filtering on job name, every other service is returned. Once I start the service, the metrics return again.

What did you expect to see?

Expected to continue to see the states returned, but with Active=0 and Inactive=1
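
For illustration, here is a minimal Go sketch of how a single activation state could expand into the five 0/1 series shown above. It is not the collector's actual code; the function name stateSamples and the label layout are assumptions made for this example. For a stopped but still loaded unit it yields the expected active=0 and inactive=1.

```go
package main

import "fmt"

// unitStates lists the five activation states that appear in the query output above.
var unitStates = []string{"activating", "active", "deactivating", "failed", "inactive"}

// stateSamples (hypothetical helper) expands one activation state string
// into one 0/1 value per known state.
func stateSamples(activeState string) map[string]float64 {
	samples := make(map[string]float64, len(unitStates))
	for _, s := range unitStates {
		v := 0.0
		if s == activeState {
			v = 1.0
		}
		samples[s] = v
	}
	return samples
}

func main() {
	// A stopped unit that systemd still reports as loaded.
	samples := stateSamples("inactive")
	for _, s := range unitStates {
		fmt.Printf("node_systemd_unit_state{name=%q,state=%q} %g\n",
			"keepalived.service", s, samples[s])
	}
}
```

The sketch only covers units that systemd actually returns. A unit that is missing from the reply produces no series at all, which matches the behaviour described in this report.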

What did you see instead?

No metrics for the service were returned, period. Nothing. Blank screen. I have a secondary server which I thought was an exact mirror image of the server exhibiting the issue, and it does not experience this problem. Thank you for everyone's time.

hueyg avatar Sep 20 '18 18:09 hueyg

Just to add more information, this appears to be an issue with user defined services only as the default system services do not disappear after stopping.

hueyg avatar Sep 20 '18 21:09 hueyg

That's very strange. The exporter is doing fairly simple requests for data over dbus. Perhaps it's a systemd bug?

Maybe a bug with ListUnitsFiltered dbus request. We could try going back to ListUnits, and filter the loaded ones out in the exporter instead of trusting systemd.

SuperQ avatar Sep 21 '18 07:09 SuperQ

@hueyg Can you verify https://github.com/prometheus/node_exporter/pull/1083 fixes the issue?

SuperQ avatar Sep 21 '18 14:09 SuperQ

@SuperQ Hey Ben, I apologize but I am a pretty ignorant GitHub user. I can see where it looks like you have updated the Go code to perform some more checks on the state of the units defined in systemd. Does this mean that you want me to compile a new version of node_exporter with this updated code and try again?

hueyg avatar Sep 21 '18 15:09 hueyg

@hueyg Yes, if you can check out the code and build it, that would help. Otherwise I can post a binary if you trust me. :grin:

You can follow the standard build instructions but run git checkout superq/systemd_filter before you run make build.

SuperQ avatar Sep 21 '18 15:09 SuperQ

In my test GCE instance of CentOS 7.5, I see this difference in metrics: Before:

node_systemd_units{state="active"} 101
node_systemd_units{state="inactive"} 58

After:

node_systemd_units{state="active"} 101
node_systemd_units{state="inactive"} 75

But I don't see a difference in the number of unique units in node_systemd_unit_state, of which I only see 160. Very strange.

SuperQ avatar Sep 21 '18 15:09 SuperQ

@SuperQ I have no problem if you want to post a compiled binary, but I will work on it in the meantime. I am really pushing to get this resolved because this is a show stopper for the project. It definitely seems related to custom defined services. What has me totally confused is that this problem is not exhibited on what I am pretty sure is an identical secondary server. This is a group of two HAProxy servers with custom units defined for HAProxy and Keepalived.

hueyg avatar Sep 21 '18 15:09 hueyg

@SuperQ I think I found the issue Ben. Give me ten more minutes.

hueyg avatar Sep 21 '18 15:09 hueyg

@SuperQ First, let me apologize if this is a known issue/requirement, but the difference is that the custom defined unit was specifically "enabled" on the working server and "disabled" by default on the non-working server. My limited understanding of systemd is that this simply determines whether the unit is set to start at boot or not. So the "STATE" of the unit file itself plays a role in what node_exporter can see. Once I enable the custom service with sudo systemctl enable haproxy.service, the service is still returned by node_exporter after being issued a stop command.

hueyg avatar Sep 21 '18 16:09 hueyg

Interesting, I thought for sure that even a disabled but running service would show up.

SuperQ avatar Sep 21 '18 16:09 SuperQ

@lucab Do you have any idea why a service like this wouldn't show up in ListUnits():

# systemctl status chronyd.service
● chronyd.service - NTP client/server
   Loaded: loaded (/etc/systemd/system/chronyd.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

SuperQ avatar Sep 21 '18 17:09 SuperQ

@hueyg I did some additional testing, it doesn't seem to matter if the stopped/disabled unit is in /etc/systemd/system or /usr/lib/systemd/system. It fails to show up when stopped/disabled.

SuperQ avatar Sep 21 '18 17:09 SuperQ

@SuperQ I think that's because the unit is inactive and disabled. Additionally, I fear that at some point the ListUnits DBus method may have changed semantics, as its documentation mentions "loaded units" everywhere (or either the doc or the code is wrong). My suggestion would be to try using ListUnitsFiltered instead.

lucab avatar Sep 21 '18 18:09 lucab

@lucab It seems like ListUnitsFiltered() has the same problem. systemctl status says it's loaded, but when we ask for the "loaded" list, inactive/disabled are missing.

SuperQ avatar Sep 21 '18 21:09 SuperQ

I double checked, same problem with ListUnitsFiltered([]string{}). "loaded" but disabled are not returned.

SuperQ avatar Sep 21 '18 21:09 SuperQ

@SuperQ which OS and systemd version are you seeing this on (OP was on RHEL7.5)? I can carry this over to a go-systemd ticket and have a look as soon as I have time.

lucab avatar Sep 25 '18 14:09 lucab

@lucab I was testing with CentOS 7.5 and Ubuntu 18.04 (systemd 237).

Thanks, feel free to ping me on the go-systemd upstream.

SuperQ avatar Sep 25 '18 14:09 SuperQ

Getting back to this, it looks like systemctl status is tricking us due to ephemerally loading the observed unit, which is instead not loaded right before or after the observation (being disabled and inactive).

It looks like there are DBus methods to get the "enabled state" for unit files, and to get the "active state" for enabled units, but I don't think there is a single method to get the union of those (and the primary keys are different object-types).

In the end, I think this boils down to semantics. This collector is in fact reporting activation state for loaded units, but that set is dynamic and also influenced by observers.

lucab avatar Oct 12 '18 12:10 lucab

Thanks, one option is we could use the list unit files function.

SuperQ avatar Oct 12 '18 17:10 SuperQ

@SuperQ I'm planning to do some PoC for this later tonight (using ListUnitFiles as the base instead of ListUnits) if you're not already working on it?

miono avatar Oct 15 '18 11:10 miono

@miono No, I haven't started on it. Looking forward to the PoC. Thanks!

SuperQ avatar Oct 15 '18 11:10 SuperQ

So my initial thought was to keep the call to ListUnits, add another call to ListUnitFiles, and then diff the loaded units against the unit files.

By adding a bool called "enabled" or something to the unit struct, we could add the disabled unit files as unit structs with 0's in (activating|active|deactivating|failed|inactive), and also populate this field for the loaded units with the data we get from ListUnitFiles.

However:

  • This diff would risk not being 100 % accurate, since only the last part of the unitfile-path would be diffable with the list of loaded units (For example /lib/systemd/system/blabla.service and /etc/systemd/system/blabla.service would show up twice in the call from ListUnitFiles, I guess we would need to implement the same priority systemd is using for loading unit-files to check if the correct unit-file is enabled or not).

  • More importantly: This would conflict with the intent of #567, since we would start exposing metrics for stuff that shouldn't be enabled. In commit 0fdc0891, the check && unit.LoadState == "loaded" was added to only show loaded units.

That behaviour would be more desirable for us at my workplace, since we're using a whitelist parameter. But of course not everyone is us.

What is the desired behaviour? My two cents is that metrics that suddenly disappear are confusing. It can also cause problems if you're alerting on active = 1: if a mistakenly disabled service stops running, there is no such metric and therefore no alert.
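
To make the diffing idea above concrete, here is a rough Go sketch. It is only an illustration under stated assumptions: the types loadedUnit, unitFile and mergedUnit and the function mergeUnits are hypothetical stand-ins for the ListUnits and ListUnitFiles replies, not go-systemd's or node_exporter's real types. It matches unit files to loaded units by the last path component and deliberately does not implement systemd's unit file priority, so the accuracy caveat from the first bullet still applies.

```go
package main

import (
	"fmt"
	"path"
	"sort"
)

// loadedUnit is a hypothetical stand-in for one entry of a ListUnits reply.
type loadedUnit struct {
	Name        string
	ActiveState string
}

// unitFile is a hypothetical stand-in for one entry of a ListUnitFiles reply.
type unitFile struct {
	Path  string // full path, e.g. /etc/systemd/system/blabla.service
	State string // enablement state, e.g. "enabled" or "disabled"
}

// mergedUnit is what a collector could expose under this proposal.
type mergedUnit struct {
	Name        string
	ActiveState string // empty when the unit is not loaded
	Loaded      bool
	Enabled     bool
}

// mergeUnits combines loaded units with unit files, keyed by unit name.
func mergeUnits(loaded []loadedUnit, files []unitFile) []mergedUnit {
	byName := make(map[string]*mergedUnit)
	for _, u := range loaded {
		byName[u.Name] = &mergedUnit{Name: u.Name, ActiveState: u.ActiveState, Loaded: true}
	}
	for _, f := range files {
		// Only the last path component is comparable with a loaded unit name.
		name := path.Base(f.Path)
		m, ok := byName[name]
		if !ok {
			m = &mergedUnit{Name: name}
			byName[name] = m
		}
		// Simplification (assumption): any enabled copy marks the unit as enabled.
		// Duplicate paths for the same name collapse into one entry.
		if f.State == "enabled" {
			m.Enabled = true
		}
	}
	out := make([]mergedUnit, 0, len(byName))
	for _, m := range byName {
		out = append(out, *m)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	loaded := []loadedUnit{{Name: "sshd.service", ActiveState: "active"}}
	files := []unitFile{
		{Path: "/usr/lib/systemd/system/sshd.service", State: "enabled"},
		{Path: "/lib/systemd/system/blabla.service", State: "disabled"},
		{Path: "/etc/systemd/system/blabla.service", State: "disabled"},
	}
	for _, m := range mergeUnits(loaded, files) {
		fmt.Printf("%+v\n", m)
	}
}
```

Units that come back with Loaded set to false are the ones that would be exported with 0's in all five states under this proposal.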

miono avatar Oct 15 '18 18:10 miono

I have the same issue when running the latest version of node exporter in a container. The nginx and postgresql service statuses are not returned at all when these services are stopped, but when you start them, node-exporter shows all possible statuses with 1 on the active status. However, the mysqld service is shown correctly: when it's stopped, the statuses are returned with 1 showing on inactive.

saniatk1985 avatar Oct 18 '18 07:10 saniatk1985

In my case, I will work around this for now by using blackbox exporter to query an endpoint, but that's not exactly the same as checking whether a process is up, and in many cases a process may not have any endpoints. I hope it will be fixed some time soon.

mlushpenko avatar Mar 25 '19 14:03 mlushpenko

@mlushpenko The best option is to have a Prometheus /metrics endpoint on the service. This provides both the blackbox check and service status, eliminating the need for watching systemd at all. :smile:

SuperQ avatar Mar 25 '19 14:03 SuperQ

Hi @SuperQ, is there any fix for this? I notice that the issue still exists in node_exporter v0.17.0. Sincerely thanks...

zhanglijingisme avatar Jul 01 '19 07:07 zhanglijingisme

There is no current fix, because systemd does not provide the required information over dbus.

SuperQ avatar Jul 01 '19 09:07 SuperQ

I asked how to keep a unit loaded even when stopped here: https://github.com/systemd/systemd/issues/5063#issuecomment-518456418

This is the response I got:

https://github.com/systemd/systemd/issues/5063#issuecomment-518553166

Use RefUnit() via the bus to continuously reference a unit. In that case it stays loaded until you call UnrefUnit(), or disconnect from the bus, and no other reason is in place to keep it loaded. RefUnit() is available to privileged clients only and since v232 (i.e. ~2016)

I looked at the code

https://github.com/systemd/systemd/blob/f3d3a9ca0734c298cc3bf08f8c4907dd19ee9939/src/core/dbus-manager.c#L2488

https://github.com/systemd/systemd/blob/f3d3a9ca0734c298cc3bf08f8c4907dd19ee9939/src/core/dbus-manager.c#L654

https://github.com/systemd/systemd/blob/f3d3a9ca0734c298cc3bf08f8c4907dd19ee9939/src/core/dbus-manager.c#L567

But I'm still not sure how to use RefUnit in a systemd unit file.

Are these links helpful?

My workaround to get on, off, and failed state information from my process manager into my metrics system is to use supervisor instead.

vagrant@srv0:~$ sudo systemctl start app{1,2,3}
vagrant@srv0:~$ sudo supervisorctl start app{1,2,3}
app1: started
app2: started
app3: started
vagrant@srv0:~$ sudo systemctl stop app3
vagrant@srv0:~$ sudo supervisorctl stop app3
app3: stopped
vagrant@srv0:~$ curl -s localhost:9100/metrics | grep 'app[123]' | grep state
node_supervisord_state{group="app1",name="app1"} 20
node_supervisord_state{group="app2",name="app2"} 20
node_supervisord_state{group="app3",name="app3"} 0
node_systemd_unit_state{name="app1.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="app1.service",state="active",type="simple"} 1
node_systemd_unit_state{name="app1.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="app1.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="app1.service",state="inactive",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="active",type="simple"} 1
node_systemd_unit_state{name="app2.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="inactive",type="simple"} 0

mbigras avatar Aug 06 '19 23:08 mbigras

Looking more closely at another comment

https://github.com/systemd/systemd/issues/5063#issuecomment-518524231

A service will stay loaded if it is wanted/required/etc by something...

I was able to keep a reference to an inactive unit by creating a webapp.service and a webapp.target that wants the webapp.service

This looks like it works!

systemctl cat webapp.{service,target}
# /etc/systemd/system/webapp.service
[Service]
User=webapp
ExecStart=/etc/systemd/system/webapp
Restart=on-failure
RemainAfterExit=true

# /etc/systemd/system/webapp.target
[Unit]
Wants=webapp.service
vagrant@srv0:~/pystemd$ curl -s localhost:9100/metrics | grep webapp
node_supervisord_exit_status{group="webapp",name="webapp"} 0
node_supervisord_state{group="webapp",name="webapp"} 0
node_supervisord_up{group="webapp",name="webapp"} 0
node_systemd_unit_state{name="webapp.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="active",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="inactive",type="simple"} 1
node_systemd_unit_state{name="webapp.target",state="activating",type=""} 0
node_systemd_unit_state{name="webapp.target",state="active",type=""} 1
node_systemd_unit_state{name="webapp.target",state="deactivating",type=""} 0
node_systemd_unit_state{name="webapp.target",state="failed",type=""} 0
node_systemd_unit_state{name="webapp.target",state="inactive",type=""} 0

mbigras avatar Aug 08 '19 00:08 mbigras

I am having the same problem. It is quite difficult to write a proper target unit without side effects.

B-Lukas avatar Oct 24 '19 09:10 B-Lukas