node_exporter
Systemd service metrics missing when loaded but disabled
Host operating system: output of uname -a
3.10.0-862.11.6.el7.x86_64 Red Hat Enterprise Linux Server release 7.5 (Maipo)
node_exporter version: output of node_exporter --version
node_exporter, version 0.16.0 (branch: HEAD, revision: d42bd70f4363dced6b77d8fc311ea57b63387e4f) build user: root@a67a9bc13a69 build date: 20180515-15:52:42 go version: go1.9.6
node_exporter command line flags
ExecStart=/home/prometheus/node_exporter/node_exporter --collector.systemd
Are you running node_exporter in Docker?
No
What did you do that produced an error?
I have a custom systemd service defined in /etc/systemd/system for the Keepalived daemon. Running the following query returns the expected results with all five defined states: node_systemd_unit_state{instance="x.x.x.x:9100",name="keepalived.service"}
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="activating"} 0
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="active"} 1
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="deactivating"} 0
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="failed"} 0
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="inactive"} 0
Once I issue sudo systemctl stop keepalived.service and run the query again, Prometheus returns nothing. It is as if the service was never defined. If I run the query without filtering on job name, every other service is returned. Once I start the service, the metrics return again.
What did you expect to see?
Expected to continue to see the states returned, but with Active=0 and Inactive=1
What did you see instead?
No metrics for the service were returned, period. Nothing. Blank screen. I have a secondary server which I thought was an exact mirror image of the server exhibiting the issue, and it does not experience this problem. Thank you for everyone's time.
Just to add more information, this appears to be an issue with user-defined services only, as the default system services do not disappear after stopping.
That's very strange. The exporter is doing fairly simple requests for data over dbus. Perhaps it's a systemd bug?
Maybe a bug with the ListUnitsFiltered dbus request. We could try going back to ListUnits, and filter for loaded units in the exporter instead of trusting systemd.
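For illustration, a minimal sketch of that filtering approach, assuming the go-systemd dbus bindings the exporter uses (this is just the idea, not the actual change in the PR below):

package main

import (
	"fmt"

	"github.com/coreos/go-systemd/dbus"
)

// loadedUnits lists all units and keeps only those whose LoadState is
// "loaded", instead of relying on systemd to pre-filter the list.
func loadedUnits() ([]dbus.UnitStatus, error) {
	conn, err := dbus.New() // system bus connection to the systemd manager
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	all, err := conn.ListUnits()
	if err != nil {
		return nil, err
	}
	loaded := make([]dbus.UnitStatus, 0, len(all))
	for _, u := range all {
		if u.LoadState == "loaded" {
			loaded = append(loaded, u)
		}
	}
	return loaded, nil
}

func main() {
	units, err := loadedUnits()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d loaded units\n", len(units))
}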
@hueyg Can you verify https://github.com/prometheus/node_exporter/pull/1083 fixes the issue?
@SuperQ Hey Ben, I apologize but I am a pretty ignorant GitHub user. I can see where it looks like you have updated the Go code to perform some more checks on the state of the units defined in systemd. Does this mean that you want me to recompile a new version of node_exporter with this updated code and try again?
@hueyg Yes, if you can checkout the code, and build it, that would help. Otherwise I can post a binary if you trust me. :grin:
You can follow the standard build instructions, but run git checkout superq/systemd_filter before you run make build.
In my test GCE instance of CentOS 7.5, I see this difference in metrics: Before:
node_systemd_units{state="active"} 101
node_systemd_units{state="inactive"} 58
After:
node_systemd_units{state="active"} 101
node_systemd_units{state="inactive"} 75
But I don't see a difference in the number of unique units in node_systemd_unit_state, of which I only see 160. Very strange.
@SuperQ I have no problem if you want to post a compiled binary, but I will work on it in the meantime. I am really pushing to get this resolved because this is a show stopper for the project. It definitely seems related to custom-defined services. What has me totally confused is that this problem is not exhibited on what I am pretty sure is an identical secondary server. This is a group of two HAProxy servers with custom units defined for HAProxy and Keepalived.
@SuperQ I think I found the issue Ben. Give me ten more minutes.
@SuperQ First let me apologize if this is a known issue/requirement, but the difference is that the custom-defined unit was specifically "enabled" on the working server and "disabled" by default on the non-working server. My limited understanding of systemd is that this simply means whether the unit is set to start at boot or not. So the "state" of the unit file itself plays a role in what node_exporter can see. Once I set the custom service to enabled: sudo systemctl enable haproxy.service
Once that command is issued, the service is still returned by node_exporter after being issued a stop command.
Interesting, I thought for sure that even a disabled but running service would show up.
@lucab Do you have any idea why a service like this wouldn't show up in ListUnits():
# systemctl status chronyd.service
● chronyd.service - NTP client/server
Loaded: loaded (/etc/systemd/system/chronyd.service; disabled; vendor preset: enabled)
Active: inactive (dead)
@hueyg I did some additional testing, it doesn't seem to matter if the stopped/disabled unit is in /etc/systemd/system or /usr/lib/systemd/system. It fails to show up when stopped/disabled.
@SuperQ I think that's because the unit is inactive and disabled. Additionally, I fear that at some point the ListUnits DBus method may have changed semantics, as its documentation mentions "loaded units" everywhere (either the doc or the code is wrong). My suggestion would be to try using ListUnitsFiltered instead.
@lucab It seems like ListUnitsFiltered() has the same problem. systemctl status says it's loaded, but when we ask for the "loaded" list, inactive/disabled units are missing.
I double checked, same problem with ListUnitsFiltered([]string{}). "Loaded" but disabled units are not returned.
@SuperQ which OS and systemd version are you seeing this on (OP was on RHEL7.5)? I can carry this over to a go-systemd ticket and have a look as soon as I have time.
@lucab I was testing with CentOS 7.5 and Ubuntu 18.04 (systemd 237).
Thanks, feel free to ping me on the go-systemd upstream.
Getting back to this, it looks like systemctl status is tricking us by ephemerally loading the observed unit, which is otherwise not loaded right before or after the observation (being disabled and inactive).
It looks like there are DBus methods to get the "enabled state" for unit files, and to get the "active state" for enabled units, but I don't think there is a single method to get the union of those (and the primary keys are different object-types).
In the end, I think this boils down to semantics. This collector is in fact reporting activation state for loaded units, but that set is dynamic and also influenced by observers.
Thanks, one option is that we could use the ListUnitFiles function.
@SuperQ I'm planning to do some PoC for this later tonight (using ListUnitFiles as the base instead of ListUnits) if you're not already working on it?
@miono No, I haven't started on it. Looking forward to the PoC. Thanks!
So my initial thought was to keep the call to ListUnits and add another call to ListUnitFiles, then diff the loaded units against the unit files.
By adding a bool to the unit struct called "enabled" or something, we could add the disabled unit files as unit structs with 0's in (activating|active|deactivating|failed|inactive), and also populate this field for the loaded units with the data we get from ListUnitFiles.
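A rough sketch of that diffing idea, assuming the go-systemd dbus bindings; the unit struct, the enablement field, and the matching logic are hypothetical and not the exporter's actual code:

package main

import (
	"fmt"
	"path/filepath"

	"github.com/coreos/go-systemd/dbus"
)

// unit pairs a systemd unit's runtime status with the enablement state
// reported by ListUnitFiles ("enabled", "disabled", "static", ...).
type unit struct {
	dbus.UnitStatus
	enablement string
}

// allUnits merges ListUnits (loaded units) with ListUnitFiles (installed
// unit files), so stopped/disabled units still appear in the result.
func allUnits(conn *dbus.Conn) ([]unit, error) {
	loaded, err := conn.ListUnits()
	if err != nil {
		return nil, err
	}
	files, err := conn.ListUnitFiles()
	if err != nil {
		return nil, err
	}

	seen := map[string]bool{}
	units := make([]unit, 0, len(loaded)+len(files))
	for _, u := range loaded {
		seen[u.Name] = true
		units = append(units, unit{UnitStatus: u})
	}
	for _, f := range files {
		// Only the basename of the unit-file path can be compared with the
		// loaded unit names; duplicates across /lib and /etc are not resolved
		// here (the accuracy caveat noted below).
		name := filepath.Base(f.Path)
		if seen[name] {
			continue
		}
		// Unloaded unit file: no ActiveState, so a collector could emit 0 for
		// all five state labels for this unit.
		units = append(units, unit{
			UnitStatus: dbus.UnitStatus{Name: name},
			enablement: f.Type,
		})
	}
	return units, nil
}

func main() {
	conn, err := dbus.New()
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	units, err := allUnits(conn)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d units (loaded plus unloaded unit files)\n", len(units))
}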
However:
- This diff would risk not being 100% accurate, since only the last part of the unit-file path can be compared with the list of loaded units. For example, /lib/systemd/system/blabla.service and /etc/systemd/system/blabla.service would show up twice in the call to ListUnitFiles; I guess we would need to implement the same priority systemd uses when loading unit files to check whether the correct unit file is enabled or not.
- More importantly: this would conflict with the intent of #567, since we would start exposing metrics for units that shouldn't be enabled. In commit 0fdc0891, && unit.LoadState == "loaded" was added to only show loaded units.
That behaviour would be more desirable for us at my workplace, since we're using a whitelist parameter. But of course not everyone is us.
What is the desired behaviour? My two cents: it's confusing when metrics suddenly disappear, and it can also cause problems if you're alerting on active = 1 and then there's no such metric, and therefore no alert, when a mistakenly disabled service stops running.
I have the same issue when running the latest version of node_exporter in a container: nginx and postgresql service statuses are not returned at all when these services are stopped, but when you start them node_exporter shows all possible statuses with 1 on the active status. However, the mysqld service is shown correctly: when it's stopped, the statuses are returned with 1 on inactive.
In my case I will work around it for now by using the blackbox exporter to query an endpoint, but that's not exactly the same as checking whether the process is up, and in many cases a process may not have any endpoints. I hope it will be fixed some time soon.
@mlushpenko The best option is to have a Prometheus /metrics endpoint on the service. This provides both the blackbox check and service status, eliminating the need for watching systemd at all. :smile:
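For reference, a minimal sketch of what that looks like with the Prometheus Go client (the port is an arbitrary example):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Expose the default registry (Go runtime/process metrics plus anything
	// the service registers) on /metrics; if this endpoint answers, the
	// service is up, so the systemd collector isn't needed for liveness.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}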
Hi @SuperQ, is there any fix for this? I notice that this issue still exists in node_exporter v0.17.0. Sincerely thanks...
There is no current fix, because systemd does not provide the required information over dbus.
I asked how to keep a unit loaded even when stopped here: https://github.com/systemd/systemd/issues/5063#issuecomment-518456418
This is the response I got:
https://github.com/systemd/systemd/issues/5063#issuecomment-518553166
Use RefUnit() via the bus to continuously reference a unit. In that case it stays loaded until you call UnrefUnit(), or disconnect from the bus, and no other reason is in place to keep it loaded. RefUnit() is available to privileged clients only and since v232 (i.e. ~2016)
I looked at the code
https://github.com/systemd/systemd/blob/f3d3a9ca0734c298cc3bf08f8c4907dd19ee9939/src/core/dbus-manager.c#L2488
https://github.com/systemd/systemd/blob/f3d3a9ca0734c298cc3bf08f8c4907dd19ee9939/src/core/dbus-manager.c#L654
https://github.com/systemd/systemd/blob/f3d3a9ca0734c298cc3bf08f8c4907dd19ee9939/src/core/dbus-manager.c#L567
But I'm still not sure how to use RefUnit in a systemd unit file.
Are these links helpful?
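As far as I can tell, RefUnit is a method on the systemd manager's D-Bus interface, not a unit-file directive, so it can't go in a unit file directly. A hedged sketch of what a privileged client could do over the bus (method and object names follow the systemd D-Bus API; the unit name is just an example):

package main

import (
	"github.com/godbus/dbus"
)

// refUnit takes a reference on a unit via org.freedesktop.systemd1.Manager.RefUnit,
// which keeps the unit loaded until UnrefUnit is called or the connection closes.
func refUnit(conn *dbus.Conn, name string) error {
	obj := conn.Object("org.freedesktop.systemd1", "/org/freedesktop/systemd1")
	return obj.Call("org.freedesktop.systemd1.Manager.RefUnit", 0, name).Err
}

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}
	if err := refUnit(conn, "webapp.service"); err != nil {
		panic(err)
	}
	// The reference only lives as long as this bus connection, so keep it open.
	select {}
}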
My workaround to get on/off/failed state information from my process manager into my metrics system is to use supervisor instead.
vagrant@srv0:~$ sudo systemctl start app{1,2,3}
vagrant@srv0:~$ sudo supervisorctl start app{1,2,3}
app1: started
app2: started
app3: started
vagrant@srv0:~$ sudo systemctl stop app3
vagrant@srv0:~$ sudo supervisorctl stop app3
app3: stopped
vagrant@srv0:~$ curl -s localhost:9100/metrics | grep 'app[123]' | grep state
node_supervisord_state{group="app1",name="app1"} 20
node_supervisord_state{group="app2",name="app2"} 20
node_supervisord_state{group="app3",name="app3"} 0
node_systemd_unit_state{name="app1.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="app1.service",state="active",type="simple"} 1
node_systemd_unit_state{name="app1.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="app1.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="app1.service",state="inactive",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="active",type="simple"} 1
node_systemd_unit_state{name="app2.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="inactive",type="simple"} 0
Looking more closely at another comment
https://github.com/systemd/systemd/issues/5063#issuecomment-518524231
A service will stay loaded if it is wanted/required/etc by something...
I was able to keep a reference to an inactive unit by creating a webapp.service and a webapp.target that wants the webapp.service
This looks like it works!
systemctl cat webapp.{service,target}
# /etc/systemd/system/webapp.service
[Service]
User=webapp
ExecStart=/etc/systemd/system/webapp
Restart=on-failure
RemainAfterExit=true
# /etc/systemd/system/webapp.target
[Unit]
Wants=webapp.service
vagrant@srv0:~/pystemd$ curl -s localhost:9100/metrics | grep webapp
node_supervisord_exit_status{group="webapp",name="webapp"} 0
node_supervisord_state{group="webapp",name="webapp"} 0
node_supervisord_up{group="webapp",name="webapp"} 0
node_systemd_unit_state{name="webapp.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="active",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="inactive",type="simple"} 1
node_systemd_unit_state{name="webapp.target",state="activating",type=""} 0
node_systemd_unit_state{name="webapp.target",state="active",type=""} 1
node_systemd_unit_state{name="webapp.target",state="deactivating",type=""} 0
node_systemd_unit_state{name="webapp.target",state="failed",type=""} 0
node_systemd_unit_state{name="webapp.target",state="inactive",type=""} 0
I am having the same problem. It is quite difficult to write a proper target unit without side effects.