node_exporter
[node_exporter][metric] node_systemd_unit_state with labels: "high-level unit activation state" and "low-level unit activation state"
Host operating system: output of uname -a
Linux server01 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version: output of node_exporter --version
node_exporter, version 0.18.1 (branch: HEAD, revision: 3db77732e925c08f675d7404a8c46466b2ece83e) build user: root@b50852a1acba build date: 20190604-16:41:18 go version: go1.12.5
node_exporter command line flags
usage: node_exporter [<flags>]
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--collector.diskstats.ignored-devices="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p)\\d+$"
Regexp of devices to ignore for diskstats.
--collector.filesystem.ignored-mount-points="^/(dev|proc|sys|var/lib/docker/.+)($|/)"
Regexp of mount points to ignore for filesystem collector.
--collector.filesystem.ignored-fs-types="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
Regexp of filesystem types to ignore for filesystem collector.
--collector.netclass.ignored-devices="^$"
Regexp of net devices to ignore for netclass collector.
--collector.netdev.ignored-devices="^$"
Regexp of net devices to ignore for netdev collector.
--collector.netstat.fields="^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(Listen.*|Syncookies.*|TCPSynRetrans)|Tcp_(ActiveOpens|InSegs|OutSegs|PassiveOpens|RetransSegs|CurrEstab)|Udp6?_(InDatagrams|OutDatagrams|NoPorts))$"
Regexp of fields to return for netstat collector.
--collector.ntp.server="127.0.0.1"
NTP server to use for ntp collector
--collector.ntp.protocol-version=4
NTP protocol version
--collector.ntp.server-is-local
Certify that collector.ntp.server address is the same local host as this collector.
--collector.ntp.ip-ttl=1 IP TTL to use while sending NTP query
--collector.ntp.max-distance=3.46608s
Max accumulated distance to the root
--collector.ntp.local-offset-tolerance=1ms
Offset between local clock and local ntpd time to tolerate
--path.procfs="/proc" procfs mountpoint.
--path.sysfs="/sys" sysfs mountpoint.
--path.rootfs="/" rootfs mountpoint.
--collector.qdisc.fixtures=""
test fixtures to use for qdisc collector end-to-end testing
--collector.runit.servicedir="/etc/service"
Path to runit service directory.
--collector.supervisord.url="http://localhost:9001/RPC2"
XML RPC endpoint.
--collector.systemd.unit-whitelist=".+"
Regexp of systemd units to whitelist. Units must both match whitelist and not match blacklist to be included.
--collector.systemd.unit-blacklist=".+\\.(automount|device|mount|scope|slice)"
Regexp of systemd units to blacklist. Units must both match whitelist and not match blacklist to be included.
--collector.systemd.private
Establish a private, direct connection to systemd without dbus.
--collector.systemd.enable-task-metrics
Enables service unit tasks metrics unit_tasks_current and unit_tasks_max
--collector.systemd.enable-restarts-metrics
Enables service unit metric service_restart_total
--collector.systemd.enable-start-time-metrics
Enables service unit metric unit_start_time_seconds
--collector.textfile.directory=""
Directory to read text files with metrics from.
--collector.vmstat.fields="^(oom_kill|pgpg|pswp|pg.*fault).*"
Regexp of fields to return for vmstat collector.
--collector.wifi.fixtures=""
test fixtures to use for wifi collector metrics
--collector.arp Enable the arp collector (default: enabled).
--collector.bcache Enable the bcache collector (default: enabled).
--collector.bonding Enable the bonding collector (default: enabled).
--collector.buddyinfo Enable the buddyinfo collector (default: disabled).
--collector.conntrack Enable the conntrack collector (default: enabled).
--collector.cpu Enable the cpu collector (default: enabled).
--collector.cpufreq Enable the cpufreq collector (default: enabled).
--collector.diskstats Enable the diskstats collector (default: enabled).
--collector.drbd Enable the drbd collector (default: disabled).
--collector.edac Enable the edac collector (default: enabled).
--collector.entropy Enable the entropy collector (default: enabled).
--collector.filefd Enable the filefd collector (default: enabled).
--collector.filesystem Enable the filesystem collector (default: enabled).
--collector.hwmon Enable the hwmon collector (default: enabled).
--collector.infiniband Enable the infiniband collector (default: enabled).
--collector.interrupts Enable the interrupts collector (default: disabled).
--collector.ipvs Enable the ipvs collector (default: enabled).
--collector.ksmd Enable the ksmd collector (default: disabled).
--collector.loadavg Enable the loadavg collector (default: enabled).
--collector.logind Enable the logind collector (default: disabled).
--collector.mdadm Enable the mdadm collector (default: enabled).
--collector.meminfo Enable the meminfo collector (default: enabled).
--collector.meminfo_numa Enable the meminfo_numa collector (default: disabled).
--collector.mountstats Enable the mountstats collector (default: disabled).
--collector.netclass Enable the netclass collector (default: enabled).
--collector.netdev Enable the netdev collector (default: enabled).
--collector.netstat Enable the netstat collector (default: enabled).
--collector.nfs Enable the nfs collector (default: enabled).
--collector.nfsd Enable the nfsd collector (default: enabled).
--collector.ntp Enable the ntp collector (default: disabled).
--collector.perf Enable the perf collector (default: disabled).
--collector.pressure Enable the pressure collector (default: enabled).
--collector.processes Enable the processes collector (default: disabled).
--collector.qdisc Enable the qdisc collector (default: disabled).
--collector.runit Enable the runit collector (default: disabled).
--collector.sockstat Enable the sockstat collector (default: enabled).
--collector.stat Enable the stat collector (default: enabled).
--collector.supervisord Enable the supervisord collector (default: disabled).
--collector.systemd Enable the systemd collector (default: disabled).
--collector.tcpstat Enable the tcpstat collector (default: disabled).
--collector.textfile Enable the textfile collector (default: enabled).
--collector.time Enable the time collector (default: enabled).
--collector.timex Enable the timex collector (default: enabled).
--collector.uname Enable the uname collector (default: enabled).
--collector.vmstat Enable the vmstat collector (default: enabled).
--collector.wifi Enable the wifi collector (default: disabled).
--collector.xfs Enable the xfs collector (default: enabled).
--collector.zfs Enable the zfs collector (default: enabled).
--web.listen-address=":9100"
Address on which to expose metrics and web interface.
--web.telemetry-path="/metrics"
Path under which to expose metrics.
--web.disable-exporter-metrics
Exclude metrics about the exporter itself (promhttp_*, process_*, go_*).
--web.max-requests=40 Maximum number of parallel scrape requests. Use 0 to disable.
--log.level="info" Only log messages with the given severity or above. Valid levels: [debug, info, warn, error, fatal]
--log.format="logger:stderr"
Set the log target and format. Example: "logger:syslog?appname=bob&local=7" or "logger:stdout?json=true"
--version Show application version.
Are you running node_exporter in Docker?
N/A
What did you do that produced an error?
It would be great if node_exporter exposed the SUB column (the low-level unit activation state) as a label on the node_systemd_unit_state metric (see https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sect-managing_services_with_systemd-services).
$ systemctl list-units --type service
UNIT LOAD ACTIVE SUB DESCRIPTION
node_exporter.service loaded active running Prometheus Node Exporter
$ systemctl status node_exporter -l
● node_exporter.service - Prometheus Node Exporter
Loaded: loaded (/usr/lib/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2019-07-01 10:39:36 CEST; 1 day 1h ago
Main PID: 13320 (node_exporter)
What did you expect to see?
It would be great to see both the "high-level unit activation state" (ACTIVE) and the "low-level unit activation state" (SUB) as labels on the node_systemd_unit_state metric. At the moment only the state is exposed, without the substate. Below, I've added the substate label:
node_systemd_unit_state{alias="server01",env="int",instance="192.168.11.11:9100",job="server01",name="node_exporter.service",state="active",substate="running",type="simple"}
What did you see?
node_systemd_unit_state{alias="server01",env="int",instance="192.168.11.11:9100",job="server01",name="node_exporter.service",state="active",type="simple"}
While this may seem trivial at first glance, it's a lot more complicated. The combination of unit states and substates is quite a long list.
From systemctl --state=help:
Available unit load states:
stub
loaded
not-found
error
merged
masked
Available unit active states:
active
reloading
inactive
failed
activating
deactivating
Available automount unit substates:
dead
waiting
running
failed
Available device unit substates:
dead
tentative
plugged
Available mount unit substates:
dead
mounting
mounting-done
mounted
remounting
unmounting
remounting-sigterm
remounting-sigkill
unmounting-sigterm
unmounting-sigkill
failed
Available path unit substates:
dead
waiting
running
failed
Available scope unit substates:
dead
running
abandoned
stop-sigterm
stop-sigkill
failed
Available service unit substates:
dead
start-pre
start
start-post
running
exited
reload
stop
stop-sigabrt
stop-sigterm
stop-sigkill
stop-post
final-sigterm
final-sigkill
failed
auto-restart
Available slice unit substates:
dead
active
Available socket unit substates:
dead
start-pre
start-chown
start-post
listening
running
stop-pre
stop-pre-sigterm
stop-pre-sigkill
stop-post
final-sigterm
final-sigkill
failed
Available swap unit substates:
dead
activating
activating-done
active
deactivating
deactivating-sigterm
deactivating-sigkill
failed
Available target unit substates:
dead
active
Available timer unit substates:
dead
waiting
running
elapsed
failed
To do this correctly, we have to expand the current state bitmask into the full combination of sub-states. Even with this help info, the valid state + sub-state combinations aren't mapped. For example, is failed + running a valid combination?
We also need to detect which type of unit each one is and only expose the sub-states that are valid for that type.
This might work better as a separate metric, node_systemd_unit_substate, which would simplify dealing with the valid combinations.
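A minimal sketch of the separate-metric idea, written in Python rather than the exporter's Go and meant only as an illustration. It assumes the unit type can be read from the suffix of the unit name, and it uses a hand-copied subset of the substate lists above. The names VALID_SUBSTATES, unit_type, substate_samples, and render are hypothetical, not node_exporter code.

```python
# Hypothetical sketch: one series per substate that is valid for the unit's type.
# The table is a hand-copied subset of the `systemctl --state=help` output above.
VALID_SUBSTATES = {
    "service": ["dead", "start-pre", "start", "start-post", "running", "exited",
                "reload", "stop", "stop-sigabrt", "stop-sigterm", "stop-sigkill",
                "stop-post", "final-sigterm", "final-sigkill", "failed",
                "auto-restart"],
    "target": ["dead", "active"],
    "timer": ["dead", "waiting", "running", "elapsed", "failed"],
}


def unit_type(unit_name):
    """Assumption: the unit type is the suffix after the last dot."""
    return unit_name.rsplit(".", 1)[-1]


def substate_samples(unit_name, current_substate):
    """Return (labels, value) pairs: 1 for the current substate, 0 otherwise."""
    substates = VALID_SUBSTATES.get(unit_type(unit_name), [])
    return [
        ({"name": unit_name, "substate": s}, 1 if s == current_substate else 0)
        for s in substates
    ]


def render(samples, metric="node_systemd_unit_substate"):
    """Format samples in the style of the Prometheus text exposition format."""
    lines = []
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        lines.append(f"{metric}{{{label_str}}} {value}")
    return lines
```

Calling render(substate_samples("node_exporter.service", "running")) yields one line per service substate, with only the substate="running" series set to 1. Because only the substates valid for the unit's type are emitted, the question of which state + sub-state pairs are valid never has to be answered by the exporter.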
Sounds great, thank you for the detailed explanation. At the beginning, I was wondering whether it would be possible to have only two scenarios, like:
node_systemd_unit_state{alias="server01",env="int",instance="192.168.11.11:9100",job="server01",name="node_exporter.service",state="active",substate="running",type="simple"}
and, with every substate other than substate="running" grouped under substate="failed":
node_systemd_unit_state{alias="server01",env="int",instance="192.168.11.11:9100",job="server01",name="node_exporter.service",state="active",substate="failed",type="simple"}
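As a rough sketch of that two-scenario idea (Python, for illustration only; collapse_substate and collapsed_labels are hypothetical helpers, not anything in node_exporter):

```python
def collapse_substate(substate):
    """Two-scenario idea: keep "running", report everything else as "failed"."""
    return "running" if substate == "running" else "failed"


def collapsed_labels(name, state, substate, unit_type):
    """Build the label set for one node_systemd_unit_state series (hypothetical)."""
    return {
        "name": name,
        "state": state,
        "substate": collapse_substate(substate),
        "type": unit_type,
    }
```

Under this mapping, a service in the exited substate would also be reported as substate="failed", so some information is lost.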
But I think what you detailed makes more sense.
Hi @discordianfish any news? :)
Not that I'm aware of. We're open to submissions that implement this, but I don't think anyone has worked on it yet.
Hi @discordianfish, @SuperQ, maybe a future release of node_exporter will have this.
I think we're open to including this, so if you want to implement it, we'll consider it.
I am interested in discussing this issue. The status of my system is as follows.
[root@localhost ~]# systemctl is-enabled node_exporter
disabled
[root@localhost ~]# systemctl list-units --type service
UNIT LOAD ACTIVE SUB DESCRIPTION
● node_exporter.service loaded failed failed Prometheus Node Exporter
[root@localhost ~]# systemctl status node_exporter
● node_exporter.service - Prometheus Node Exporter
Loaded: loaded (/etc/systemd/system/node_exporter.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2023-03-21 19:28:28 KST; 1 day 15h ago
Main PID: 8572 (code=exited, status=1/FAILURE)
...
In this state, an alert is triggered by the following rule, which we do not want:
node_systemd_unit_state{state="failed",type!="oneshot"} == 1
It would be good if we could prevent the alert using expressions like:
node_systemd_unit_state{state!="disabled",substate="failed",type!="oneshot"} == 1
node_systemd_unit_state{state!="inactive",substate="failed",type!="oneshot"} == 1