prometheus-node-exporter-ucode: new modules and extended info
-thermal management (temp and cooling devices) -uname amendment to oneline and help string -realtek-poe -odhcp6c -entropy help strings -time help string -hwmon -time clocksources -watchdog -metrics help strings
📦 Package Details
Maintainer: @dhewg (?) @feckert
🧪 Run Testing Details
- OpenWrt Version:24.10.4 / 25 snapshot
✅ Formalities
- [x] I have reviewed the CONTRIBUTING.md file for detailed contributing guidelines.
Nice, generally this looks good to me! (But I haven't yet checked the metric themselves.)
Thanks for the uname replacement! I knew my solution was lame (I even
added a comment about it), but I wonder why I didn't use your way to
begin with...
The watchdog collector needs to be added to the Makefile.
Some of the next comments may sound nitpicking, but let's optimize each collector for a minimal runtime. Using many on slow devices in a tight polling interval really adds up:
- hwmon and realtek_poe add and just call scrape(), please fold the function content out to the main scope to get rid of it
- don't use snprintf() like expression if not necessary (like realtek_poe's METRIC_NAMESPACE, I'd just fold it into each usage)
- creating metrics of the same name should always happen once in the main scope, and not in a loop (time, watchdog)
- watchdog looks like it could use some cleanup, can the
const metric = { ...}logic applied here? To keep the code simple, it uses a metric map to loop oneline() over it
If you didn't know: you can run selected collectors from the cmdline and
check the node_scrape_collector_duration_seconds metric for the runtime:
e.g.: prometheus-node-exporter-ucode netclass
I did that alot to get each collectors runtime down back then ;)
Wow, micro-optimisations! Yeah, I ran reset ; prometheus-node-exporter-ucode thermal alot. Very helpful aspect :)
- creating metrics of the same name should always happen once in the main scope, and not in a loop (time, watchdog)
So we get ordered results like this? (IIRC this accelerates ingestion?):
# HELP node_cooling_device_max_state Maximum throttle state of the cooling device
# TYPE node_cooling_device_max_state gauge
node_cooling_device_max_state{name="0",type="Processor"} 0
node_cooling_device_max_state{name="1",type="Processor"} 0
node_cooling_device_max_state{name="10",type="Processor"} 0
...
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
node_cpu_guest_seconds_total{cpu="1",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="1",mode="user"} 0
node_cpu_guest_seconds_total{cpu="10",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="10",mode="user"} 0
...
Using many on slow devices in a tight polling interval really adds up:
Maniacs crushing their routers with 5 seconds of polling :D
The watchdog collector needs to be added to the Makefile.
It's in base, so it's installed automatically. Should it be an extra?
@GeorgeSapkin this commit length problem is not good. See how many characters one has to work with on this module?? :/
@systemcrash like I said before, I'm not married to the exact number. That one was the only thing in the official guidelines, so that's what I went with. If more people feel differently we can change the limit to something else, which I do think is important to have. We don't need to look far for whole commit messages being one long line.
Edit: this also comes back to blocking build if formal fails. Which I'm still on the fence about.
@systemcrash like I said before, I'm not married to the exact number. That one was the only thing in the official guidelines, so that's what I went with. If more people feel differently we can change the limit to something else, which I do think is important to have. We don't need to look far for whole commit messages being one long line.
Edit: this also comes back to blocking build if formal fails. Which I'm still on the fence about.
I'm OK with a limit at the established 72 characters. But I don't see why that should be a hard-limit. A warning perhaps. A hard-limit at like 120-150 likely flags that there're no newlines used anywhere.
@dhewg if you could take another look? I optimised netdev and netclass for ordered output.
pinging @dhewg - think I've addressed everything here.
- creating metrics of the same name should always happen once in the main scope, and not in a loop (time, watchdog)
So we get ordered results like this? (IIRC this accelerates ingestion?):
Sorry, I just meant:
const m = metric(...);
for (...) {
m();
}
instead of:
for (...) {
const m = metric(...);
m();
}
I appreciate the optimizations, but is the ordered output really worth it? Like in terms of time spend by the prometheus daemon to parse the output? It makes the collectors less readable, especially netdev.
The watchdog collector needs to be added to the Makefile.
It's in base, so it's installed automatically. Should it be an extra?
Let's make it an extra. This is what I get one one of my devices:
node_watchdog_info{name="watchdog0",options="null",identity="null",state="null",status="null",pretimeout_governor="null"} 1
So It'll be just baggage there.
On another device I get:
node_watchdog_info{name="watchdog0",options="0x8180",identity="GPIO Watchdog",state="active",status="0x8000",pretimeout_governor="null"} 1
It's been a while, but those null values don't look right, those should probably checked for?
For hwmon, Is this the expected chip?
node_hwmon_chip_names{chip="thermal_thermal_zone0",chip_name="cpu_thermal"} 1
That's from:
ls -l /sys/class/hwmon/hwmon0/
lrwxrwxrwx 1 root root 0 Dec 9 09:02 device -> ../../thermal_zone0
-r--r--r-- 1 root root 4096 Dec 9 09:02 name
drwxr-xr-x 2 root root 0 Dec 9 09:07 power
lrwxrwxrwx 1 root root 0 Dec 9 09:07 subsystem -> ../../../../../class/hwmon
-r--r--r-- 1 root root 4096 Dec 9 09:02 temp1_input
-rw-r--r-- 1 root root 4096 Dec 9 09:07 uevent
For hwmon, Is this the expected
chip?node_hwmon_chip_names{chip="thermal_thermal_zone0",chip_name="cpu_thermal"} 1
Yes, that looks correct. It pulls the chip_name from:
ls -l /sys/class/hwmon/hwmon0/
...
-r--r--r-- 1 root root 4096 Dec 9 09:02 name
Can you check that file content?
Here's what I get from node-exporter v1.9 on an i3:
# HELP node_hwmon_chip_names Annotation metric for human-readable chip names
# TYPE node_hwmon_chip_names gauge
node_hwmon_chip_names{chip="0000:00:1c_4_0000:03:00_0",chip_name="i350bb"} 1
node_hwmon_chip_names{chip="platform_coretemp_0",chip_name="coretemp"} 1
node_hwmon_chip_names{chip="thermal_thermal_zone0",chip_name="acpitz"} 1
vs:
ls -l /sys/class/hwmon/hwmon0/
total 0
lrwxrwxrwx 1 root root 0 Dec 9 13:17 device -> ../../thermal_zone0
-r--r--r-- 1 root root 4096 Dec 9 13:17 name
drwxr-xr-x 2 root root 0 Dec 9 13:17 power
lrwxrwxrwx 1 root root 0 Dec 9 13:17 subsystem -> ../../../../../class/hwmon
-r--r--r-- 1 root root 4096 Dec 9 13:17 temp1_input
-r--r--r-- 1 root root 4096 Dec 9 13:17 temp2_input
-rw-r--r-- 1 root root 4096 Dec 9 13:17 uevent
# cat /sys/class/hwmon/hwmon0/name
acpitz
is the ordered output really worth it? Like in terms of time spend by the prometheus daemon to parse the output? It makes the collectors less readable, especially
netdev.
It makes output less jumbled and more readable (since HELP strings occur with the TYPE, they're grouped together and that type appears once, rather than only appearing the first time in multiple appearances of that row type later on), and it contributes to lower data-cardinality at ingestion, which IIRC helps coalesce writes.
I think it makes the code more readable - since the loops are distinct and not nested.
OK. It's an extra now. Much appreciated if you could take hwmon for another spin @dhewg
ping @dhewg for final test run
Sorry for the delay, pre-xmas stress...
Well, I agree to not modify the whitespaces! But then please revert the whitespace changes in netclass/netdev, because that is now inconsistent. I'm not married to whatever I used initially, but let's not bikeshed this please.
is the ordered output really worth it? Like in terms of time spend by the prometheus daemon to parse the output?
I think it makes the code more readable - since the loops are distinct and not nested.
Sorry, still disagreeing. While it's not rocket science, it's just more code to read. In netdev we now have 4 loops instead of one. You even added comments to make it clear what it does. It wasn't unclear before at all. It's similar for most of the new files.
So again: Is this reduced readability really worth it? Don't get me wrong, it may as well be! I just don't know and don't have the time to dig in atm. But on the other side I can't see a prometheus daemon really sucking that much if not every collector on earth is optimized in this specific way?
As for hwmon/watchdog, the null values are gone, they're empty string now.
Remind me, is that what prometheus wants? Or should we just not output empty labels?
Don't get me wrong, it may as well be! I just don't know and don't have the time to dig in atm.
(my) Reasoning being that we output ordered, rather than revisiting the same metrics for different devices. It makes stare and compare in the scrape output clearer, among other downstream benefits.
As for hwmon/watchdog, the
nullvalues are gone, they're empty string now. Remind me, is that what prometheus wants? Or should we just not output empty labels?
I'll take another look at that. Empty isn't bad per-se, but we might as well avoid the output altogether.
could you paste an example of the output? Edit: never mind. Found it :)
(my) Reasoning being that we output ordered, rather than revisiting the same metrics for different devices. It makes stare and compare in the scrape output clearer, among other downstream benefits.
I got that, but why should scrape output readability trump code readability? Debugging comes to mind, but we can go just use our standard tools like prometheus-node-exporter-ucode hwmon|grep foo|sort|cut|awk|whatever
"Other downstream benefits" may be the reason to go this way, but that's a bit too hand-wavy argument for my taste ;)
Here's watchdog on an OpenWrt One, we got some (harmless) warnings and an empty pretimeout_governor:
prometheus-node-exporter-ucode watchdog
prometheus-node-exporter-ucode now serving requests with 16 collectors
Status: 200 OK
Content-Type: text/plain; version=0.0.4; charset=utf-8
# HELP node_watchdog_bootstatus Value of /sys/class/watchdog/<watchdog>/bootstatus
# TYPE node_watchdog_bootstatus gauge
node_watchdog_bootstatus{name="watchdog0"} 0
node_watchdog_bootstatus{name="watchdog1"} 0
# HELP node_watchdog_fw_version Value of /sys/class/watchdog/<watchdog>/fw_version
# TYPE node_watchdog_fw_version gauge
node_watchdog_fw_version{name="watchdog0"} 0
node_watchdog_fw_version{name="watchdog1"} 0
# HELP node_watchdog_nowayout Value of /sys/class/watchdog/<watchdog>/nowayout
# TYPE node_watchdog_nowayout gauge
node_watchdog_nowayout{name="watchdog0"} 0
node_watchdog_nowayout{name="watchdog1"} 0
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeleft_seconds)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeleft_seconds)
# HELP node_watchdog_timeout_seconds Value of /sys/class/watchdog/<watchdog>/timeout
# TYPE node_watchdog_timeout_seconds gauge
node_watchdog_timeout_seconds{name="watchdog0"} 30
node_watchdog_timeout_seconds{name="watchdog1"} 31
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_pretimeout_seconds)
# HELP node_watchdog_pretimeout_seconds Value of /sys/class/watchdog/<watchdog>/pretimeout
# TYPE node_watchdog_pretimeout_seconds gauge
node_watchdog_pretimeout_seconds{name="watchdog1"} 15
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_access_cs0)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_access_cs0)
# HELP node_watchdog_info Info of /sys/class/watchdog/<watchdog>
# TYPE node_watchdog_info gauge
node_watchdog_info{name="watchdog0",options="0x8180",identity="GPIO Watchdog",state="active",status="0x8000",pretimeout_governor=""} 1
node_watchdog_info{name="watchdog1",options="0x8380",identity="mtk-wdt",state="inactive",status="0x0",pretimeout_governor="panic"} 1
# HELP node_watchdog_available Info of /sys/class/watchdog/<watchdog>/pretimeout_available_governors
# TYPE node_watchdog_available gauge
node_watchdog_available{available="panic",device="watchdog1"} 1
# TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="watchdog"} 0.005309986
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="watchdog"} 1
avm,fritzbox-7530 looks even more funny:
$ prometheus-node-exporter-ucode watchdog|grep \"\"
prometheus-node-exporter-ucode now serving requests with 20 collectors
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_bootstatus)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_fw_version)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_nowayout)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeleft_seconds)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeout_seconds)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_pretimeout_seconds)
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_access_cs0)
node_watchdog_info{name="watchdog0",options="",identity="",state="",status="",pretimeout_governor=""} 1
`
(my) Reasoning being that we output ordered, rather than revisiting the same metrics for different devices. It makes stare and compare in the scrape output clearer, among other downstream benefits.
I got that, but why should scrape output readability trump code readability? Debugging comes to mind, but we can go just use our standard tools like
prometheus-node-exporter-ucode hwmon|grep foo|sort|cut|awk|whatever"Other downstream benefits" may be the reason to go this way, but that's a bit too hand-wavy argument for my taste ;)
Fair. Well, when you're comparing with Prometheus node_exporter, it helps quite a bit.
Here's
watchdogon an OpenWrt One, we got some (harmless) warnings and an emptypretimeout_governor:
Ah, thanks!
DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeleft_seconds) DEBUG: skipping metric: unsupported value 'null' (node_watchdog_timeleft_seconds) DEBUG: skipping metric: unsupported value 'null' (node_watchdog_pretimeout_seconds) DEBUG: skipping metric: unsupported value 'null' (node_watchdog_access_cs0) DEBUG: skipping metric: unsupported value 'null' (node_watchdog_access_cs0)
These should all be resolved with the latest push.
# HELP node_watchdog_info Info of /sys/class/watchdog/<watchdog> # TYPE node_watchdog_info gauge node_watchdog_info{name="watchdog0",options="0x8180",identity="GPIO Watchdog",state="active",status="0x8000",pretimeout_governor=""} 1
As should this, without things like pretimeout_governor="". Should be fixed.
The same fixes should also work on the Fritz. We now skip it if the path is unavailable.