OpenMetrics/Prometheus support
While SNMP support is nice, it's complicated and a bit old-school for most users.
The OpenMetrics/Prometheus protocol is simple and easy to implement when you already have an HTTP server. It's self-describing, so no MIBs are required.
Would this be something the project would like to have?
Thanks for the suggestion. Quick note: Prometheus/OpenMetrics is a text exposition format over HTTP, not a new protocol.
From an end-user perspective SNMP isn’t complicated. You can script snmpget/snmpwalk or point Telegraf at it and you’re done. The hard part was implementing the SNMP agent on the embedded side, which I already did.
I’m open to adding an optional /metrics endpoint alongside SNMP, just not immediately. I’m overloaded with ENERGIS finalization tasks and also ramped up PDNode this week as promised in an earlier issue. Once things settle, I can start with a minimal set and expand if needed.
It would be really helpful if you could list the exact metrics you want scraped (names, labels, units). If you have a preferred structure, share it and I’ll align. PRs are welcome too.
Btw, this was freshly uploaded. Careful, it’s still hot :D You can check what automation/control possibilities are available: https://dvidmakesthings.github.io/HW_10-In-Rack_PDU/Manuals/AUTOMATION_MANUAL.pdf
I will see if I can go over the available data and see what makes sense to include in a metrics endpoint.
I'm not much of a C dev; the last time I did serious work in C was in the '90s. I spend most of my time writing Go. But if I find some time, I can try to start plumbing in a metrics endpoint.
Would you prefer a hand-rolled endpoint, or maybe a library?
> You can script snmpget/snmpwalk or point Telegraf at it and you’re done
I disagree with "and you're done", since having data and understanding the data is very complicated in SNMP. I say this as someone who does a lot of end-user SNMP support in monitoring. (I also maintain the SNMP library that Telegraf uses)
A hand‑rolled /metrics endpoint should be pretty straightforward to add alongside the existing HTML. If you have a preferred structure or metric set in mind, feel free to share it and I’ll align.
@SuperQ I've made a preview of the hand-rolled /metrics endpoint. I’ve uploaded a small README section and a sample output metrics.txt.
It includes system info, calibrated temp + rails, and per-channel state/voltage/current/power.
Is this something that you imagined?
Oh, nice. Thanks, I'll see if I can build on it.
A couple of nits:
This should end in _total to follow naming best practices:
# HELP energis_uptime_seconds System uptime in seconds.
# TYPE energis_uptime_seconds counter
energis_uptime_seconds 27
Normally I would recommend energis_boot_time_seconds as a UNIX epoch timestamp. But I guess this could be tricky given it seems like there's no RTC battery. At least not that I can see from the schematic.
This shouldn't be necessary. If the metrics endpoint fails to fetch data, it should return a 5xx error.
# HELP energis_up 1 if the metrics handler is healthy.
# TYPE energis_up gauge
energis_up 1
Is it possible to also expose watt-hours / joules as a counter? This is better for computing watts since any small peaks between scrapes will be fully measured.
# HELP energis_channel_power_watts Active power per channel.
# TYPE energis_channel_power_watts gauge
energis_channel_power_watts{ch="1"} 0.000
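To illustrate the energy-counter idea, here is a minimal sketch of how firmware could integrate power samples into a monotonically increasing total. It is not taken from the ENERGIS code: the type `energy_acc_t`, the functions `energy_acc_update` and `energy_acc_watt_hours`, and the milliwatt/millisecond units are all assumptions made for the example. Because the device accumulates at its own sampling rate, short peaks between scrapes still land in the total.

```c
#include <stdint.h>

/* Hypothetical per-channel accumulator (not from the ENERGIS firmware). */
typedef struct {
    uint64_t energy_uj; /* accumulated energy in microjoules (mW * ms) */
    uint32_t last_ms;   /* timestamp of the previous sample, in ms */
    uint32_t last_mw;   /* power at the previous sample, in mW */
    int      primed;    /* 0 until the first sample has been seen */
} energy_acc_t;

/* Add the energy used since the previous sample (trapezoidal rule). */
void energy_acc_update(energy_acc_t *acc, uint32_t now_ms, uint32_t power_mw)
{
    if (acc->primed) {
        /* Unsigned subtraction stays correct across tick-counter wrap. */
        uint32_t dt_ms = now_ms - acc->last_ms;
        uint64_t avg_mw = ((uint64_t)acc->last_mw + power_mw) / 2u;
        acc->energy_uj += avg_mw * dt_ms;
    }
    acc->last_ms = now_ms;
    acc->last_mw = power_mw;
    acc->primed = 1;
}

/* Convert the running total to watt-hours: 1 Wh = 3600 J = 3.6e9 uJ. */
double energy_acc_watt_hours(const energy_acc_t *acc)
{
    return (double)acc->energy_uj / 3.6e9;
}
```

On the Prometheus side, average watts over a window can then be derived from the counter with `rate()`, scaled by 3600 if the counter is in watt-hours.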
> Values are served from cached snapshots owned by MeterTask
Prometheus typically recommends against this; you want to serve the freshest data possible. I can understand that I/O locking may be an issue, so it's probably OK for some of this data, as long as it's updated frequently (1s is probably a good cache TTL). The typical configuration for Prometheus is to scrape every 15s.
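One possible way to keep a cached snapshot honest is to timestamp it and have the handler refuse anything older than a bound. The sketch below assumes a hypothetical `meter_snapshot_t` with an `updated_ms` field and a caller-supplied maximum age; none of these names come from the firmware, and the right bound depends on how fast the meter task can actually refresh.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot layout; only the timestamp matters here. */
typedef struct {
    uint32_t updated_ms;  /* tick count when the snapshot was written */
    float    power_w[8];  /* assumed: one reading per channel */
} meter_snapshot_t;

/* True if the snapshot is no older than max_age_ms.
 * Unsigned subtraction keeps the comparison valid across tick wrap. */
bool snapshot_is_fresh(const meter_snapshot_t *s, uint32_t now_ms,
                       uint32_t max_age_ms)
{
    return (uint32_t)(now_ms - s->updated_ms) <= max_age_ms;
}

/* HTTP status the handler could return: 200 if fresh, 503 if stale. */
int snapshot_http_status(const meter_snapshot_t *s, uint32_t now_ms,
                         uint32_t max_age_ms)
{
    return snapshot_is_fresh(s, now_ms, max_age_ms) ? 200 : 503;
}
```

A stale snapshot then surfaces as a failed scrape instead of silently repeating old values.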
Got your notes, thanks.
- I renamed energis_uptime_seconds to energis_uptime_seconds_total.
- No *_boot_time_seconds here; there’s no RTC, so an epoch would be made up. I haven't even seen any PDU with an RTC.
- I’m keeping energis_up; it doesn't hurt if it's there. Someone might find it useful.
- The handler now fails hard: if render blows the buffer, it returns 503 so Prometheus sees a real error.
- Freshness: Ehh, that's a tough one. The endpoint serves the latest cached snapshot, since no blocking I/O is allowed in the handler. System monitors update every ~200 ms, but power is limited by the HLW8032. The only read speed with this IC is 4800 baud, so a full 8-channel sweep takes ~1.5 s.
- I also added energis_channel_energy_watt_hours_total.
Pushed the changes
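For readers following along, the "fails hard" behaviour could look roughly like the sketch below: render into a fixed buffer, detect truncation, and report 503 instead of serving a partial body. This is an illustration only, not the pushed code. `metrics_buf_t`, `metrics_appendf`, `metrics_render`, and the HELP text for the energy counter are made up for the example; the metric names and the `ch` label come from the thread.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical append-only text buffer with a sticky overflow flag. */
typedef struct {
    char  *buf;
    size_t cap;
    size_t len;
    int    overflow;
} metrics_buf_t;

/* Append formatted text; set the overflow flag if it would not fit. */
static void metrics_appendf(metrics_buf_t *m, const char *fmt, ...)
{
    if (m->overflow) {
        return;
    }
    size_t room = m->cap - m->len;
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(m->buf + m->len, room, fmt, ap);
    va_end(ap);
    /* vsnprintf returns the length it wanted; >= room means truncation. */
    if (n < 0 || (size_t)n >= room) {
        m->overflow = 1;
        return;
    }
    m->len += (size_t)n;
}

/* Render the exposition text; returns the HTTP status to send. */
int metrics_render(char *buf, size_t cap, unsigned long uptime_s,
                   const double *energy_wh, int channels)
{
    metrics_buf_t m = { buf, cap, 0, 0 };

    metrics_appendf(&m, "# HELP energis_uptime_seconds_total System uptime in seconds.\n");
    metrics_appendf(&m, "# TYPE energis_uptime_seconds_total counter\n");
    metrics_appendf(&m, "energis_uptime_seconds_total %lu\n", uptime_s);

    /* HELP wording below is assumed, not copied from the firmware. */
    metrics_appendf(&m, "# HELP energis_channel_energy_watt_hours_total Accumulated energy per channel in watt-hours.\n");
    metrics_appendf(&m, "# TYPE energis_channel_energy_watt_hours_total counter\n");
    for (int i = 0; i < channels; i++) {
        metrics_appendf(&m, "energis_channel_energy_watt_hours_total{ch=\"%d\"} %.3f\n",
                        i + 1, energy_wh[i]);
    }

    /* Never serve a truncated body: report 503 so the scrape fails. */
    return m.overflow ? 503 : 200;
}
```

The sticky flag means later appends become no-ops after the first truncation, so the caller only has to check the status once at the end.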