Prometheus Metrics: `opendtu_last_update` wraps to 0 at approx. 50 days, while `opendtu_uptime` does not
What happened?
Both opendtu_uptime and opendtu_last_update are counters in seconds, relative to the last reboot of the device.
But they seem to use different data types: opendtu_last_update wraps to zero at around 4.29 million seconds, while opendtu_uptime continues to increase beyond that.
When this happens, the difference opendtu_uptime - opendtu_last_update, which is the "number of seconds since the last update", is no longer correct.
Assumption: opendtu_last_update is stored in milliseconds internally, so the wrap point is 4294967295 milliseconds (max uint32), which is about 4294967 seconds after dividing by 1000.
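To make the arithmetic behind this assumption concrete, here is a minimal C++ sketch. The helper name `msToSeconds` and the constant `kMaxMs` are made up for illustration and are not taken from the OpenDTU sources; the only assumption is a 32-bit unsigned millisecond counter.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not from OpenDTU): convert a 32-bit millisecond
// tick count to whole seconds.
constexpr std::uint32_t msToSeconds(std::uint32_t ms) { return ms / 1000; }

// Largest value a 32-bit millisecond counter can hold: 4294967295.
constexpr std::uint32_t kMaxMs = std::numeric_limits<std::uint32_t>::max();

// The counter tops out at 4294967 whole seconds ...
static_assert(msToSeconds(kMaxMs) == 4294967, "about 4.29 million seconds");

// ... which is 49 full days (roughly 49.7 days) ...
static_assert(msToSeconds(kMaxMs) / 86400 == 49, "49 full days");

// ... and one more millisecond wraps the 32-bit value back to 0.
static_assert(static_cast<std::uint32_t>(kMaxMs + 1u) == 0, "wraps to zero");
```

All three checks are evaluated at compile time, so the file compiling is the confirmation that the numbers in this report line up with a uint32 millisecond counter.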
To Reproduce Bug
- reboot the device
- wait at least 4294967295 milliseconds (about 49.7 days) without updating or rebooting
- curl /api/prometheus/metrics
- compare opendtu_last_update and opendtu_uptime: uptime is > 4294967 seconds, while last update is close to 0
Expected Behavior
Both counters use the same numeric data type so they overflow at the same time. Alternative: add a gauge metric that emits the "seconds since last update" directly, so I do not have to compute it.
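As a rough sketch of the gauge alternative: if both timestamps come from the same 32-bit millisecond tick counter (an assumption, as are the function names `elapsedMs` and `secondsSinceLastUpdate`, which do not exist in OpenDTU), unsigned subtraction gives the elapsed time correctly even across a wrap, as long as the real gap is shorter than one full wrap period of about 49.7 days.

```cpp
#include <cstdint>

// Hypothetical helper (not from OpenDTU): milliseconds elapsed between two
// readings of the same 32-bit millisecond tick counter. Unsigned arithmetic
// is modulo 2^32, so the result stays correct across a single wrap.
constexpr std::uint32_t elapsedMs(std::uint32_t nowMs, std::uint32_t lastUpdateMs) {
    return static_cast<std::uint32_t>(nowMs - lastUpdateMs);
}

// Hypothetical gauge value: whole seconds since the last update.
constexpr std::uint32_t secondsSinceLastUpdate(std::uint32_t nowMs,
                                               std::uint32_t lastUpdateMs) {
    return elapsedMs(nowMs, lastUpdateMs) / 1000;
}

// No wrap: last update at 10 s, now at 40 s.
static_assert(secondsSinceLastUpdate(40000u, 10000u) == 30, "plain case");

// Across a wrap: last update 1000 ms before the counter wrapped,
// now 5000 ms after it wrapped, so 6000 ms have passed.
static_assert(secondsSinceLastUpdate(5000u, 4294966296u) == 6, "wrap case");
```

Subtracting first and dividing afterwards matters here: dividing each timestamp by 1000 before subtracting loses the modulo property and reproduces the jump described in this report.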
Install Method
Pre-Compiled binary from GitHub
What git-hash/version of OpenDTU?
v24.5.6
Relevant log/trace output
No response
Anything else?
Side note: Over 50 days without a crash or reboot -- just this minor glitch. Solid software, good job 🚀
Please confirm the following
- [X] I believe this issue is a bug that affects all users of OpenDTU, not something specific to my installation.
- [X] I have already searched for relevant existing issues and discussions before opening this report.
- [X] I have updated the title field above with a concise description.
- [X] I have double checked that my inverter does not contain a W in the model name (like HMS-xxxW) as they are not supported
I think the assumption that this is a 32-bit unsigned integer overflow is correct.
The prometheus metrics are set in https://github.com/tbnobody/OpenDTU/blob/3dc70ab40aade8b7eb9ed9a1c6605ca326299d18/src/WebApi_prometheus.cpp#L77
getLastUpdate() returns _lastUpdate: https://github.com/tbnobody/OpenDTU/blob/3dc70ab40aade8b7eb9ed9a1c6605ca326299d18/lib/Hoymiles/src/parser/Parser.cpp#L13-L16
_lastUpdate is a uint32: https://github.com/tbnobody/OpenDTU/blob/3dc70ab40aade8b7eb9ed9a1c6605ca326299d18/lib/Hoymiles/src/parser/Parser.h#L30
Seems dumb to ask, but this is still a problem in current builds, right?
I need to reboot every 50 days to make Prometheus queries like the following work again:
avg by (name, type) (opendtu_Irradiation{} unless on (name, serial) scalar(opendtu_uptime) - opendtu_last_update > 30)