Prometheus Metrics: `opendtu_last_update` wraps to 0 at approx. 50 days, while `opendtu_uptime` does not

Open easimon opened this issue 1 year ago • 2 comments

What happened?

Both opendtu_uptime and opendtu_last_update are counters in seconds, relative to the last reboot of the device. However, they appear to use different data types: opendtu_last_update wraps to zero at around 4.29 million seconds, while opendtu_uptime continues to increase beyond that.

When this happens, the difference opendtu_uptime - opendtu_last_update, which is the "number of seconds since the last update", is no longer correct.

Assumption: opendtu_last_update is stored in milliseconds internally, so the wrap point is 4294967295 milliseconds (max uint32). Divided by 1000 for the metric, that is about 4294967 seconds, or roughly 49.7 days.
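To make the arithmetic behind this assumption concrete, here is a small Python sketch. It only models a 32-bit unsigned millisecond counter; it is not taken from the OpenDTU source, and the helper name `wrapped_millis` is made up for illustration.

```python
# Model of a uint32 millisecond counter, as assumed in this report.
# Nothing here is copied from the OpenDTU firmware.
UINT32_MAX = 2**32 - 1          # 4294967295
MS_PER_S = 1000
S_PER_DAY = 86_400


def wrapped_millis(true_ms: int) -> int:
    """Value a uint32 counter would hold after true_ms milliseconds."""
    return true_ms % (UINT32_MAX + 1)


# Length of one full counter period, in seconds and in days.
wrap_seconds = (UINT32_MAX + 1) / MS_PER_S
wrap_days = wrap_seconds / S_PER_DAY

# Metric value in whole seconds just before and just after the wrap.
just_before = wrapped_millis(UINT32_MAX) // MS_PER_S
just_after = wrapped_millis(UINT32_MAX + 1) // MS_PER_S

print(wrap_seconds, round(wrap_days, 2), just_before, just_after)
```

The period comes out at a little under 50 days, which matches the observed behavior.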

To Reproduce Bug

  • reboot the device
  • wait 4294967295 milliseconds (around 50 days), never update or reboot
  • curl /api/prometheus/metrics
  • compare opendtu_last_update and opendtu_uptime. uptime is > 4294967 seconds, last update is close to 0

Expected Behavior

Both counters use the same numeric data type so they overflow at the same time. Alternative: add a gauge metric that emits the "seconds since last update" directly, so I do not have to compute it.
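For the alternative, a "seconds since last update" gauge could be computed in a wrap-safe way as long as both timestamps come from the same 32-bit millisecond counter. The sketch below is Python with a hypothetical function name; in C++ the same result falls out of plain `uint32_t` subtraction. It assumes the real gap is shorter than one counter period (about 49.7 days).

```python
# Wrap-safe elapsed time between two readings of a uint32 millisecond
# counter. Function and variable names are hypothetical.
UINT32_MODULUS = 2**32


def ms_since(now_ms: int, last_update_ms: int) -> int:
    """Elapsed milliseconds, correct even if the counter wrapped once
    between last_update_ms and now_ms."""
    return (now_ms - last_update_ms) % UINT32_MODULUS


# Example: last update shortly before the wrap, "now" 30 s later.
last_update_ms = 4294967000
now_ms = (last_update_ms + 30_000) % UINT32_MODULUS   # counter has wrapped
print(ms_since(now_ms, last_update_ms) / 1000)
```

Because the subtraction is done modulo 2^32, the gauge would stay correct across the wrap without widening either counter.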

Install Method

Pre-Compiled binary from GitHub

What git-hash/version of OpenDTU?

v24.5.6

Relevant log/trace output

No response

Anything else?

Side note: Over 50 days without a crash or reboot -- just this minor glitch. Solid software, good job 🚀

Please confirm the following

  • [X] I believe this issue is a bug that affects all users of OpenDTU, not something specific to my installation.
  • [X] I have already searched for relevant existing issues and discussions before opening this report.
  • [X] I have updated the title field above with a concise description.
  • [X] I have double checked that my inverter does not contain a W in the model name (like HMS-xxxW) as they are not supported

easimon avatar Aug 24 '24 15:08 easimon

I think the assumption that this is a 32-bit unsigned integer overflow is correct.

The Prometheus metrics are set in https://github.com/tbnobody/OpenDTU/blob/3dc70ab40aade8b7eb9ed9a1c6605ca326299d18/src/WebApi_prometheus.cpp#L77

getLastUpdate() returns _lastUpdate: https://github.com/tbnobody/OpenDTU/blob/3dc70ab40aade8b7eb9ed9a1c6605ca326299d18/lib/Hoymiles/src/parser/Parser.cpp#L13-L16

_lastUpdate is a uint32: https://github.com/tbnobody/OpenDTU/blob/3dc70ab40aade8b7eb9ed9a1c6605ca326299d18/lib/Hoymiles/src/parser/Parser.h#L30

morremeyer avatar Nov 12 '24 15:11 morremeyer

Seems dumb to ask, but this is still a problem in current builds, right?

I need to reboot every 50 days, to make Prometheus work again with queries like the following:

avg by (name, type) (opendtu_Irradiation{} unless on (name, serial) scalar(opendtu_uptime) - opendtu_last_update > 30)
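As a possible workaround on the consumer side, the wrap could be undone from the two exported values alone. The Python sketch below assumes that both metrics start at the same boot instant, that both are truncated to whole seconds, and that the data is never older than one wrap period; none of this is confirmed from the firmware. PromQL has a `%` operator, so similar arithmetic might be expressible in the query itself, but that is untested here.

```python
# Recover "seconds since last update" from opendtu_uptime (no wrap)
# and opendtu_last_update (assumed to wrap with a uint32 ms counter).
WRAP_MS = 2**32
SLACK_MS = 1000  # absorbs the truncation of both metrics to whole seconds


def seconds_since_update(uptime_s: int, last_update_s: int) -> float:
    diff_ms = (uptime_s - last_update_s) * 1000
    # Shift by the slack before the modulo so a sub-second age right
    # after a wrap does not come out as a full wrap period.
    age_ms = (diff_ms + SLACK_MS) % WRAP_MS - SLACK_MS
    return max(age_ms, 0) / 1000


# Before any wrap: plain difference.
print(seconds_since_update(1000, 970))
# After one wrap: true age is 30 s, result is accurate to within 1 s.
print(seconds_since_update(4295002, 5))
```

The result is only accurate to about one second, which should be enough for a staleness threshold like the 30 seconds used in the query above.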

towolf avatar Aug 10 '25 18:08 towolf