Catch and expose well-defined error/information values
Some devices send special and locally-well-defined numeric values to signal state, e.g. "Resource busy".
The initial plan was to hide those values and print errors to STDOUT, see https://github.com/RichiH/modbus_exporter/pull/32
This is not ideal as it forces people to look at the logs, hides information from PromQL, and makes these situations generally invisible/harder to debug.
Current thinking is to introduce a gauge which exposes this value instead.
So metric foo will dynamically get a foo_caught = 1234 where 1234 is the magic value. While OpenMetrics StateSet would allow for better and cleaner mapping, it would multiply the metrics exposed.
Two open questions:
- Should
foogo away, be set to zero, or retain its old value? Going away for the time being seems cleanest. - Should
foo_caughtgo away once the special state is gone, be set to zero, or be set to zero and then go away? Going away seems cleanest yet again.
CC @SuperQ for thoughts.
@bastischubert @DaAwesomeP @SuperQ thoughts?
IMO (and I am definitely and absolutely not a definitive source, just my opinion):
-
fooshould go away when there is not a current value to expose, so it should go away on error - However with
foo_caughtI am more of the opinion that there should be a0state to signify "OK" since it is very possible a user would forget to implement alerts or visualization if it is not present initially. I think also that "no errors caught/operating normally" is definitely a valid state separate from the actual value of the sensor/device, which would be infoo. - Maybe instead of calling it
foo_caught(which implies that it should disappear if nothing is caught) it should instead befoo_errorwhich can continue to exist to say "no current error."
Are there other exporters we can look at that have similar metrics or have solved a similar issue? Not saying we should blindly copy it but it would be good to see other examples.
I should also note that I don't think I have any devices on-hand that support these error codes.
Signalling OK state would mean doubling the metric count. I am not against it, yet still apprehensive. Or maybe we could expose this only for metrics which have a special handler defined?
I don't believe there's precedent. The most complex exporter in this regard is snmp_exporter; it has varies fixes for other protocol warts though.
Signalling OK state would mean doubling the metric count. I am not against it, yet still apprehensive.
Yeah...understandable.
Or maybe we could expose this only for metrics which have a special handler defined?
Yeah, that makes sense; then you're only adding metrics where it would actually work. And I imagine that most devices don't support this exceptions of every single address anyway.
In my case I am only monitoring simple I/O devices so I would almost never encounter this feature and would leave it off.