modbus_exporter

Catch and expose well-defined error/information values

Open RichiH opened this issue 3 years ago • 4 comments

Some devices send special and locally-well-defined numeric values to signal state, e.g. "Resource busy".

The initial plan was to hide those values and print errors to STDOUT, see https://github.com/RichiH/modbus_exporter/pull/32

This is not ideal as it forces people to look at the logs, hides information from PromQL, and makes these situations generally invisible/harder to debug.

Current thinking is to introduce a gauge which exposes this value instead.

So a metric foo would dynamically get a companion gauge foo_caught = 1234, where 1234 is the magic value. While an OpenMetrics StateSet would allow for a better and cleaner mapping, it would multiply the number of metrics exposed.
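To make the proposal concrete, here is a minimal sketch of how one raw reading could be routed to either foo or foo_caught. This is not code from modbus_exporter: the Sample type, the expose function, and the special set of magic values are hypothetical names, and the sketch assumes the "foo goes away" option from the open questions below.

```go
package main

import "fmt"

// Sample is one exposed time series: a metric name and its value.
// Hypothetical type, used only for this illustration.
type Sample struct {
	Name  string
	Value float64
}

// expose decides what to publish for a single raw register reading.
// special is an assumed per-metric set of device-defined magic values.
func expose(name string, raw uint16, special map[uint16]bool) Sample {
	if special[raw] {
		// Magic value: publish it as <name>_caught and omit <name> itself.
		return Sample{Name: name + "_caught", Value: float64(raw)}
	}
	// Ordinary reading: publish it under the normal metric name.
	return Sample{Name: name, Value: float64(raw)}
}

func main() {
	special := map[uint16]bool{1234: true}
	for _, raw := range []uint16{215, 1234} {
		s := expose("foo", raw, special)
		fmt.Printf("%s %g\n", s.Name, s.Value)
	}
	// Prints:
	// foo 215
	// foo_caught 1234
}
```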

Two open questions:

  • Should foo go away, be set to zero, or retain its old value? Going away for the time being seems cleanest.
  • Should foo_caught go away once the special state is gone, be set to zero, or be set to zero and then go away? Going away seems cleanest yet again.

CC @SuperQ for thoughts.

RichiH avatar Aug 23 '22 13:08 RichiH

@bastischubert @DaAwesomeP @SuperQ thoughts?

RichiH avatar Mar 14 '23 12:03 RichiH

IMO (and I am definitely and absolutely not a definitive source, just my opinion):

  • foo should go away when there is not a current value to expose, so it should go away on error
  • However with foo_caught I am more of the opinion that there should be a 0 state to signify "OK" since it is very possible a user would forget to implement alerts or visualization if it is not present initially. I think also that "no errors caught/operating normally" is definitely a valid state separate from the actual value of the sensor/device, which would be in foo.
  • Maybe instead of calling it foo_caught (which implies that it should disappear if nothing is caught) it should instead be foo_error which can continue to exist to say "no current error."
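A rough sketch of the behaviour described in the three points above, in Go. All names here (Reading, classify, render, errorCodes) are made up for illustration and are not part of modbus_exporter. foo is dropped while an error code is being reported, and foo_error is always present with 0 meaning "no current error".

```go
package main

import "fmt"

// Reading is the interpreted result of one register read.
// Hypothetical type, used only for this illustration.
type Reading struct {
	Value    float64 // only meaningful when HasValue is true
	HasValue bool
	Error    float64 // 0 means "no current error"
}

// classify splits a raw register value into a normal value or an error code.
// errorCodes is an assumed per-metric set of device-defined magic values.
func classify(raw uint16, errorCodes map[uint16]bool) Reading {
	if errorCodes[raw] {
		return Reading{Error: float64(raw)}
	}
	return Reading{Value: float64(raw), HasValue: true}
}

// render returns the lines that would be exposed for one metric.
// <name> is omitted on error; <name>_error is always emitted.
func render(name string, r Reading) []string {
	var lines []string
	if r.HasValue {
		lines = append(lines, fmt.Sprintf("%s %g", name, r.Value))
	}
	lines = append(lines, fmt.Sprintf("%s_error %g", name, r.Error))
	return lines
}

func main() {
	errorCodes := map[uint16]bool{1234: true}
	for _, raw := range []uint16{215, 1234} {
		for _, line := range render("foo", classify(raw, errorCodes)) {
			fmt.Println(line)
		}
	}
	// Prints:
	// foo 215
	// foo_error 0
	// foo_error 1234
}
```

One caveat with this sketch: using 0 for "OK" only works if no device defines 0 itself as one of its magic values.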

Are there other exporters we can look at that have similar metrics or have solved a similar issue? Not saying we should blindly copy it but it would be good to see other examples.

I should also note that I don't think I have any devices on-hand that support these error codes.

DaAwesomeP avatar Mar 14 '23 13:03 DaAwesomeP

Signalling OK state would mean doubling the metric count. I am not against it, yet still apprehensive. Or maybe we could expose this only for metrics which have a special handler defined?

I don't believe there's precedent. The most complex exporter in this regard is snmp_exporter; it has various fixes for other protocol warts though.

RichiH avatar Mar 14 '23 13:03 RichiH

Signalling OK state would mean doubling the metric count. I am not against it, yet still apprehensive.

Yeah...understandable.

Or maybe we could expose this only for metrics which have a special handler defined?

Yeah, that makes sense; then you're only adding metrics where it would actually work. And I imagine that most devices don't support these exceptions for every single address anyway.
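As a sketch of that opt-in idea, the snippet below lists which series would exist when only some metric definitions carry a special handler. The MetricDef type, its SpecialValues field, and the _error suffix are assumptions for illustration; this is not the exporter's actual config schema.

```go
package main

import "fmt"

// MetricDef is a hypothetical metric definition. An empty SpecialValues
// slice means no special handler is defined for this metric.
type MetricDef struct {
	Name          string
	SpecialValues []uint16
}

// seriesNames lists the series that would be exposed: every metric,
// plus a companion <name>_error only where a handler is defined.
func seriesNames(defs []MetricDef) []string {
	var names []string
	for _, d := range defs {
		names = append(names, d.Name)
		if len(d.SpecialValues) > 0 {
			names = append(names, d.Name+"_error")
		}
	}
	return names
}

func main() {
	defs := []MetricDef{
		{Name: "foo", SpecialValues: []uint16{1234}},
		{Name: "bar"},
		{Name: "baz"},
	}
	fmt.Println(seriesNames(defs))
	// Prints: [foo foo_error bar baz]
}
```

With three metrics and one handler this yields four series rather than six, so the count only grows where the feature is actually used.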

In my case I am only monitoring simple I/O devices so I would almost never encounter this feature and would leave it off.

DaAwesomeP avatar Mar 14 '23 13:03 DaAwesomeP