oteps Semantic conventions for Uptime Monitoring

A proposal and guide.

Nov 10 '21 18:11 jsuereth

I'd like to consider an alternative not mentioned in this document, and I'm not sure where to propose it.

Instead of two metrics, "health" and "uptime", I propose a single non-monotonic Sum named "alive" with value 1. This data type requires that a start time be included with the measurement, unlike Gauge. The difference between the start time and the measurement time is the process uptime.
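To make the arithmetic concrete, here is a minimal sketch in Python. It does not use any OpenTelemetry SDK; `AliveDataPoint` and `uptime_seconds` are hypothetical names for illustration, and the two timestamp fields are only assumed to mirror the start and measurement times that a Sum data point carries.

```python
from dataclasses import dataclass

NANOS_PER_SECOND = 1_000_000_000


@dataclass(frozen=True)
class AliveDataPoint:
    """Hypothetical stand-in for one non-monotonic Sum data point named "alive"."""

    start_time_unix_nano: int  # when the process (and therefore the series) started
    time_unix_nano: int        # when this measurement was taken
    value: int = 1             # always 1 while the process is alive


def uptime_seconds(point: AliveDataPoint) -> float:
    """Uptime is the measurement time minus the start time."""
    return (point.time_unix_nano - point.start_time_unix_nano) / NANOS_PER_SECOND


# Example: a process that started 90 seconds before the measurement.
point = AliveDataPoint(
    start_time_unix_nano=1_636_000_000 * NANOS_PER_SECOND,
    time_unix_nano=1_636_000_090 * NANOS_PER_SECOND,
)
print(uptime_seconds(point))  # 90.0
```

Because the start time travels with every point, no separate "uptime" series is needed in this model.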

I have made this proposal already in connection with https://github.com/open-telemetry/opentelemetry-specification/issues/1078, where I pointed out that we can implement service discovery in a push-based metrics system by joining this "alive" metric with information retrieved by service discovery.
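As a rough illustration of that join (a sketch only, with hypothetical instance names and a hypothetical `missing_targets` helper, not taken from the linked issue): service discovery supplies the set of instances that should exist, the pushed "alive" points supply the set that actually reported, and the difference is what an alert would fire on.

```python
from typing import Dict, Set


def missing_targets(expected: Set[str], alive_points: Dict[str, int]) -> Set[str]:
    """Return expected instances that did not report an "alive" point with value 1.

    expected:     instance ids obtained from service discovery (assumed input).
    alive_points: latest "alive" value per instance id, as pushed by each process.
    """
    reporting = {instance for instance, value in alive_points.items() if value == 1}
    return expected - reporting


# Example: three instances are expected, but only two pushed "alive".
expected = {"checkout-0", "checkout-1", "checkout-2"}
alive = {"checkout-0": 1, "checkout-2": 1}
print(sorted(missing_targets(expected, alive)))  # ['checkout-1']
```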

Nov 15 '21 20:11 jmacd

@jmacd Commented offline, but recording here for posterity.

> Instead of two metrics, "health" and "uptime", I propose a single non-monotonic Sum named "alive" with value 1. This data type requires that a start time be included with the measurement, unlike Gauge. The difference between the start time and the measurement time is the process uptime.

From a pure collection standpoint, I like a lot of what this brings. However, I think we need to take an end-to-end focus. Specifically: "Can I write a query / dashboard / alert to solve the stated use cases?"

AFAICT, with known backends/query languages (Prometheus, Graphite, etc.) it's hard to pull the data back out, specifically the "Seconds since start" value in PromQL. We should make sure we have an answer to that.
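A sketch of the concern, in Python with hypothetical names: a backend that keeps only (timestamp, value) samples drops the start time, so "seconds since start" cannot be recovered from the "alive" value alone. One possible answer, which is an assumption here and not part of the proposal, is for the exporter to also emit the start time as its own value, in the spirit of the `process_start_time_seconds` convention in Prometheus.

```python
from typing import Dict, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def flatten(start_unix_s: float, time_unix_s: float, value: float) -> Sample:
    """Mimic a backend that stores only (timestamp, value); the start time is lost."""
    return (time_unix_s, value)


def export_with_start_time(
    start_unix_s: float, time_unix_s: float, value: float
) -> Dict[str, Sample]:
    """Hypothetical exporter: emit the start time as a second series."""
    return {
        "alive": (time_unix_s, value),
        # The series name below is made up for this sketch.
        "alive_start_time_seconds": (time_unix_s, start_unix_s),
    }


def seconds_since_start(samples: Dict[str, Sample], now_unix_s: float) -> float:
    """Query side: subtract the exported start time from the current time."""
    _, start_unix_s = samples["alive_start_time_seconds"]
    return now_unix_s - start_unix_s


samples = export_with_start_time(start_unix_s=1000.0, time_unix_s=1090.0, value=1.0)
print(flatten(1000.0, 1090.0, 1.0))          # (1090.0, 1.0), no start time left
print(seconds_since_start(samples, 1100.0))  # 100.0
```

With a series like that available, the PromQL side would look like the familiar `time() - process_start_time_seconds` idiom; whether an exporter should perform this translation is exactly the open question.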

Nov 16 '21 17:11 jsuereth

@jsuereth how important/relevant is this OTEP? Please assign an appropriate priority, or close if it's old and we no longer need it.

Jan 30 '23 17:01 tedsuo

What is the state of this? It is still not clear to me how to implement this in otel. I suppose uptime is OK, but defining the health metric as 1|0 makes it not so useful. Should I then just report uptime for both, and only update health if the checks succeed?

Isn't this a common use case that most services would need in some way? Or are people just relying directly on Kubernetes checks instead? I understand that metrics such as ops/sec are much better, but not all services are doing work all the time, so this is much needed for those.

I had opened an issue on this but closed it, expecting this might progress: https://github.com/open-telemetry/opentelemetry-specification/issues/2923

Apr 20 '23 13:04 tomasmota

I'm also curious about the state of this proposal, since I have the same use case as described in https://github.com/open-telemetry/opentelemetry-specification/issues/2923

Jun 13 '23 15:06 erasmas

@jsuereth is this stale, or is semconv currently working on this?

Jul 31 '23 16:07 tedsuo

I would also be interested in this. A generic up metric for building generic uptime alerting would be awesome, especially if it were part of the standard itself and, e.g., integrated into the OpenTelemetry Collector.

Sep 29 '23 09:09 Manuelraa
