Semantic conventions for Uptime Monitoring

Open jsuereth opened this issue 3 years ago • 7 comments

A proposal and guide.

jsuereth avatar Nov 10 '21 18:11 jsuereth

I'd like to consider an alternative not mentioned in this document, and I'm not sure where to propose it.

Instead of two metrics, "health" and "uptime", I propose a single non-monotonic Sum named "alive" with value 1. This data type requires that a start time be included with the measurement, unlike Gauge. The difference between the start time and the measurement time is the process uptime.
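As a rough sketch of the idea, the arithmetic can be shown on a single data point. This does not use the OpenTelemetry SDK; `AlivePoint` and `uptime_seconds` are hypothetical names chosen for illustration, and the two timestamp fields only mirror the start time and measurement time that a Sum point carries.

```python
from dataclasses import dataclass

NANOS_PER_SECOND = 1_000_000_000


@dataclass(frozen=True)
class AlivePoint:
    """Hypothetical stand-in for one point of a non-monotonic Sum named "alive"."""

    start_time_unix_nano: int  # when the process started (assumed semantics)
    time_unix_nano: int        # when this measurement was taken
    value: int = 1             # constant 1 while the process is running


def uptime_seconds(point: AlivePoint) -> float:
    """Uptime is the measurement time minus the start time."""
    return (point.time_unix_nano - point.start_time_unix_nano) / NANOS_PER_SECOND


# Example: a process started at t = 1_636_570_000 s and measured 90 s later.
point = AlivePoint(
    start_time_unix_nano=1_636_570_000 * NANOS_PER_SECOND,
    time_unix_nano=1_636_570_090 * NANOS_PER_SECOND,
)
print(uptime_seconds(point))  # 90.0
```

The point value itself stays at 1; the uptime comes only from the two timestamps.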

I have made this proposal already in connection with https://github.com/open-telemetry/opentelemetry-specification/issues/1078, where I pointed out that we can implement service discovery in a push-based metrics system by joining this "alive" metric with information retrieved by service discovery.
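A minimal sketch of that join, assuming service discovery yields a set of expected instance names and the metrics backend yields the latest "alive" value per instance. The function name `missing_targets` and the instance names are made up for illustration.

```python
def missing_targets(discovered: set[str], alive: dict[str, int]) -> set[str]:
    """Return discovered instances that are not currently reporting alive == 1."""
    reporting = {name for name, value in alive.items() if value == 1}
    return discovered - reporting


# Hypothetical inputs: three instances are expected, two pushed an "alive" point.
discovered = {"checkout-0", "checkout-1", "checkout-2"}
alive = {"checkout-0": 1, "checkout-2": 1}

print(missing_targets(discovered, alive))  # {'checkout-1'}
```

An instance that is expected but has pushed no "alive" point shows up in the result, which is the signal a pull-based system would get from a failed scrape.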

jmacd avatar Nov 15 '21 20:11 jmacd

@jmacd Commented offline, but recording here for posterity.

> Instead of two metrics, "health" and "uptime", I propose a single non-monotonic Sum named "alive" with value 1. This data type requires that a start time be included with the measurement, unlike Gauge. The difference between the start time and the measurement time is the process uptime.

From a pure collection standpoint, I like a lot of what this brings; however, I think we need to take an end-to-end focus. Specifically: "Can I write a query / dashboard / alert to solve the stated use cases?"

AFAICT, with known backends/query languages (Prometheus, Graphite, etc.) it's hard to pull the data back out, specifically the "Seconds since start" value in PromQL. We should make sure we have an answer to that.

jsuereth avatar Nov 16 '21 17:11 jsuereth

@jsuereth how important/relevant is this OTEP? Please assign an appropriate priority, or close if it's old and we no longer need it.

tedsuo avatar Jan 30 '23 17:01 tedsuo

What is the state of this? It is still not clear to me how to implement this in otel. I suppose uptime is ok, but the health metric as 1|0 makes it not so useful. Should I then just do uptime for both, and only update health if the checks succeed?

Is it not a common use case that most services would need this in some way? Or are people just relying directly on Kubernetes checks instead? I understand that metrics such as ops/sec are much better, but not all services are doing work all the time, so this is much needed for those.

I had made an issue on this but closed it expecting this might progress. https://github.com/open-telemetry/opentelemetry-specification/issues/2923

tomasmota avatar Apr 20 '23 13:04 tomasmota

I'm also curious about the state of this proposal, since I have the same use case as described in https://github.com/open-telemetry/opentelemetry-specification/issues/2923

erasmas avatar Jun 13 '23 15:06 erasmas

@jsuereth is this stale, or is semconv currently working on this?

tedsuo avatar Jul 31 '23 16:07 tedsuo

I would also be interested in this. A generic up metric for building generic uptime alerting would be awesome, especially if it were part of the standard itself and, e.g., integrated into the OpenTelemetry Collector.

Manuelraa avatar Sep 29 '23 09:09 Manuelraa