opentelemetry-specification Support for "Gauge histogram" data type, instrumentation

Open jmacd opened this issue 1 year ago • 4 comments

What are you trying to achieve?

OpenTelemetry has not incorporated the Gauge histogram instrument. Several issues have been filed at various times in the otel-proto repository about this, but nothing in the specification repo.

For example:

https://github.com/open-telemetry/opentelemetry-proto/pull/236
https://github.com/open-telemetry/opentelemetry-proto/issues/274
https://github.com/open-telemetry/opentelemetry-proto/issues/308

What did you expect to see?

To add this to the OTel API/SDK and data model, some equivalencies will have to be drawn. It should be possible to replace async Gauge instrumentation, where every observation has a unique attribute set, with a Gauge Histogram aggregator that counts the number of appearances of each value (after erasing some attributes). In this sense, it is probably not necessary to implement new instruments for Gauge histogram; we can just provide new ways of aggregating Gauges to obtain the Gauge histogram.

Additional context.

https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#gaugehistogram-1

Aug 05 '22 16:08 jmacd

Do you have a concrete example of a type of instrumentation that would benefit from this type of aggregation?

where every observation has a unique attribute set, with a Gauge Histogram aggregator that counts the number of appearances of each value (after erasing some attributes).

Are you imagining aggregating gauge measurements to the existing explicit / exponential histogram data point (with potential modification), or introducing a new data point type?

Aug 05 '22 18:08 jack-berg

Do you have a concrete example [...]?

Sure. I filed this in connection with https://github.com/open-telemetry/opentelemetry-go-contrib/issues/2624, where one appears. The Go runtime/metrics package produces one Gauge Histogram,

/sched/latencies:seconds
	Distribution of the time goroutines have spent in the scheduler
	in a runnable state before actually running.

It's not immediately clear from the documentation that this is a gauge histogram, but it reports as Float64Histogram (its value type) and Cumulative==False (i.e., not a counter).

The way you compute this, in pseudocode,

schedLatency := meter.newGaugeHistogramInstrument()

for _, runnableGoroutine := range AllRunnableGoroutines {
  schedLatency.Observe(runnableGoroutine, runnableGoroutine.elapsedWaitTimeSeconds())
}

Logically speaking, I believe you could instrument the same using a Gauge with Histogram aggregation, however, ...

Are you imagining aggregating gauge measurements to the existing explicit / exponential histogram data point (with potential modification), or introducing a new data point type?

This is the question. The data structures for explicit / exponential histogram work, but the temporality setting does not (related: https://github.com/open-telemetry/opentelemetry-proto/issues/274). If we replace the current temporality field in a backwards-compatible way, then we can express non-temporal histogram points (i.e., gauge histograms). I don't want to introduce new types.

I can imagine a new enum that applies only to histograms and expresses either a temporality or the lack thereof.

enum HistogramTemporality {
  Cumulative // i.e., a "Counter" histogram
  Delta // i.e., a "Counter" histogram
  Gauge // i.e., a "Gauge" histogram
}

This can be done in a wire-compatible way for protobuf encodings. It would break JSON, likely.

To see how this could equivalently be instrumented using Gauge instruments, consider a kind of "ephemeral" attribute that we'll call nonce. I will use a float64 NaN value as the value of this attribute, so by IEEE logic there can never be two attributes that are equal. Thus, every Gauge observation is unique. Now, configure a view to use Histogram aggregation and erase the nonce attribute. The result should be a histogram with one count per gauge observation and HistogramTemporality (as defined above) equal to "Gauge".

Aug 05 '22 18:08 jmacd

I have a few clarifying questions/assumptions:

Does the GaugeHistogram look almost the same as a Histogram with Cumulative temporality?
- The count for each bucket represents the total number of elements in this bucket at this time
- The difference is the same as between cumulative Counters and Gauges outside of the histogram context (Counters are additive, Gauges are not)

The goroutine example aims at measuring "how many goroutines are waiting to be run, separated into buckets by how long they have been waiting already", is that right? And the runtime already returns a histogram-like object, so this is also related to https://github.com/open-telemetry/opentelemetry-specification/issues/2713?

I think you pointed it out in the last paragraph, but this would also be representable as a set of gauges, each of them with an attribute like le=1 that has the number of (in this example) threads that have been waiting for less than 1s. The additional value of the GaugeHistogram is that it condenses this data and packs it into one metric, did I get that right?

Aug 08 '22 10:08 pirgeo

I have a few clarifying questions/assumptions:

Does the GaugeHistogram look almost the same as a Histogram with Cumulative temporality?

The count for each bucket represents the total number of elements in this bucket at this time

The difference is the same as between cumulative Counters and Gauges outside of the histogram context (Counters are additive, Gauges are not)

The goroutine example aims at measuring "how many goroutines are waiting to be run, separated into buckets by how long they have been waiting already", is that right? And the runtime already returns a histogram-like object, so this is also related to Support for asynchronous Histogram instrument #2713?

I think you pointed it out in the last paragraph, but this would also be representable as a set of gauges, each of them with an attribute like le=1 that has the number of (in this example) threads that have been waiting for less than 1s. The additional value of the GaugeHistogram is that it condenses this data and packs it into one metric, did I get that right?

And in addition, a gauge histogram describes the distribution of what is currently running, existing, available, etc., so it makes less sense to sum up the values from different times. The default aggregation seems to be "Last Value".
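A small sketch of what that would mean when combining two points from the same stream over time. The Go names (HistogramTemporality, point, combine) are hypothetical and only mirror the enum proposed earlier in this thread; they are not an existing SDK or proto API. The sketch assumes both points share the same bucket boundaries.

```go
package main

import "fmt"

// HistogramTemporality mirrors the enum proposed in this thread (hypothetical).
type HistogramTemporality int

const (
	Cumulative HistogramTemporality = iota // a "Counter" histogram
	Delta                                  // a "Counter" histogram
	Gauge                                  // a "Gauge" histogram
)

// point is a hypothetical histogram data point with fixed bucket boundaries.
type point struct {
	Temporality HistogramTemporality
	Counts      []uint64
}

// combine folds a newer point into an older one from the same stream.
// Delta points are summed bucket by bucket. Cumulative and Gauge points
// keep only the newer value; for Gauge this is the "Last Value" behavior,
// since adding snapshots taken at different times has no meaning.
func combine(older, newer point) point {
	if newer.Temporality != Delta {
		return newer
	}
	sum := make([]uint64, len(newer.Counts))
	for i := range sum {
		sum[i] = older.Counts[i] + newer.Counts[i]
	}
	return point{Temporality: Delta, Counts: sum}
}

func main() {
	d := combine(point{Delta, []uint64{1, 2}}, point{Delta, []uint64{3, 4}})
	g := combine(point{Gauge, []uint64{5, 0}}, point{Gauge, []uint64{1, 1}})
	fmt.Println(d.Counts, g.Counts) // [4 6] [1 1]
}
```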

I can imagine a new enum that applies only to histograms and expresses either a temporality or the lack thereof.

enum HistogramTemporality {
  Cumulative // i.e., a "Counter" histogram
  Delta // i.e., a "Counter" histogram
  Gauge // i.e., a "Gauge" histogram
}

I like this idea 👍

I think our existing 6 instruments are already requiring many users to shift their minds, and I've seen folks making mistakes while choosing the correct instrument. That's why we have https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#instrument-selection and it's not that simple. I feel that we should really keep a high bar and try not to add another one, unless the value added is much bigger than the newly introduced learning overhead.

Aug 12 '22 15:08 reyang
