Specify how to stop and restart metrics reporting

Open jmacd opened this issue 2 years ago • 1 comments

What are you trying to achieve?

Trying to address
https://github.com/open-telemetry/opentelemetry-js/issues/2997 and https://github.com/open-telemetry/opentelemetry-specification/issues/1891

There are situations when a metrics SDK wants to stop reporting data for a particular instrument and attribute set. This comes about differently for asynchronous/synchronous instruments, depending on cardinality choice.

The situation can arise in every combination of sync/async and delta/cumulative. Safely stopping metrics reporting requires attention to what information is lost to the consumer, especially where the loss may lead to inaccurate rate calculations.

For example:

  • synchronous/cumulative: cumulative temporality implies long-term memory use, so the SDK needs to stop reporting once memory use grows too large; the question is what start time to use when restarting, and what that says about the rate in the interim period
  • synchronous/delta: this case is always safe. SDKs are free to stop reporting an instrument/attribute pair when it has not been used during a collection cycle; this leaves a gap, which is considered normal for delta temporality
  • asynchronous/cumulative: in this case, it is safe to stop reporting the instrument/attribute pair, since the user is free to simply not observe the attributes. This will leave a gap in the record, which is not considered good practice for cumulative reporting
  • asynchronous/delta: in this case, it is safe to stop reporting the instrument/attribute pair, but restarting the same instrument/attribute pair is complicated for the same reason as synchronous/cumulative.
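
To illustrate the synchronous/delta case from the list above, here is a minimal Python sketch. It is not taken from any OpenTelemetry SDK; `SyncDeltaSum` and its methods are hypothetical names, and attribute sets are assumed to be hashable tuples.

```python
from typing import Dict, Tuple

# Hypothetical representation of an attribute set: a tuple of (key, value) pairs.
Attrs = Tuple[Tuple[str, str], ...]


class SyncDeltaSum:
    """Hypothetical delta sum aggregator for a synchronous counter."""

    def __init__(self) -> None:
        self._current: Dict[Attrs, float] = {}

    def add(self, attrs: Attrs, value: float) -> None:
        # Accumulate only within the current collection interval.
        self._current[attrs] = self._current.get(attrs, 0.0) + value

    def collect(self) -> Dict[Attrs, float]:
        # Hand out the interval's points and start from empty state.
        # Attribute sets that were not used simply do not appear in the
        # next collection, which leaves a gap and frees their memory.
        points, self._current = self._current, {}
        return points
```

Because nothing is carried across intervals, an unused attribute set costs no memory after the collection in which it last appeared.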

What did you expect to see?

This was discussed in the 8/3 Prometheus-WG SIG meeting. One idea was to use the NO_DATA_PRESENT staleness marker as a way to communicate the stop to the consumer. There appears to be some benefit to issuing NO_DATA_PRESENT data points for a period of time before being allowed to forget the value and erase it from memory.

Informally, I think we expect that if the same instrument/attributes pair is re-used immediately, the new start time assigned will be no earlier than the last NO_DATA_PRESENT data point that was written. Ideally, the new start time assigned will be no later than the previous collection timestamp.
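
The start-time constraint above can be sketched as a small Python function. This is only an illustration of the two bounds, not spec text or SDK code; the function name and parameters are hypothetical, and timestamps are assumed to be integers on a shared clock.

```python
from typing import Optional


def restart_start_time(last_stale_marker_ts: Optional[int],
                       prev_collection_ts: int) -> int:
    """Pick a start time for a series that is being reported again.

    Hypothetical helper. Lower bound: the last NO_DATA_PRESENT point
    written for the series, if any. Preferred upper bound: the previous
    collection timestamp.
    """
    if last_stale_marker_ts is None:
        # No staleness marker was written, so only the upper bound applies.
        return prev_collection_ts
    # Staleness markers are written at collection time, so normally
    # last_stale_marker_ts <= prev_collection_ts and this returns
    # prev_collection_ts. max() keeps the lower bound even if the
    # inputs are inconsistent.
    return max(last_stale_marker_ts, prev_collection_ts)
```

When the marker was written at the previous collection, both bounds coincide and the new start time is exactly that collection timestamp.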

jmacd, Aug 04 '22 19:08

asynchronous/delta: in this case, it is safe to stop reporting the instrument/attribute pair, but restarting the same instrument/attribute pair is complicated for the same reason as synchronous/cumulative.

In the past we've talked about asynchronous instruments being responsible for managing time series. That is, if a callback stops reporting a particular series, it's OK for the SDK to forget it and stop reporting it in both cumulative and delta cases. If the callback later starts reporting the series again, the SDK starts reporting the values, but delta aggregations don't need to report the diff between the latest value and the last reported value (before the reporting stopped): the initial reported delta is the first recorded value.

If you buy this, then it's up to the callback to understand their role in timeseries management, and understand the semantic meaning when they stop reporting a series.
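
A rough Python sketch of that behavior for the delta case, assuming the callback reports cumulative values keyed by attribute set. `AsyncDeltaConverter` is a hypothetical name, not an existing SDK class.

```python
from typing import Dict, Tuple

# Hypothetical representation of an attribute set: a tuple of (key, value) pairs.
Attrs = Tuple[Tuple[str, str], ...]


class AsyncDeltaConverter:
    """Hypothetical cumulative-to-delta conversion for async observations."""

    def __init__(self) -> None:
        self._last: Dict[Attrs, float] = {}

    def collect(self, observations: Dict[Attrs, float]) -> Dict[Attrs, float]:
        deltas: Dict[Attrs, float] = {}
        for attrs, value in observations.items():
            prev = self._last.get(attrs)
            # A series with no remembered value (new, or forgotten after the
            # callback stopped reporting it) reports its first value as the delta.
            deltas[attrs] = value if prev is None else value - prev
        # Remember only what this callback reported; anything it omitted
        # is forgotten and its memory is released.
        self._last = dict(observations)
        return deltas
```

For example, observing 10, then 15, then nothing, then 40 for the same attribute set yields deltas of 10, 5, no point, and 40.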

jack-berg, Aug 05 '22 20:08

Looking at the new cardinality limits in the spec, specifically for synchronous instruments it says

Views of synchronous instruments with cumulative aggregation temporality MUST continue to export all attribute sets that were observed prior to the beginning of overflow. Metric events corresponding with attribute sets that were not observed prior to the overflow will be reflected in a single data point described by (only) the overflow attribute.

IOW always keep the oldest streams and collapse any new ones into the overflow. However, there are high cardinality use cases where you may never see the old streams again and you want to free up that memory for something new.
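
To make the "keep the oldest, collapse the rest" behavior concrete, here is a simplified Python sketch. `LimitedCumulativeSum` and `limit` are hypothetical, the overflow point is kept outside the limit for simplicity, and I am assuming the overflow attribute is `otel.metric.overflow=true`.

```python
from typing import Dict, Tuple

# Hypothetical representation of an attribute set: a tuple of (key, value) pairs.
Attrs = Tuple[Tuple[str, object], ...]

# Assumed overflow attribute set.
OVERFLOW: Attrs = (("otel.metric.overflow", True),)


class LimitedCumulativeSum:
    """Hypothetical cumulative sum with a cardinality limit."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._sums: Dict[Attrs, float] = {}
        self._overflow = 0.0
        self._overflowed = False

    def add(self, attrs: Attrs, value: float) -> None:
        if attrs in self._sums:
            # Seen before overflow: keeps its own stream indefinitely.
            self._sums[attrs] += value
        elif len(self._sums) < self._limit:
            self._sums[attrs] = value
        else:
            # First seen after the limit was reached: folded into overflow.
            self._overflow += value
            self._overflowed = True

    def collect(self) -> Dict[Attrs, float]:
        points = dict(self._sums)
        if self._overflowed:
            points[OVERFLOW] = self._overflow
        return points
```

Note that `_sums` never shrinks in this sketch, which is exactly the memory concern: a stream that is never written again still holds its slot.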

This is a MUST requirement in the spec right now. I think we need to discuss this issue before marking the cardinality limits section stable.

aabmass, Jun 29 '23 18:06