opentelemetry-specification icon indicating copy to clipboard operation
opentelemetry-specification copied to clipboard

Metric (observer) collections should be independent from metric export

Open Oberon00 opened this issue 4 years ago • 9 comments

(the metric spec is still very unfinished, so this might already be included in one of the more general metric spec issues)

Consider this use case:

  • I want to collect JVM memory metrics (e.g. currently committed memory bytes) every 10 seconds
  • I want to export this (and all other metrics) aggregated (min/max/avg) every six collections i.e. every minute.

I think the spec currently assumes (or assumed before it was deleted) that metric collection and export happens in the same step, but this seems overly restrictive.

What is more, it would make sense to have each observer instrument run potentially on it's own schedule. E.g. some metrics might profit more from more granularity than others, some might be expensive to retrieve, others cheap.

(This paragraph might warrant it's own issue:) A good implementation of this would also ensure that collections whose intervals are a proper fraction or multiple of the export internval do not in-deterministically move between exports. For example, in the above JVM use case, an implementation that does not do any synchronization between export and collection will have a race between every 6th collection of metrics and export which are both supposed to happen at T=60. Such an implementation would sometimes send an aggregate of only 5 and then the next time up to 7 collections. This might be undesirable and should be avoidable without much performance overhead.

Oberon00 avatar Feb 11 '21 15:02 Oberon00

@Oberon00 who is "I" in this example "instrumentation author"? "application owner"?

bogdandrutu avatar Feb 11 '21 16:02 bogdandrutu

@bogdandrutu very good question actually 🤔 The two "I" might not even be the same. For collection interval, we are interested in:

  1. How fluctuating is this metric actually (CPU usage vs number of loaded classes) -- Usually the instrumentation author has a good grasp on this, but the application owner might want to influence this e.g. to debug specific problems.
  2. How expensive is it to collect the metric -- Usually the instrumentation author has a good estimate here too, but it might depend on the deployment situation, etc. (e.g. what if we are deployed to a machine with 256 cores -- collecting CPU usage might become unexpectedly expensive)
  3. How interesting is this metric for the use case of the application owner

For export interval, the instrumentation author has no clue. I think this is more for the application owner to decide (that would be a typical environment variable setting). The backend's current/usual load factor could also be a factor here, as you can't export so often that you overwhelm it.

A recommended minimum export interval can also be calculated as the greatest common divisor of all collection intervals by the SDK which knows all registered observers. However, that would often be 1, if we do not provide some guidance on selecting the intervals.

One way might be to allow instrumentation authors only coarse control like HIGH_FREQUENCY, FREQUENT, NORMAL, INFREQUENT, SEMI_STATIC.

Oberon00 avatar Feb 11 '21 16:02 Oberon00

I think we have/want the ability to create multiple instances of a "pipeline" (aka SDK instance), so that each pipeline can be configure differently. This includes setting the SetPushIntervals (currently one setting for both collection and export), aggregators, exporters, etc... We should also allow pipelines to filter (include/exclude) for specific metric.

I think this issue maybe related to #1437 as well. We need ability to let each Library set it's "scope" so that it can be "attached" to a SDK / pipeline instance.

victlu avatar Feb 17 '21 18:02 victlu

I think we have/want the ability to create multiple instances of a "pipeline" (aka SDK instance), so that each pipeline can be configure differently

This is something different than this issue. In this issue I propose to decouple the first pipeline stage (collection) from the second (export).

Oberon00 avatar Feb 17 '21 18:02 Oberon00

I know that @bogdandrutu himself has called for making the collection interval variable on a per-instrument basis. As you say @Oberon00, I believe this is about configuration in the "Accumulator", which flushes synchronous and asynchronous data into a processor and then exporter. I think you'll find 100% agreement from the group, it's just more work to spec out how this configuration should be done.

jmacd avatar Feb 24 '21 08:02 jmacd

I agree that it is more spec work, that's why I also filed #1433 😃

Oberon00 avatar Feb 24 '21 08:02 Oberon00

We've discussed this during the 9/2/2021 Metrics SIG meeting and decided to move this out of the initial metrics spec release scope. This can be supported with https://github.com/open-telemetry/opentelemetry-specification/pull/1888#discussion_r696353281.

reyang avatar Sep 03 '21 00:09 reyang

Some thoughts...

Having "collect + export" done at the same time covers most of the use cases and should be the behavior of a periodic exporting reader. I would avoid adding such behavior to a periodic exporting reader.

Could such functionality be provided via another reader implementation? Maybe the SDK can provide an additional reader implementation which could handle this use case? If so then, do we need to specify it?

pellared avatar Aug 02 '23 08:08 pellared

Having "collect + export" done at the same time covers most of the use cases and should be the behavior of a periodic exporting reader.

I strongly disagree here. As an app owner, responsible for several critical applications subjected to request bursts, I can tell you that collecting CPU/MEM once a minute not good enough for any of my scenarios.

For the new applications, or applications that suffered changes in functionality recently or else, once a minute collection is too little because A LOT can happen in that 1 min... eg: my app reports ZERO http queue length but at the WAF I have several 503's because a burst of requests came in at a very unfortunate time and the APIM has a hard timeout of 10sec..

For the old applications, where things did not change anything in past 2 years or more and all else is stable, once a minute is basically a waste... you could capture 99% of the metrics once every 10min or more without any actual impact...

in my point of view, the person who says "this one-size-fits-all metrics collection every X amount of time for all my applications is ok" is not really hands-on involved in any kind of critical/relevant service... I've seen this behavior again and again in IT managers that will not have to actually troubleshoot anything ever again...

Leonardo-Ferreira avatar Apr 30 '24 16:04 Leonardo-Ferreira

Is this issue a duplicate of #3617? If so, can we consolidate?

jack-berg avatar Jul 31 '24 16:07 jack-berg

Duplicate of #3617

jsuereth avatar Aug 07 '24 15:08 jsuereth