[feature request] Add option to remove metric series that are no longer present in observable measurements
Package
OpenTelemetry
Is your feature request related to a problem?
No response
What is the expected behavior?
We are using the ObservableGauge provided by the .NET Meter to create dynamic metric series, which are then exported by the OTel exporter and sent via Grafana Alloy to Grafana (Mimir). As far as we can tell, there is currently no way to remove a metric series from the exporter once a measurement has been collected for it. When the measurement function stops providing a measurement for a series, the exporter continues to export the last known value until the application is restarted. We have use cases with dynamic data where we need to stop exporting a series without restarting the entire application.
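For illustration, here is a minimal sketch of the pattern (meter, instrument, and tag names are made up for this example): once a series stops appearing in the observe callback's results, the exporter keeps emitting its last observed value.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Linq;

var meter = new Meter("Example.Jobs");
var jobProgress = new ConcurrentDictionary<string, double>();

// One series per key currently present in the dictionary.
meter.CreateObservableGauge<double>("job_progress", () =>
    jobProgress.Select(kvp => new Measurement<double>(
        kvp.Value,
        new KeyValuePair<string, object?>("job.id", kvp.Key))));

jobProgress["job-1"] = 0.5;            // series {job.id="job-1"} starts being exported
jobProgress.TryRemove("job-1", out _); // expected: the series disappears on the next collect
                                       // actual: the exporter keeps exporting 0.5
```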
Which alternative solutions or features have you considered?
We have already considered setting the value of the series to a pre-defined sentinel such as 0 or -100 and then filtering out all series with that value in Grafana or in an OpenTelemetry processor (such as Grafana Alloy). This works for some of our use cases, but it still carries a lot of configuration overhead.
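A sketch of that workaround (the sentinel value and all names are illustrative, continuing the example above):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Linq;

const double Removed = -100; // reserved sentinel, filtered out in Grafana / Alloy

var meter = new Meter("Example.Jobs");
// A null value marks a job whose series should no longer be reported.
var jobProgress = new ConcurrentDictionary<string, double?>();

meter.CreateObservableGauge<double>("job_progress", () =>
    jobProgress.Select(kvp => new Measurement<double>(
        kvp.Value ?? Removed, // removed series keep reporting the sentinel
        new KeyValuePair<string, object?>("job.id", kvp.Key))));

jobProgress["job-1"] = 0.5;  // normal measurement
jobProgress["job-1"] = null; // "removed": reports -100 from now on, to be filtered downstream
```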
Additional context
No response
I believe this is the same bug shown here: https://github.com/open-telemetry/opentelemetry-dotnet/pull/5952/files
Hello @cijothomas
Thanks for pointing out your pull request. It seems you are right that the SDK does not follow the specification for asynchronous collection. The behavior described there is precisely what we need.
@Noahnc Would you have time/interest in checking and offering a PR with a fix?
I was facing the same issue and decided to tackle a fix tonight before seeing this discussion.
What led me down this path was that I tried disposing the Meter, and it partially worked: it calls System.Diagnostics.Metrics.Instrument.NotifyForUnpublishedInstrument(), which eventually sets OpenTelemetry.Metrics.Metric.Active = false and stops the associated series.
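For reference, the partial workaround looks roughly like this (a sketch; the meter and instrument names are illustrative):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("Example.Jobs");
meter.CreateObservableGauge<double>("job_progress", Observe);

// Disposing the Meter eventually deactivates every series it produced.
// Note: this stops ALL instruments on the Meter, not a single series.
meter.Dispose();

// To keep reporting the remaining series, the Meter and its instruments
// have to be re-created.
meter = new Meter("Example.Jobs");
meter.CreateObservableGauge<double>("job_progress", Observe);

static IEnumerable<Measurement<double>> Observe() =>
    new[] { new Measurement<double>(0.5, new KeyValuePair<string, object?>("job.id", "job-2")) };
```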
Disposing the Meter this way has two unwanted side effects:
- The staleness marker (called the NoRecordedValue flag in OTLP) is not set by the instrumentation, so the receiver (in our case a Collector with PrometheusRemoteWrite) falls back to its own staleness timeout logic, which repeats the stale value for a short time (5 minutes in our setup).
- Each disposed Metric leaks a slot from the default limit of 1000 metrics, as per this comment.
I'm working on a fix that defragments the metrics list after removals and sends NoRecordedValue data points when a metric turns inactive. See the very basic WIP code, which still fails some tests.
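To make the intended behavior concrete, here is a rough conceptual sketch (the types and names are stand-ins for illustration only, not the SDK's actual internals and not the WIP branch itself):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the SDK's per-metric state (illustrative only).
class MetricSlot
{
    public string Name = "";
    public bool Active = true;
    public bool StalenessMarkerSent;
}

class CollectLoop
{
    public static void Collect(List<MetricSlot> metrics)
    {
        // 1) For metrics that just turned inactive, export one final data point
        //    carrying the NoRecordedValue flag (the OTLP staleness marker), so
        //    receivers mark the series stale immediately instead of applying
        //    their own staleness timeout.
        foreach (var metric in metrics.Where(m => !m.Active && !m.StalenessMarkerSent))
        {
            Console.WriteLine($"export {metric.Name} flags=NO_RECORDED_VALUE");
            metric.StalenessMarkerSent = true;
        }

        // 2) Defragment: drop fully retired metrics so their slots no longer
        //    count against the default 1000-metric limit.
        metrics.RemoveAll(m => m.StalenessMarkerSent);
    }
}
```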
@cijothomas Is that design acceptable? There is a comment about instead keeping the removed metric and reusing it if it gets recreated, but that seemed risky because it would leak storage for metrics that rotate without reusing the same identities...
I was about to open an issue for this because it appears to be a discrepancy with the SDK specification as discussed in https://github.com/open-telemetry/opentelemetry-specification/issues/2232#issuecomment-2605265979
From the linked comment:
https://github.com/open-telemetry/opentelemetry-specification/blob/0319dea685a3d0e65ed55db7c56b80196ae5eefe/specification/metrics/sdk.md?plain=1#L769-L772
Yes, this is a known bug, shown in https://github.com/open-telemetry/opentelemetry-dotnet/pull/5952. (I don't have the bandwidth to continue that test or the actual fix. Feel free to take it over if you prefer.)