[bug] - Memory leak
Package
OpenTelemetry
Package Version
| Package Name | Version |
|---|---|
| OpenTelemetry.Api | 1.9.0 |
| OpenTelemetry | 1.9.0 |
| OpenTelemetry.Exporter.Prometheus.HttpListener | 1.9.0-beta.2 |
| OpenTelemetry.Extensions.Hosting | 1.9.0 |
| OpenTelemetry.Instrumentation.Runtime | 1.9.0 |
Runtime Version
net8.0
Description
I need to frequently call Dispose() on my Meter instance, because today Meter doesn't provide a way to delete an existing metric dynamically. See: https://github.com/dotnet/runtime/issues/83822.
When I call meter.Dispose(), the .NET runtime notifies each previously created instrument that it is no longer published.
See: https://github.com/dotnet/runtime/blob/9e59acb298c20658788567e0c6f0793fe97d37f6/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/Meter.cs#L519
The OpenTelemetry MetricReader (in my case BaseExportingMetricReader) then deactivates the Metric:
See: https://github.com/open-telemetry/opentelemetry-dotnet/blob/5dff99f8a0b26ff75248d5f2e478b9c3c42f5678/src/OpenTelemetry/Metrics/Reader/MetricReaderExt.cs#L30
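For illustration, here is a minimal, self-contained sketch (not the SDK's reader, just the raw System.Diagnostics.Metrics API) showing that Meter.Dispose() raises MeasurementsCompleted for every instrument a listener has enabled, which is the notification the MetricReader reacts to:

```csharp
using System;
using System.Diagnostics.Metrics;

// Minimal sketch: observe the MeasurementsCompleted notification that
// Meter.Dispose() sends for every instrument a listener has enabled.
using var listener = new MeterListener();

listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Test")
    {
        l.EnableMeasurementEvents(instrument);
    }
};

listener.MeasurementsCompleted = (instrument, state) =>
{
    // The OpenTelemetry MetricReader reacts to this callback by
    // deactivating the corresponding Metric.
    Console.WriteLine($"Instrument '{instrument.Name}' is no longer published.");
};

listener.Start();

var meter = new Meter("Test");
meter.CreateObservableGauge("requests", () => new Measurement<double>(1.0));

meter.Dispose(); // triggers MeasurementsCompleted for "requests"
```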
But there seems to be a memory leak: when I run my program for a long time, the memory keeps growing. When I debug my app, I see a lot of unreleased MetricPoint[] instances. The longer my application runs, the more memory is consumed by MetricPoint[] instances alone.
It seems that when meter.Dispose() is called, the OpenTelemetry metrics are not released properly.
Screenshot of my debugger after 10 minutes of uptime:
Screenshot of my debugger after 60 minutes of uptime (122 MB of MetricPoint[]):
Steps to Reproduce
public static void Main(string[] args)
{
    Meter? meter = null;

    var meterProviderBuilder = Sdk
        .CreateMeterProviderBuilder()
        .AddMeter("Test")
        .SetResourceBuilder(
            ResourceBuilder.CreateDefault()
                .AddService(serviceName: "Test"));

    meterProviderBuilder.AddPrometheusHttpListener();
    meterProviderBuilder.AddRuntimeInstrumentation();

    var meterProvider = meterProviderBuilder.Build();

    while (true)
    {
        meter?.Dispose();
        meter = new Meter("Test");

        var rd = new Random();
        meter.CreateObservableGauge(
            "requests",
            () => new[]
            {
                new Measurement<double>(rd.NextDouble() * rd.Next(100)),
            },
            description: "Request per second");

        // After a couple of minutes, the MetricReader contains a lot of MetricPoint[],
        // even though we dispose the Meter after each iteration.
        Thread.Sleep(200);
    }
}
Expected Result
All previously allocated MetricPoint instances are released.
Actual Result
Memory leak
Additional Context
No response
For tracking : https://github.com/LGouellec/streamiz/issues/361
meter?.Dispose(); meter = new Meter("Test");
It looks like a new Meter with the same name is created right after disposing the old one? Not sure what the intent is here.
@cijothomas
meter?.Dispose(); meter = new Meter("Test");
It's just a reproducer; my actual use case is exposing OpenTelemetry metrics for a Kafka Streams application. I don't know if you are familiar with Kafka, but the partition assignment can change, which means you have to drop some metrics in certain cases.
Today, we can't drop a metric from a specific Meter instance. Open issue here: https://github.com/dotnet/runtime/issues/83822 Discussion here: https://github.com/open-telemetry/opentelemetry-specification/issues/3062
That's why, for now, I dispose the current Meter instance and recreate it from scratch on each iteration. With that, I don't need to drop metrics; I just stop exposing the old ones in the next iteration.
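To illustrate the pattern, here is a purely hypothetical sketch (the OnPartitionsAssigned hook, metric name, and ReadLag helper are made up for this example): on every reassignment the old Meter is disposed and a fresh one exposes gauges only for the currently assigned partitions.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical helper: recreate the Meter whenever the Kafka partition
// assignment changes, so stale per-partition metrics simply disappear.
public sealed class KafkaMetricsExporter : IDisposable
{
    private Meter? meter;

    // Hypothetical hook, called on every rebalance.
    public void OnPartitionsAssigned(IReadOnlyCollection<int> partitions)
    {
        // Disposing the old Meter stops publication of all of its instruments...
        this.meter?.Dispose();

        // ...and a new Meter exposes gauges only for the current assignment.
        this.meter = new Meter("Test");
        foreach (var partition in partitions)
        {
            this.meter.CreateObservableGauge(
                $"kafka.partition.{partition}.lag",   // hypothetical metric name
                () => new Measurement<double>(ReadLag(partition)),
                description: "Consumer lag for an assigned partition");
        }
    }

    public void Dispose() => this.meter?.Dispose();

    private static double ReadLag(int partition) => 0; // placeholder
}
```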
My point here is that when the Meter instance is disposed, the OpenTelemetry metrics are not released properly.
@cijothomas
Is there a way in OpenTelemetry to clear and dispose all the resources without closing the MeterProvider?
Hey @LGouellec!
I spent some time on this tonight. Here is what I think is going on...
- When you dispose your Meter we track the Metric as deactivated: https://github.com/open-telemetry/opentelemetry-dotnet/blob/9f41eadf03f3dcc5e76c686b61fb39849f046312/src/OpenTelemetry/Metrics/Reader/MetricReaderExt.cs#L29 But nothing happens until...
- ...a collection is run. Once we have exported the data, we clean up the Metric: https://github.com/open-telemetry/opentelemetry-dotnet/blob/9f41eadf03f3dcc5e76c686b61fb39849f046312/src/OpenTelemetry/Metrics/Reader/MetricReaderExt.cs#L275
The problem is that in your repro you are using AddPrometheusHttpListener, which is a pull exporter. No collection will run unless you cause one.
Do me a favor and switch your code to OtlpExporter and re-run the test. That is a push exporter, meaning it will run periodically and cause a cleanup. I suspect if you run that, the memory will be more stable.
Should we do something here? I'm not sure what we could do. Open to ideas. But I will say, what you are doing by constantly disposing Meters is not, I think, how any of these APIs were designed to be used 😄
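(Side note: if you need to stick with the pull-based Prometheus listener, you can cause a collection yourself by scraping it. A minimal sketch, assuming the listener's default http://localhost:9464/metrics endpoint:)

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Each scrape of the Prometheus endpoint triggers a Collect on the reader,
// which is when deactivated Metrics (from disposed Meters) get cleaned up.
// Assumes the default PrometheusHttpListener endpoint: http://localhost:9464/metrics.
using var http = new HttpClient();
while (true)
{
    var payload = await http.GetStringAsync("http://localhost:9464/metrics");
    Console.WriteLine($"Scraped {payload.Length} characters");
    await Task.Delay(TimeSpan.FromSeconds(5));
}
```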
Hello @cijothomas,
I tried with the OtlpExporter:
meterProviderBuilder.AddOtlpExporter(options =>
{
    options.Protocol = OtlpExportProtocol.Grpc;
    options.ExportProcessorType = ExportProcessorType.Batch;
});
Same result: the metrics are correctly exported to the OTel Collector, but the memory keeps growing locally due to unreleased MetricPoint instances:
Until we can delete metrics in Meter, we can't have stable memory.
One question remains: why don't you clear the metric points when you receive the notification from the Meter?
One question remains: why don't you clear the metric points when you receive the notification from the Meter?
As clarified here, MetricPoints are cleared only after the next export, not immediately upon Meter disposal, as we still want to export pending updates before throwing away the MetricPoints.
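For example, using the meter / meterProvider variables from the repro above, forcing a flush right after disposing the Meter should export the pending data and let the deactivated Metric's storage be reclaimed (sketch only):

```csharp
// Sketch: dispose the Meter, then force a collection/export so the
// deactivated Metric is cleaned up instead of lingering until the next cycle.
meter.Dispose();

// ForceFlush triggers Collect on the registered MetricReader(s).
meterProvider.ForceFlush(timeoutMilliseconds: 5000);

meter = new Meter("Test");
```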
Are you seeing that the OpenTelemetry SDK is not releasing MetricPoints even after export?
Are you seeing that the OpenTelemetry SDK is not releasing MetricPoints even after export?
Yes, exactly: the memory keeps growing and the number of MetricPoint[] instances continues to increase over time, even after several minutes. So I'm pretty sure the MetricPoints are not cleared after the next export.
Are you seeing that the OpenTelemetry SDK is not releasing MetricPoints even after export?
Yes, exactly: the memory keeps growing and the number of MetricPoint[] instances continues to increase over time, even after several minutes. So I'm pretty sure the MetricPoints are not cleared after the next export.
Could you confirm that Meter is disposed as well? If Meter is not disposed, then MetricPoints are not freed up by default.
Running this:
public static void Main()
{
    Meter? meter = null;

    var meterProviderBuilder = Sdk
        .CreateMeterProviderBuilder()
        .AddMeter("Test")
        .SetResourceBuilder(
            ResourceBuilder.CreateDefault()
                .AddService(serviceName: "Test"));

    meterProviderBuilder.AddOtlpExporter((exporterOptions, readerOptions) =>
    {
        readerOptions.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 5000;
    });

    using var meterProvider = meterProviderBuilder.Build();

    while (true)
    {
        meter?.Dispose();
        meter = new Meter("Test");

        var rd = new Random();
        meter.CreateObservableGauge(
            "requests",
            () => new[]
            {
                new Measurement<double>(rd.NextDouble() * rd.Next(100)),
            },
            description: "Request per second");

        // will see after couple of minutes that the MetricReader contains a lot of MetricPoint[], even if we dispose the Meter after each iteration
        Thread.Sleep(200);
    }
}
Seems to be doing its job nicely:
The red bars on "Object delta (%change)" are objects being collected.
@CodeBlanch Running the same snippet of code, the number of MetricPoint[] instances is still huge, no? 54 MB just for MetricPoint[].
375 MetricPoint[] arrays, each containing 2002 MetricPoint instances.
Running .NET 8.0
@cijothomas The Meter is disposed every time, as in the snippet of code.
@CodeBlanch It seems that adding GC.Collect(); helps reduce the memory footprint:
Example:
public static void Main()
{
    Meter? meter = null;

    var meterProviderBuilder = Sdk
        .CreateMeterProviderBuilder()
        .AddMeter("Test")
        .SetResourceBuilder(
            ResourceBuilder.CreateDefault()
                .AddService(serviceName: "Test"));

    meterProviderBuilder.AddOtlpExporter((exporterOptions, readerOptions) =>
    {
        readerOptions.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 5000;
    });

    using var meterProvider = meterProviderBuilder.Build();

    while (true)
    {
        meter?.Dispose();
        GC.Collect();
        meter = new Meter("Test");

        var rd = new Random();
        meter.CreateObservableGauge(
            "requests",
            () => new[]
            {
                new Measurement<double>(rd.NextDouble() * rd.Next(100)),
            },
            description: "Request per second");

        // will see after couple of minutes that the MetricReader contains a lot of MetricPoint[], even if we dispose the Meter after each iteration
        Thread.Sleep(200);
    }
}
But I have a question: I have been running my application for more than 15 minutes, and at some point the metrics seem to no longer be exported. Do you know what's going on? OTel Collector logs:
2024-11-05T21:52:31.269Z info [email protected]/service.go:135 Setting up own telemetry...
2024-11-05T21:52:31.274Z info telemetry/metrics.go:70 Serving metrics {"address": "localhost:8888", "metrics level": "Normal"}
2024-11-05T21:52:31.275Z info builders/builders.go:26 Development component. May change in the future. {"kind": "exporter", "data_type": "metrics", "name": "debug"}
2024-11-05T21:52:31.283Z info [email protected]/service.go:207 Starting otelcol... {"Version": "0.112.0", "NumCPU": 8}
2024-11-05T21:52:31.283Z info extensions/extensions.go:39 Starting extensions...
2024-11-05T21:52:31.283Z info extensions/extensions.go:42 Extension is starting... {"kind": "extension", "name": "health_check"}
2024-11-05T21:52:31.283Z info [email protected]/healthcheckextension.go:33 Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-11-05T21:52:31.283Z info extensions/extensions.go:59 Extension started. {"kind": "extension", "name": "health_check"}
2024-11-05T21:52:31.283Z info extensions/extensions.go:42 Extension is starting... {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.283Z info [email protected]/zpagesextension.go:55 Registered zPages span processor on tracer provider {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.284Z info [email protected]/zpagesextension.go:65 Registered Host's zPages {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.284Z info [email protected]/zpagesextension.go:77 Starting zPages extension {"kind": "extension", "name": "zpages", "config": {"Endpoint":":55679","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0}}
2024-11-05T21:52:31.284Z warn [email protected]/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. {"kind": "extension", "name": "zpages", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2024-11-05T21:52:31.285Z info extensions/extensions.go:59 Extension started. {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.285Z info extensions/extensions.go:42 Extension is starting... {"kind": "extension", "name": "pprof"}
2024-11-05T21:52:31.285Z info [email protected]/pprofextension.go:61 Starting net/http/pprof server {"kind": "extension", "name": "pprof", "config": {"TCPAddr":{"Endpoint":":1888","DialerConfig":{"Timeout":0}},"BlockProfileFraction":0,"MutexProfileFraction":0,"SaveToFile":""}}
2024-11-05T21:52:31.285Z info extensions/extensions.go:59 Extension started. {"kind": "extension", "name": "pprof"}
2024-11-05T21:52:31.285Z warn [email protected]/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. {"kind": "exporter", "data_type": "metrics", "name": "prometheus", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2024-11-05T21:52:31.285Z warn [email protected]/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. {"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2024-11-05T21:52:31.285Z info [email protected]/otlp.go:112 Starting GRPC server {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "0.0.0.0:4317"}
2024-11-05T21:52:31.286Z info healthcheck/handler.go:132 Health Check state change {"kind": "extension", "name": "health_check", "status": "ready"}
2024-11-05T21:52:31.288Z info [email protected]/service.go:230 Everything is ready. Begin running and processing data.
2024-11-05T21:54:17.696Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-11-05T21:54:22.534Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:27.568Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:32.598Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:37.644Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:42.476Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:47.515Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:52.540Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:57.582Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:02.623Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:07.649Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:12.472Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:17.504Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:22.546Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:27.570Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:32.611Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:43.479Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-11-05T21:55:47.498Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-11-05T21:55:52.541Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:57.578Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:02.630Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:07.661Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:12.496Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:17.527Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:22.566Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:27.607Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:32.656Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:37.504Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:42.524Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:47.545Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:52.580Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:57.630Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:02.474Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:07.505Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:12.537Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:17.570Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:22.597Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:27.630Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:32.453Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:37.486Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:42.529Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:47.560Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
Why the "request" metric stops being emitted while the .Net application is running :
It seems that adding GC.Collect(); helps reduce the memory footprint:
This is by design.
When you dispose of an object, it is marked as available for garbage collection.
But if there is plenty of memory available (no memory pressure), then there is no reason to waste time running garbage collection.
i.e. An 8 GB machine will only run half as many GCs as a 4 GB machine, as it can allow the memory to grow bigger.
Don't manually run GC.Collect() -- the system will run it when needed, i.e. when there is memory pressure (memory is getting full).
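If you want to check whether that memory is really leaked or just uncollected garbage, compare the heap size with and without forcing a full collection; a small sketch:

```csharp
using System;

// Heap size including dead-but-uncollected objects (e.g. old MetricPoint[] arrays).
long before = GC.GetTotalMemory(forceFullCollection: false);

// Forcing a full collection first reveals the live set. A large drop here means
// the MetricPoint[] arrays were reclaimable garbage rather than a true leak.
long after = GC.GetTotalMemory(forceFullCollection: true);

Console.WriteLine($"Before forced GC: {before / 1024 / 1024} MB, after: {after / 1024 / 1024} MB");
```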
Having the same observation. Frequently disposing and recreating the meter creates a memory leak. Also have the same requirement as OP.
Related: https://github.com/open-telemetry/opentelemetry-dotnet/issues/2524#issuecomment-2856946837