
[bug] - Memory leak

Open LGouellec opened this issue 1 year ago • 14 comments

Package

OpenTelemetry

Package Version

Package Name Version
OpenTelemetry.Api 1.9.0
OpenTelemetry 1.9.0
OpenTelemetry.Exporter.Prometheus.HttpListener 1.9.0-beta.2
OpenTelemetry.Extensions.Hosting 1.9.0
OpenTelemetry.Instrumentation.Runtime 1.9.0

Runtime Version

net8.0

Description

I need to Dispose() my Meter instance frequently, because today Meter doesn't provide a way to delete an existing metric dynamically. See: https://github.com/dotnet/runtime/issues/83822

When I call meter.Dispose(), the dotnet runtime notifies each previously created instrument that it is no longer published. See: https://github.com/dotnet/runtime/blob/9e59acb298c20658788567e0c6f0793fe97d37f6/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/Meter.cs#L519
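
For illustration, here is a minimal standalone sketch of that notification (using a hypothetical "Demo" meter and plain System.Diagnostics.Metrics, no OpenTelemetry involved): disposing the Meter raises MeasurementsCompleted for each of its instruments.

using System;
using System.Diagnostics.Metrics;

class MeasurementsCompletedDemo
{
    static void Main()
    {
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) => l.EnableMeasurementEvents(instrument);
        listener.MeasurementsCompleted = (instrument, state) =>
        {
            // This is the notification the OpenTelemetry SDK reacts to by deactivating the corresponding Metric.
            Console.WriteLine($"'{instrument.Name}' is no longer published");
        };
        listener.Start();

        var meter = new Meter("Demo");
        meter.CreateObservableGauge("requests", () => new Random().NextDouble());

        meter.Dispose(); // fires MeasurementsCompleted for every instrument of this Meter
    }
}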

The OpenTelemetry MetricReader (in my case BaseExportingMetricReader) then deactivates the metric. See: https://github.com/open-telemetry/opentelemetry-dotnet/blob/5dff99f8a0b26ff75248d5f2e478b9c3c42f5678/src/OpenTelemetry/Metrics/Reader/MetricReaderExt.cs#L30

But it seems there is a memory leak: when I run my program for a long time, the memory keeps growing. When I debug my app, I see a lot of unreleased MetricPoint[] instances, and the longer my application runs, the more memory is consumed by MetricPoint[] instances alone. It seems that when meter.Dispose() is called, the OpenTelemetry metrics are not released properly.

Screenshot of my debugger after 10 minutes of uptime: image

Screenshot of my debugger after 60 minutes of uptime (122 MB of MetricPoint[]): image

Steps to Reproduce

    public static void Main(string[] args)
    {
        Meter? meter = null;
        var meterProviderBuilder = Sdk
            .CreateMeterProviderBuilder()
            .AddMeter("Test")
            .SetResourceBuilder(
                ResourceBuilder.CreateDefault()
                    .AddService(serviceName: "Test"));
        
        meterProviderBuilder.AddPrometheusHttpListener();
        meterProviderBuilder.AddRuntimeInstrumentation();
        
        var meterProvider = meterProviderBuilder.Build();
        
        while(true){
            meter?.Dispose();
            meter = new Meter("Test");
            var rd = new Random();
            
            meter.CreateObservableGauge(
                "requests", 
                () => new[] {
                    new Measurement<double>(rd.NextDouble() * rd.Next(100))
                },
                description: "Request per second");
            
            // After a couple of minutes, the MetricReader contains a lot of MetricPoint[] instances, even though we dispose the Meter on each iteration
            Thread.Sleep(200);
        }
    }

Expected Result

All previously created MetricPoint instances are released

Actual Result

Memory leak

Additional Context

No response

LGouellec avatar Oct 23 '24 19:10 LGouellec

For tracking : https://github.com/LGouellec/streamiz/issues/361

LGouellec avatar Oct 23 '24 19:10 LGouellec

meter?.Dispose(); meter = new Meter("Test");

This looks like a new Meter with the same name is created right after disposing it? Not sure what the intent is here.

cijothomas avatar Oct 24 '24 01:10 cijothomas

@cijothomas

meter?.Dispose(); meter = new Meter("Test");

It's just a reproducer; my real use case is exposing OpenTelemetry metrics for a Kafka Streams application. I don't know if you are familiar with Kafka, but the assignment of partitions can change, which means you have to drop some metrics in some cases.

Today, we can't drop a metric from a specific Meter instance. Open issue here: https://github.com/dotnet/runtime/issues/83822 Discussion here: https://github.com/open-telemetry/opentelemetry-specification/issues/3062

So that's why, for now, I dispose the current Meter instance and recreate it from scratch on each iteration. With that, I don't need to drop metrics; I just stop exposing the old metrics in the next iteration.

My point here is that when the Meter instance is disposed, the OpenTelemetry metrics are not released properly.

LGouellec avatar Oct 24 '24 14:10 LGouellec

@cijothomas

Is there a way in OpenTelemetry to clear and dispose all the resources without closing the MeterProvider?

LGouellec avatar Oct 25 '24 16:10 LGouellec

Hey @LGouellec!

I spent some time on this tonight. Here is what I think is going on...

  • When you dispose your Meter, we track the Metric as deactivated: https://github.com/open-telemetry/opentelemetry-dotnet/blob/9f41eadf03f3dcc5e76c686b61fb39849f046312/src/OpenTelemetry/Metrics/Reader/MetricReaderExt.cs#L29 But nothing happens until...

  • ...a collection is run. Once we have exported the data, we clean up the Metric: https://github.com/open-telemetry/opentelemetry-dotnet/blob/9f41eadf03f3dcc5e76c686b61fb39849f046312/src/OpenTelemetry/Metrics/Reader/MetricReaderExt.cs#L275

The problem is that in your repro you are using AddPrometheusHttpListener, which is a pull exporter. No collection will run unless you cause one.

Do me a favor and switch your code to the OtlpExporter and re-run the test. That is a push exporter, meaning it will run periodically and trigger a cleanup. I suspect that if you run that, the memory will be more stable.
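
A minimal sketch of what that could look like, building on the repro above (the 5-second interval is just an example, and the ForceFlush call is one way to trigger a collection if you keep the pull exporter):

// Option 1: push exporter -- the periodic reader collects (and cleans up
// deactivated Metrics) on its own schedule.
meterProviderBuilder.AddOtlpExporter((exporterOptions, readerOptions) =>
{
    readerOptions.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 5000;
});

// Option 2: keep the Prometheus HttpListener but trigger a collection yourself,
// e.g. by scraping the endpoint or by calling ForceFlush on the provider.
using var meterProvider = meterProviderBuilder.Build();
meterProvider.ForceFlush(timeoutMilliseconds: 10_000);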

Should we do something here? I'm not sure what we could do. Open to ideas. But I will say, what you are doing by constantly disposing Meters is not, I think, how any of these APIs were designed to be used 😄

CodeBlanch avatar Oct 26 '24 06:10 CodeBlanch

Hello @cijothomas ,

I tried with the OtlpExporter:

meterProviderBuilder.AddOtlpExporter(options => {
    options.Protocol = OtlpExportProtocol.Grpc;
    options.ExportProcessorType = ExportProcessorType.Batch;
});

Same result: the metrics are correctly exported to the OTel Collector, but the memory keeps growing locally due to unreleased MetricPoint instances: image

Until we can delete metrics from a Meter, we can't have stable memory.

One remaining question: why don't you clear the MetricPoints when you receive the notification from the Meter?

LGouellec avatar Oct 28 '24 21:10 LGouellec

One remaining question: why don't you clear the MetricPoints when you receive the notification from the Meter?

As clarified here, MetricPoints are cleared only after the next export, not immediately upon Meter dispose, because we still want to export pending updates before throwing away the MetricPoints.
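
As an illustrative sketch building on the repro above (not something the SDK requires), a collection could be forced right after disposing the Meter, so the deactivated Metric is exported and its MetricPoints can be cleaned up promptly instead of waiting for the next periodic export:

meter.Dispose();                  // the Metric is now marked as deactivated
meterProvider.ForceFlush(5_000);  // runs a Collect + export, after which the SDK can drop its MetricPoints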

Are you seeing that the OpenTelemetry SDK is not releasing MetricPoints even after export?

cijothomas avatar Oct 29 '24 04:10 cijothomas

Are you seeing that the OpenTelemetry SDK is not releasing MetricPoints even after export?

Yes, exactly: the memory keeps growing and the number of MetricPoint[] instances continues to increase over time, even after many minutes. So no, I'm pretty sure the MetricPoints are not cleared after the next export.

LGouellec avatar Oct 29 '24 17:10 LGouellec

Are you seeing that the OpenTelemetry SDK is not releasing MetricPoints even after export?

Yes, exactly: the memory keeps growing and the number of MetricPoint[] instances continues to increase over time, even after many minutes. So no, I'm pretty sure the MetricPoints are not cleared after the next export.

Could you confirm that the Meter is disposed as well? If the Meter is not disposed, then MetricPoints are not freed up by default.

cijothomas avatar Oct 31 '24 22:10 cijothomas

Running this:

public static void Main()
{
    Meter? meter = null;

    var meterProviderBuilder = Sdk
        .CreateMeterProviderBuilder()
        .AddMeter("Test")
        .SetResourceBuilder(
            ResourceBuilder.CreateDefault()
                .AddService(serviceName: "Test"));

    meterProviderBuilder.AddOtlpExporter((exporterOptions, readerOptions) =>
    {
        readerOptions.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 5000;
    });

    using var meterProvider = meterProviderBuilder.Build();

    while (true)
    {
        meter?.Dispose();
        meter = new Meter("Test");
        var rd = new Random();

        meter.CreateObservableGauge(
            "requests",
            () => new[]
            {
                new Measurement<double>(rd.NextDouble() * rd.Next(100)),
            },
            description: "Request per second");

        // After a couple of minutes, the MetricReader contains a lot of MetricPoint[] instances, even though we dispose the Meter on each iteration
        Thread.Sleep(200);
    }
}

Seems to be doing its job nicely:

image

The red bars on "Object delta (%change)" are objects being collected.

CodeBlanch avatar Oct 31 '24 23:10 CodeBlanch

@CodeBlanch Running the same snippet of code, the number of MetricPoint[] instances is huge, no? 54 MB just for MetricPoint[]: 375 MetricPoint[] arrays, each containing 2002 MetricPoint instances. image image

Running .NET 8.0

@cijothomas The Meter is disposed every time, as in the snippet of code.

LGouellec avatar Nov 04 '24 20:11 LGouellec

@CodeBlanch it seems that adding:

GC.Collect();

helps to reduce the memory footprint:

Example:

        public static void Main()
        {
            Meter? meter = null;

            var meterProviderBuilder = Sdk
                .CreateMeterProviderBuilder()
                .AddMeter("Test")
                .SetResourceBuilder(
                    ResourceBuilder.CreateDefault()
                        .AddService(serviceName: "Test"));

            meterProviderBuilder.AddOtlpExporter((exporterOptions, readerOptions) =>
            {
                readerOptions.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 5000;
            });

            using var meterProvider = meterProviderBuilder.Build();

            while (true)
            {
                meter?.Dispose();
                GC.Collect();
                meter = new Meter("Test");
                var rd = new Random();

                meter.CreateObservableGauge(
                    "requests",
                    () => new[]
                    {
                        new Measurement<double>(rd.NextDouble() * rd.Next(100)),
                    },
                    description: "Request per second");

                // After a couple of minutes, the MetricReader contains a lot of MetricPoint[] instances, even though we dispose the Meter on each iteration
                Thread.Sleep(200);
            }
        }

But I have a question: I have been running my application for more than 15 minutes, and at some point the metrics are no longer exported. Do you know what's going on? OTel Collector logs:

2024-11-05T21:52:31.269Z        info    [email protected]/service.go:135 Setting up own telemetry...
2024-11-05T21:52:31.274Z        info    telemetry/metrics.go:70 Serving metrics {"address": "localhost:8888", "metrics level": "Normal"}
2024-11-05T21:52:31.275Z        info    builders/builders.go:26 Development component. May change in the future.        {"kind": "exporter", "data_type": "metrics", "name": "debug"}
2024-11-05T21:52:31.283Z        info    [email protected]/service.go:207 Starting otelcol...     {"Version": "0.112.0", "NumCPU": 8}
2024-11-05T21:52:31.283Z        info    extensions/extensions.go:39     Starting extensions...
2024-11-05T21:52:31.283Z        info    extensions/extensions.go:42     Extension is starting...        {"kind": "extension", "name": "health_check"}
2024-11-05T21:52:31.283Z        info    [email protected]/healthcheckextension.go:33        Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-11-05T21:52:31.283Z        info    extensions/extensions.go:59     Extension started.      {"kind": "extension", "name": "health_check"}
2024-11-05T21:52:31.283Z        info    extensions/extensions.go:42     Extension is starting...        {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.283Z        info    [email protected]/zpagesextension.go:55  Registered zPages span processor on tracer provider     {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.284Z        info    [email protected]/zpagesextension.go:65  Registered Host's zPages        {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.284Z        info    [email protected]/zpagesextension.go:77  Starting zPages extension       {"kind": "extension", "name": "zpages", "config": {"Endpoint":":55679","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0}}
2024-11-05T21:52:31.284Z        warn    [email protected]/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. {"kind": "extension", "name": "zpages", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2024-11-05T21:52:31.285Z        info    extensions/extensions.go:59     Extension started.      {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.285Z        info    extensions/extensions.go:42     Extension is starting...        {"kind": "extension", "name": "pprof"}
2024-11-05T21:52:31.285Z        info    [email protected]/pprofextension.go:61    Starting net/http/pprof server  {"kind": "extension", "name": "pprof", "config": {"TCPAddr":{"Endpoint":":1888","DialerConfig":{"Timeout":0}},"BlockProfileFraction":0,"MutexProfileFraction":0,"SaveToFile":""}}
2024-11-05T21:52:31.285Z        info    extensions/extensions.go:59     Extension started.      {"kind": "extension", "name": "pprof"}
2024-11-05T21:52:31.285Z        warn    [email protected]/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. {"kind": "exporter", "data_type": "metrics", "name": "prometheus", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2024-11-05T21:52:31.285Z        warn    [email protected]/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. {"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2024-11-05T21:52:31.285Z        info    [email protected]/otlp.go:112       Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "0.0.0.0:4317"}
2024-11-05T21:52:31.286Z        info    healthcheck/handler.go:132      Health Check state change       {"kind": "extension", "name": "health_check", "status": "ready"}
2024-11-05T21:52:31.288Z        info    [email protected]/service.go:230 Everything is ready. Begin running and processing data.
2024-11-05T21:54:17.696Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-11-05T21:54:22.534Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:27.568Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:32.598Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:37.644Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:42.476Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:47.515Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:52.540Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:57.582Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:02.623Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:07.649Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:12.472Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:17.504Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:22.546Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:27.570Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:32.611Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:43.479Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-11-05T21:55:47.498Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-11-05T21:55:52.541Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:57.578Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:02.630Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:07.661Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:12.496Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:17.527Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:22.566Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:27.607Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:32.656Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:37.504Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:42.524Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:47.545Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:52.580Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:57.630Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:02.474Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:07.505Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:12.537Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:17.570Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:22.597Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:27.630Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:32.453Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:37.486Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:42.529Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:47.560Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
image

Why does the "requests" metric stop being emitted while the .NET application is still running:

image

LGouellec avatar Nov 05 '24 22:11 LGouellec

it seems that adding:

GC.Collect(); helps to reduce the memory footprint:

This is by design.

When you dispose of an object, it is marked as available for garbage collection.

But if there is plenty of memory available (no memory pressure), then there is no reason to waste time running garbage collection.

For example, an 8 GB machine may run only half as many GCs as a 4 GB machine, since it can allow the memory to grow larger.

Don't manually run GC.Collect() -- the system will run it when needed, i.e. when there is memory pressure (memory is getting full).
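
As an illustrative diagnostic sketch (not production advice, and not part of the SDK), comparing the reported heap size before and after a forced full collection shows whether the MetricPoint[] memory is merely awaiting collection rather than truly leaked:

long before = GC.GetTotalMemory(forceFullCollection: false);
long after = GC.GetTotalMemory(forceFullCollection: true); // forces a blocking full collection first
Console.WriteLine($"heap before: {before / (1024 * 1024)} MB, after: {after / (1024 * 1024)} MB");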

sgryphon avatar Feb 01 '25 12:02 sgryphon

Having the same observation. Frequently disposing and recreating the meter creates a memory leak. Also have the same requirement as OP.

Related: https://github.com/open-telemetry/opentelemetry-dotnet/issues/2524#issuecomment-2856946837

yannic-hamann-abb avatar May 13 '25 11:05 yannic-hamann-abb