
Memory leak caused by `otelhttp` when no Meter Provider is configured.


Description

When calling otelhttp.NewTransport() multiple times, memory is leaked.


This was supposedly resolved by https://github.com/open-telemetry/opentelemetry-go/pull/4820, but I do not think that fix covers the case where no meter provider has been configured and the default global delegating meter provider is in place.

Related: https://github.com/open-telemetry/opentelemetry-go-contrib/issues/4226
Introduced by: https://github.com/open-telemetry/opentelemetry-go-contrib/pull/4707

Environment

go version go1.22.0 darwin/arm64
	go.opentelemetry.io/contrib/instrumentation/github.com/aws/aws-sdk-go-v2/otelaws v0.49.0
	go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.49.0
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.49.0
	go.opentelemetry.io/otel v1.24.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.24.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.24.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.24.0
	go.opentelemetry.io/otel/metric v1.24.0
	go.opentelemetry.io/otel/sdk v1.24.0
	go.opentelemetry.io/otel/trace v1.24.0
	go.opentelemetry.io/proto/otlp v1.1.0

Steps To Reproduce

package bench

import (
	"testing"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric/noop"
)

func BenchmarkWithNoMeterProviderSet(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		otelhttp.NewTransport(nil)
	}
}

func BenchmarkWithNoopMeterProviderSet(b *testing.B) {
	b.ReportAllocs()
	otel.SetMeterProvider(noop.MeterProvider{})
	for i := 0; i < b.N; i++ {
		otelhttp.NewTransport(nil)
	}
}

Run each of these with -test.memprofile (or go test -memprofile) and compare the in-use heap objects/space between the two profiles, for example with go tool pprof.

Alternatively, write an application which calls otelhttp.NewTransport and analyze heap usage.
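
For example, a minimal standalone program along these lines (a hypothetical sketch, not from the original report) should show retained heap growing when no meter provider has been set:

package main

import (
	"fmt"
	"runtime"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// No meter provider is configured, so otelhttp falls back to the
	// global delegating meter provider.
	for i := 0; i < 1_000_000; i++ {
		otelhttp.NewTransport(nil)
	}

	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// With the leak present, heap usage stays high even after a GC,
	// because the global meter retains every instrument ever created.
	fmt.Printf("HeapAlloc after GC: %d bytes\n", m.HeapAlloc)
}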

Expected behavior

Calling otelhttp.NewTransport() repeatedly doesn't leak memory

Workaround

Call otel.SetMeterProvider(noop.MeterProvider{}) at the start of your application.
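
A minimal sketch of that workaround (assuming you don't need otelhttp's metrics at all):

package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric/noop"
)

func main() {
	// Installing a no-op meter provider up front prevents the global
	// delegating meter provider from accumulating instruments.
	otel.SetMeterProvider(noop.MeterProvider{})

	// Subsequent calls to otelhttp.NewTransport no longer retain memory.
	http.DefaultTransport = otelhttp.NewTransport(http.DefaultTransport)
}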

strideynet · Mar 01 '24

We also ran into this issue at Arista.

tsuna · Mar 28 '24

Sorry I didn't see this the first time around. Taking a look now

dashpole · Aug 28 '24

I am able to reproduce the issue with these benchmarks:

package bench

import (
	"testing"

	"go.opentelemetry.io/otel"
)

const (
	meterName  = "foo"
	metricName = "bar"
)

func BenchmarkSimple(b *testing.B) {
	b.ReportAllocs()
	meter := otel.Meter(meterName)
	for i := 0; i < b.N; i++ {
		meter.Int64Counter(metricName)
	}
}

It reports 1 allocation per iteration. Since we are repeatedly creating the same instrument, it should report 0 allocations. If I replace otel.Meter with a meter from a noop meter provider or from the real SDK meter provider, it reports 0 allocations. I believe the root cause is that we always append the instrument to the global meter provider's list of instruments: e.g. https://github.com/open-telemetry/opentelemetry-go/blob/d776f74f7cd9cdaf5e4bb97aeb6b7e2d102e7d68/internal/global/meter.go#L142

We should instead maintain a map of instruments, and ensure that we only keep unique instruments, similar to what we did for the SDK.
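
A rough sketch of that approach (illustrative only, not the actual patch): key instruments by their identifying fields and return the cached one on repeated creation instead of appending a new entry each time.

package globalsketch

import "sync"

// instID identifies an instrument by the fields that make it unique,
// mirroring how the SDK deduplicates instruments.
type instID struct {
	name        string
	description string
	unit        string
	kind        string
}

// int64Counter is a placeholder for the delegating counter type.
type int64Counter struct{}

// meter is a simplified stand-in for the global delegating meter.
type meter struct {
	mu          sync.Mutex
	instruments map[instID]*int64Counter // in practice, one entry kind per instrument type
}

// Int64Counter returns the cached instrument for an identical request
// instead of appending a new delegate, so repeatedly creating the same
// instrument does not grow the meter's state without bound.
func (m *meter) Int64Counter(name, description, unit string) *int64Counter {
	m.mu.Lock()
	defer m.mu.Unlock()

	id := instID{name: name, description: description, unit: unit, kind: "Int64Counter"}
	if m.instruments == nil {
		m.instruments = make(map[instID]*int64Counter)
	}
	if inst, ok := m.instruments[id]; ok {
		return inst
	}
	inst := &int64Counter{}
	m.instruments[id] = inst
	return inst
}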

dashpole · Aug 28 '24

The workaround provided above is correct.

dashpole · Aug 28 '24

I believe the root cause is that we always append the instrument to the list of instruments for the global meter provider: e.g. https://github.com/open-telemetry/opentelemetry-go/blob/d776f74f7cd9cdaf5e4bb97aeb6b7e2d102e7d68/internal/global/meter.go#L142

Looking back at my notes from when I wrote this ticket, I came to the same conclusion, so I think you're along the right lines here. Thank you for confirming the issue and looking into it 🙏

strideynet · Aug 28 '24

https://github.com/open-telemetry/opentelemetry-go/pull/5754

dashpole · Aug 28 '24

Fixed by https://github.com/open-telemetry/opentelemetry-go/pull/5754

dashpole · Aug 29 '24

I am pretty sure this is still an issue, or there is something else leaking in the Go OTel ecosystem. I will get a pprof set up, possibly tomorrow, but here is some anecdotal evidence in the meantime:

[screenshot: memory usage graph, 2024-11-07]

It is pretty easy to see from that graph when tracing was introduced. And yes, after removing our tracing implementation, memory usage is back to normal.

Here is a rough draft of our setup. Please let me know if I am doing anything egregiously dumb, but for the most part it's all pretty standard stuff taken from various docs:

go.mod

go 1.23.2

require (
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.56.0
	go.opentelemetry.io/otel v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.31.0
	go.opentelemetry.io/otel/sdk v1.31.0
	go.opentelemetry.io/otel/trace v1.31.0
)

We are wrapping otelhttp.NewHandler around the top-level muxer, so everything is traced. Yes, I know this is expensive, but it shouldn't leak memory. Eventually we will add include/exclude/drop rules so we aren't taking in so much volume (ping routes, health checks, etc.), and do more aggressive downsampling.

func NewApi(c config.Config) *Api {
	return &Api{
		c:           &c,
		controllers: []Controller{newControllers(c)},
		server: &http.Server{
			ReadTimeout:  c.ReadTimeout,
			WriteTimeout: c.WriteTimeout,
			IdleTimeout:  c.IdleTimeout,
			Addr:         fmt.Sprintf(":%d", c.Port),
			Handler:      otelhttp.NewHandler(chi.NewMux(), "INGRESS", otelhttp.WithFilter(traceFilter)),
		},
		done:       make(chan bool),
		sigChannel: make(chan os.Signal, 1024),
	}
}
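
traceFilter is not included in the snippet above; a hypothetical example of a filter compatible with otelhttp.WithFilter (which takes a func(*http.Request) bool) could look like this:

package api // hypothetical package name; this would live alongside NewApi

import "net/http"

// traceFilter decides whether a request should be traced. This is an
// illustrative example, not the filter used in the report above: it
// skips high-volume routes such as health checks and pings.
func traceFilter(r *http.Request) bool {
	switch r.URL.Path {
	case "/healthz", "/ping":
		return false // do not trace
	default:
		return true // trace everything else
	}
}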

Here is how we are initializing our trace and metrics providers once on boot:


// TracerProvider creates an OTLP exporter and configures the corresponding tracer provider.
func TracerProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
	// Set up a trace exporter
	traceExporter, err := otlptrace.New(ctx, otlptracegrpc.NewClient())
	if err != nil {
		return nil, errors.Wrap(err, "failed to create trace exporter")
	}

	// Register the trace exporter with a TracerProvider, using a batch
	// span processor to aggregate spans before export.
	tracerProvider := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
		sdktrace.WithResource(res),
		sdktrace.WithBatcher(traceExporter),
	)
	otel.SetTracerProvider(tracerProvider)
	otel.SetTextMapPropagator(
		propagation.NewCompositeTextMapPropagator(
			propagation.TraceContext{},
			propagation.Baggage{},
		))

	// Shutdown will flush any remaining spans and shut down the exporter.
	return tracerProvider.Shutdown, nil
}

// MeterProvider creates an OTLP exporter and configures the corresponding meter provider.
func MeterProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
	metricExporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		return nil, errors.Wrap(err, "failed to create metric exporter")
	}

	meterProvider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExporter)),
		sdkmetric.WithResource(res),
	)
	otel.SetMeterProvider(meterProvider)

	return meterProvider.Shutdown, nil
}

Then both are called and shut down in main:

	shutDownTracer, err := traceinstrument.TracerProvider(ctx, traceRes)
	if err != nil {
		log.Logger.Fatal("failed to create trace provider", zap.Error(err))
	}
	
	defer func(onShutdown func(ctx context.Context) error) {
		if errr := onShutdown(ctx); errr != nil {
			log.Logger.Error("error shutting down trace provider", zap.Error(errr))
		}
	}(shutDownTracer)

	shutdownTraceMetrics, err := traceinstrument.MeterProvider(ctx, traceRes)
	if err != nil {
		log.Logger.Fatal("failed to create meter provider", zap.Error(err))
	}

	defer func(onShutdown func(ctx context.Context) error) {
		if errr := onShutdown(ctx); errr != nil {
			log.Logger.Error("error shutting down metrics provider", zap.Error(errr))
		}
	}(shutdownTraceMetrics)

Note: we are also using otelhttp.NewTransport to wrap the default logging transport:

http.DefaultTransport = otelhttp.NewTransport(http.DefaultTransport)

If we remove the tracing setup, memory usage goes back to normal, so the leak is definitely related to our tracing setup.

tjsampson · Nov 08 '24

You should open a new issue for this, as this isn't the same issue since you're configuring a meter provider.

dmathieu · Nov 08 '24