opentelemetry-go-contrib
Memory leak caused by `otelhttp` when no Meter Provider is configured.
Description
Calling otelhttp.NewTransport() multiple times leaks memory.
This was supposedly resolved by https://github.com/open-telemetry/opentelemetry-go/pull/4820, but I do not think that fix covers the case where no meter provider has been configured and the default global delegating meter provider is in place.
Related: https://github.com/open-telemetry/opentelemetry-go-contrib/issues/4226
Introduced by: https://github.com/open-telemetry/opentelemetry-go-contrib/pull/4707
Environment
go version go1.22.0 darwin/arm64
go.opentelemetry.io/contrib/instrumentation/github.com/aws/aws-sdk-go-v2/otelaws v0.49.0
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.49.0
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.49.0
go.opentelemetry.io/otel v1.24.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.24.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.24.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.24.0
go.opentelemetry.io/otel/metric v1.24.0
go.opentelemetry.io/otel/sdk v1.24.0
go.opentelemetry.io/otel/trace v1.24.0
go.opentelemetry.io/proto/otlp v1.1.0
Steps To Reproduce
package bench

import (
	"testing"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric/noop"
)

func BenchmarkWithNoMeterProviderSet(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		otelhttp.NewTransport(nil)
	}
}

func BenchmarkWithNoopMeterProviderSet(b *testing.B) {
	b.ReportAllocs()
	otel.SetMeterProvider(noop.MeterProvider{})
	for i := 0; i < b.N; i++ {
		otelhttp.NewTransport(nil)
	}
}
Run each benchmark with -test.memprofile (or go test -memprofile) and compare the in-use heap objects/space between the two profiles.
Alternatively, write an application that calls otelhttp.NewTransport repeatedly and analyze its heap usage.
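For example (the profile file names are just what I used; the flags are the standard go test / pprof ones), running each benchmark separately and comparing the in-use space should surface the difference:

go test -bench=BenchmarkWithNoMeterProviderSet -memprofile=no_mp.out .
go tool pprof -top -sample_index=inuse_space no_mp.out
go test -bench=BenchmarkWithNoopMeterProviderSet -memprofile=noop_mp.out .
go tool pprof -top -sample_index=inuse_space noop_mp.out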
Expected behavior
Calling otelhttp.NewTransport() repeatedly doesn't leak memory
Workaround
Call otel.SetMeterProvider(noop.MeterProvider{}) at the start of your application.
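For anyone copy/pasting the workaround, a minimal sketch (the package layout and the DefaultTransport wrapping are illustrative; the essential part is that SetMeterProvider runs before any otelhttp constructor):

package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric/noop"
)

func main() {
	// Install an explicit no-op meter provider so otelhttp does not create its
	// instruments against the global delegating meter provider, which retains
	// every instrument it hands out until a real provider is set.
	otel.SetMeterProvider(noop.MeterProvider{})

	// Transports created after this point no longer accumulate state in the
	// global delegate.
	http.DefaultTransport = otelhttp.NewTransport(http.DefaultTransport)
}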
We also ran into this issue at Arista.
Sorry I didn't see this the first time around. Taking a look now.
I am able to reproduce the issue with these benchmarks:
package bench

import (
	"testing"

	"go.opentelemetry.io/otel"
)

const (
	meterName  = "foo"
	metricName = "bar"
)

func BenchmarkSimple(b *testing.B) {
	b.ReportAllocs()
	meter := otel.Meter(meterName)
	for i := 0; i < b.N; i++ {
		meter.Int64Counter(metricName)
	}
}
This reports 1 allocation per op. Since we are repeatedly creating the same instrument, it should report 0 allocations. If I replace otel.Meter with a no-op meter provider or the real SDK meter provider, it reports 0 allocations. I believe the root cause is that the global delegating meter always appends the instrument to its list of instruments: e.g. https://github.com/open-telemetry/opentelemetry-go/blob/d776f74f7cd9cdaf5e4bb97aeb6b7e2d102e7d68/internal/global/meter.go#L142
We should instead maintain a map of instruments, and ensure that we only keep unique instruments, similar to what we did for the SDK.
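Something along these lines illustrates the map-based approach (a rough sketch with made-up names, not the actual internal/global code; the real implementation also has to handle delegation once a real provider is installed):

package sketch

import "sync"

// instID captures the fields that identify an instrument; requests with the
// same identity should resolve to the same instrument.
type instID struct {
	name, description, unit string
}

// instCache returns an existing instrument for a given identity instead of
// appending a new entry on every call, so repeatedly creating the same
// instrument does not grow memory.
type instCache[T any] struct {
	mu    sync.Mutex
	items map[instID]T
}

func (c *instCache[T]) lookup(id instID, create func() T) T {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.items == nil {
		c.items = make(map[instID]T)
	}
	if inst, ok := c.items[id]; ok {
		return inst // same instrument requested again: reuse, don't retain a new one
	}
	inst := create()
	c.items[id] = inst
	return inst
}

That keeps an identical repeated request from retaining a new object, in the same spirit as the SDK's instrument deduplication mentioned above.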
The workaround provided above is correct.
I believe the root cause is that we always append the instrument to the list of instruments for the global meter provider: e.g. https://github.com/open-telemetry/opentelemetry-go/blob/d776f74f7cd9cdaf5e4bb97aeb6b7e2d102e7d68/internal/global/meter.go#L142
Looking back at my notes from when I wrote this ticket, I came to the same conclusion, so I think you're on the right track here. Thank you for confirming the issue and looking into it 🙏
https://github.com/open-telemetry/opentelemetry-go/pull/5754
Fixed by https://github.com/open-telemetry/opentelemetry-go/pull/5754
I am pretty sure this is still an issue, either here or somewhere else in the Go OTel ecosystem. I will get a pprof set up, possibly tomorrow, but here is some anecdotal evidence in the meantime:
It's pretty easy to see from that graph when tracing was rolled out. And yes, after removing our tracing implementation, memory usage is back to normal.
Here is a rough draft of our setup. Please let me know if I am doing anything egregiously dumb, but for the most part it's all pretty standard stuff taken from various docs:
go.mod
go 1.23.2

require (
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.56.0
	go.opentelemetry.io/otel v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.31.0
	go.opentelemetry.io/otel/sdk v1.31.0
	go.opentelemetry.io/otel/trace v1.31.0
)
We wrap otelhttp.NewHandler around the top-level muxer, so everything is traced. Yes, I know this is expensive, but it shouldn't leak memory. Eventually we will change this to include/exclude/drop routes so we aren't taking in so much volume (ping routes, health checks, etc.), and to do more aggressive downsampling.
func NewApi(c config.Config) *Api {
	return &Api{
		c:           &c,
		controllers: []Controller{newControllers(c)},
		server: &http.Server{
			ReadTimeout:  c.ReadTimeout,
			WriteTimeout: c.WriteTimeout,
			IdleTimeout:  c.IdleTimeout,
			Addr:         fmt.Sprintf(":%d", c.Port),
			Handler:      otelhttp.NewHandler(chi.NewMux(), "INGRESS", otelhttp.WithFilter(traceFilter)),
		},
		done:       make(chan bool),
		sigChannel: make(chan os.Signal, 1024),
	}
}
Here is how we are initializing our trace and metrics providers once on boot:
// TracerProvider creates an OTLP trace exporter, configures the corresponding
// tracer provider, and returns its shutdown function.
func TracerProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
	// Set up a trace exporter.
	traceExporter, err := otlptrace.New(ctx, otlptracegrpc.NewClient())
	if err != nil {
		return nil, errors.Wrap(err, "failed to create trace exporter")
	}

	// Register the trace exporter with a TracerProvider, using a batch
	// span processor to aggregate spans before export.
	tracerProvider := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
		sdktrace.WithResource(res),
		sdktrace.WithBatcher(traceExporter),
	)
	otel.SetTracerProvider(tracerProvider)

	otel.SetTextMapPropagator(
		propagation.NewCompositeTextMapPropagator(
			propagation.TraceContext{},
			propagation.Baggage{},
		))

	// Shutdown will flush any remaining spans and shut down the exporter.
	return tracerProvider.Shutdown, nil
}
// MeterProvider creates an OTLP metric exporter, configures the corresponding
// meter provider, and returns its shutdown function.
func MeterProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
	metricExporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		return nil, errors.Wrap(err, "failed to create metric exporter")
	}

	meterProvider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExporter)),
		sdkmetric.WithResource(res),
	)
	otel.SetMeterProvider(meterProvider)

	return meterProvider.Shutdown, nil
}
These are then called, and later shut down, from main:
shutDownTracer, err := traceinstrument.TracerProvider(ctx, traceRes)
if err != nil {
	log.Logger.Fatal("failed to create trace provider", zap.Error(err))
}
defer func(onShutdown func(ctx context.Context) error) {
	if errr := onShutdown(ctx); errr != nil {
		log.Logger.Error("error shutting down trace provider", zap.Error(errr))
	}
}(shutDownTracer)

shutdownTraceMetrics, err := traceinstrument.MeterProvider(ctx, traceRes)
if err != nil {
	log.Logger.Fatal("failed to create meter provider", zap.Error(err))
}
defer func(onShutdown func(ctx context.Context) error) {
	if errr := onShutdown(ctx); errr != nil {
		log.Logger.Error("error shutting down metrics provider", zap.Error(errr))
	}
}(shutdownTraceMetrics)
Note: we are also using otelhttp.NewTransport to wrap the default logging transport:
http.DefaultTransport = otelhttp.NewTransport(http.DefaultTransport)
If we remove the tracing setup, memory usage goes back to normal, so the leak is definitely somewhere in our tracing setup.
You should open a new issue for this; it isn't the same problem, since you are configuring a meter provider.