
grpc metrics exporter client starts and keeps logging context deadline exceeded after stably running for about 2.5 hours

Open tim-mwangi opened this issue 4 years ago • 4 comments

Description

The grpc metrics exporter client starts logging context deadline exceeded after running stably for about 2.5 hours and never recovers. The server it reports to, an OpenTelemetry collector, is still running, but the client disconnects and is unable to reconnect for some reason. If I start another process using the client and connect it to the collector's otlp grpc receiver endpoint, it is able to connect. This happens while running in an Ubuntu-based container in k8s (both the collector and the client run there). I could not reproduce it on macOS (darwin).

I switched to the http metrics exporter client and did not observe this problem. I also did not observe this issue when previously using v0.18.0 of opentelemetry-go.
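
For reference, the HTTP client setup looks roughly like the sketch below (it mirrors the commented-out block in the reproduction code further down; localhost:4318 is the collector's default OTLP/HTTP port and may differ in other setups):

// Sketch only: swap the gRPC client for the OTLP HTTP client.
metricsClient := otlpmetrichttp.NewClient(
	otlpmetrichttp.WithInsecure(),
	otlpmetrichttp.WithEndpoint("localhost:4318"),
)
metricsExporter := otlpmetric.NewUnstarted(metricsClient)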

Environment

  • OS: Ubuntu 20.04 based docker container
  • Env: k8s
  • Architecture: x86
  • Go Version: 1.16
  • opentelemetry-go version: v1.0.0-RC2 (upgraded from v0.18.0)
  • opentelemetry-go metric version: v0.22.0 (upgraded from v0.18.0)
  • opentelemetry collector version: v0.33.0 (upgraded from v0.29.0)

Collector log with timestamp for when the receiver started:

2021-08-29T22:00:25.175Z	info	otlpreceiver/otlp.go:75	Starting GRPC server on endpoint 0.0.0.0:4317	{"kind": "receiver", "name": "otlp"}

The metrics exporter client's "context deadline exceeded" logs start after about 2.5 hours:

2021/08/30 00:24:10 context deadline exceeded
2021/08/30 00:24:21 context deadline exceeded
2021/08/30 00:24:21 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:23 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:25 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:40 context deadline exceeded
2021/08/30 00:24:41 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:55 context deadline exceeded
2021/08/30 00:24:55 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:15 context deadline exceeded
2021/08/30 00:25:25 context deadline exceeded
2021/08/30 00:25:25 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:27 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:29 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:40 context deadline exceeded
2021/08/30 00:25:41 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:52 context deadline exceeded

They do not stop until the process running the client is killed.

Steps To Reproduce

  1. Use the configuration below (or something similar) to start an OpenTelemetry collector (v0.33.0):
# opentelemetry collector v0.33.0
extensions:
  health_check:
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 4
      http:
  opencensus:

  # Collect own metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: "my-collector"
          scrape_interval: 10s
          static_configs:
            - targets: ["0.0.0.0:8888"]
  jaeger:
    protocols:
      grpc:
      thrift_binary:
      thrift_compact:
      thrift_http:
  zipkin:

processors:
  batch:
    timeout: 200ms
    send_batch_size: 8192
    send_batch_max_size: 10000
  # Some other customer processors

exporters:
  otlp:
    endpoint: "myendpoint:443"
    compression: gzip
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: traceableai

service:
  pipelines:
    traces:
      receivers: [otlp, opencensus, zipkin, jaeger, ]
      processors: [batch] # Some other custom trace processors were here as well
      exporters: [otlp]
    metrics:
      receivers: [otlp, opencensus, prometheus]
      processors: [batch]
      exporters: [otlp, prometheus] # 2 exporters. Now testing with 1 - otlp - see if it crashes

  extensions: [health_check, pprof, zpages]
  2. Use the code snippet below to set up and start a simple metrics client that records a metric every 5 seconds. The otlpmetric package version is v0.22.0.
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	//"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	"go.opentelemetry.io/otel/metric"
	otelmetricglobal "go.opentelemetry.io/otel/metric/global"
	controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
	processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"
	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	// Set up grpc client
	metricsClient := otlpmetricgrpc.NewClient(
		otlpmetricgrpc.WithInsecure(),
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
	)
	// Set up http client. This one works
	// Uncomment to switch
	// metricsClient := otlpmetrichttp.NewClient(
	// 	otlpmetrichttp.WithInsecure(),
	// 	otlpmetrichttp.WithEndpoint("localhost:4318"),
	// )
	metricsExporter := otlpmetric.NewUnstarted(metricsClient)

	// Set up controller
	ctlr := controller.New(
		processor.New(
			simple.NewWithExactDistribution(),
			metricsExporter,
		),
		controller.WithExporter(metricsExporter),
		controller.WithCollectPeriod(2*time.Second),
	)
	// Set meter provider
	otelmetricglobal.SetMeterProvider(ctlr.MeterProvider())

	// Start controller and exporter
	handleErr(ctlr.Start(context.Background()), "failed to start controller", logger)
	handleErr(metricsClient.Start(context.Background()), "failed to start exporter", logger)

	// defer cleanup when done
	defer func() {
		// Push any last metric events to the exporter.
		handleErr(ctlr.Stop(context.Background()), "failed to stop controller", logger)
		handleErr(metricsExporter.Shutdown(context.Background()), "failed to stop metrics exporter", logger)
	}()

	// Start a ticker to send a heartbeat metric every 5 seconds or send some other metric of your choice
	ticker := time.NewTicker(5 * time.Second)
	heartBeatTickerDone := make(chan struct{}, 1)
	defer ticker.Stop()

	metricMeter := createMeter()
	valueRecorder := metric.Must(metricMeter).NewInt64ValueRecorder("heartbeat")
	for {
		select {
		case <-heartBeatTickerDone:
			//wg.Done()
			return
		case <-ticker.C:
			logger.Info("Creating metric")
			valueRecorder.Record(context.Background(), 1)
		}
	}
}

func handleErr(err error, message string, logger *zap.Logger) {
	if err != nil {
		logger.Error(message, zap.Error(err))
	}
}

func createMeter() metric.Meter {
	return otelmetricglobal.Meter("test-grpc-metric-exporter")
}
  3. See the error above after the client has been running for about 2.5 hours.

Expected behavior

It should continue to happily report metrics to the otel collector for however long it runs.

tim-mwangi avatar Sep 02 '21 21:09 tim-mwangi

By the way, this happens when I run the container on GKE k8s, but I cannot reproduce it on docker-desktop k8s, which is quite strange.

tim-mwangi avatar Sep 08 '21 14:09 tim-mwangi

This is also reproducible by using otlptracegrpc.New(ctx, otlptracegrpc.WithTimeout(some short timeout)) and not having the collector running. After a span ends, the exporter tries to connect, fails, and the log line is printed.

disq avatar Nov 18 '21 13:11 disq
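
A rough sketch of that reproduction path (the endpoint and timeout values are placeholders, not from the original comment): a short export timeout, nothing listening on the endpoint, and a span that ends so the batch processor attempts an export.

package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Short timeout and no collector running on this endpoint, so the export
	// fails and "context deadline exceeded" is logged after the span ends.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithInsecure(),
		otlptracegrpc.WithEndpoint("localhost:4317"), // nothing listening here
		otlptracegrpc.WithTimeout(2*time.Second),     // some short timeout
	)
	if err != nil {
		panic(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)

	_, span := tp.Tracer("repro").Start(ctx, "test-span")
	span.End()

	// Give the batch span processor time to attempt (and fail) the export.
	time.Sleep(10 * time.Second)
}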

The connection handling for the otlpmetricgrpc client was recently refactored (https://github.com/open-telemetry/opentelemetry-go/pull/2425) to use the grpc ClientConn to handle connections. This bug needs to be verified to still be relevant with those changes.

MrAlias avatar Dec 06 '21 19:12 MrAlias

This is still a bug for otlpmetricgrpc too. The reason is that the current implementation of exporter.Export / client.UploadMetrics by default sends all metrics in a collection/snapshot in one ExportMetricsServiceRequest. If the number of metrics is too large to finish sending within the push timeout, or if the request size exceeds the server's gRPC max receive message size (the max_recv_msg_size_mib setting in the collector config above), the request fails with a timeout error. The solution may be batching large payloads into multiple requests or streaming gRPC.

hardproblems avatar Jan 29 '22 02:01 hardproblems
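
As a rough illustration of the two limits mentioned above (values are examples, not recommendations): the push timeout can be raised on the exporter side with otlpmetricgrpc.WithTimeout, while the server-side cap is the receiver's max_recv_msg_size_mib shown in the collector config earlier in this issue.

// Sketch only: give a large ExportMetricsServiceRequest more time to complete.
// 30s is an arbitrary example value, not a recommendation.
metricsClient := otlpmetricgrpc.NewClient(
	otlpmetricgrpc.WithInsecure(),
	otlpmetricgrpc.WithEndpoint("localhost:4317"),
	otlpmetricgrpc.WithTimeout(30*time.Second),
)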