opentelemetry-go
grpc metrics exporter client starts and keeps logging "context deadline exceeded" after running stably for about 2.5 hours
Description
The gRPC metrics exporter client consistently logs context deadline exceeded after running stably for about 2.5 hours. The server it reports to, an OpenTelemetry Collector, is still running, but the client disconnects and is unable to reconnect for some reason. If I start another process using the client and point it at the collector's OTLP gRPC receiver endpoint, it is able to connect. This happens while running in an Ubuntu-based container on Kubernetes (both the collector and the client run there); I could not reproduce it on macOS (darwin).
I switched to the HTTP metrics exporter client and did not observe this problem. I was previously using v0.18.0 of opentelemetry-go and did not observe this issue either.
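For reference, the only change in that workaround was swapping the client constructor; a minimal sketch of the swap (the same lines are included, commented out, in the reproduction snippet below), assuming the collector's OTLP/HTTP receiver listens on the default port 4318:

```go
// gRPC client (hits the problem after ~2.5 hours):
metricsClient := otlpmetricgrpc.NewClient(
	otlpmetricgrpc.WithInsecure(),
	otlpmetricgrpc.WithEndpoint("localhost:4317"),
)

// HTTP client (problem not observed):
// metricsClient := otlpmetrichttp.NewClient(
// 	otlpmetrichttp.WithInsecure(),
// 	otlpmetrichttp.WithEndpoint("localhost:4318"),
// )
```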
Environment
- OS: Ubuntu 20.04 based docker container
- Env: k8s
- Architecture: x86
- Go Version: 1.16
- opentelemetry-go version: v1.0.0-RC2 (upgraded from v0.18.0)
- opentelemetry-go metric version: v0.22.0 (upgraded from v0.18.0)
- opentelemetry collector version: v0.33.0 (upgraded from v0.29.0)
Collector log with timestamp for when the receiver started:
2021-08-29T22:00:25.175Z info otlpreceiver/otlp.go:75 Starting GRPC server on endpoint 0.0.0.0:4317 {"kind": "receiver", "name": "otlp"}
The metrics exporter client's "context deadline exceeded" logs start appearing after about 2.5 hours:
2021/08/30 00:24:10 context deadline exceeded
2021/08/30 00:24:21 context deadline exceeded
2021/08/30 00:24:21 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:23 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:25 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:40 context deadline exceeded
2021/08/30 00:24:41 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:55 context deadline exceeded
2021/08/30 00:24:55 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:15 context deadline exceeded
2021/08/30 00:25:25 context deadline exceeded
2021/08/30 00:25:25 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:27 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:29 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:40 context deadline exceeded
2021/08/30 00:25:41 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:52 context deadline exceeded
They do not stop until the process running the client is killed.
Steps To Reproduce
- Use the configuration below (or similar) to start an OpenTelemetry Collector (v0.33.0):
# opentelemetry collector v0.33.0
extensions:
  health_check:
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 4
      http:
  opencensus:
  # Collect own metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: "my-collector"
          scrape_interval: 10s
          static_configs:
            - targets: ["0.0.0.0:8888"]
  jaeger:
    protocols:
      grpc:
      thrift_binary:
      thrift_compact:
      thrift_http:
  zipkin:

processors:
  batch:
    timeout: 200ms
    send_batch_size: 8192
    send_batch_max_size: 10000
  # Some other custom processors

exporters:
  otlp:
    endpoint: "myendpoint:443"
    compression: gzip
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: traceableai

service:
  pipelines:
    traces:
      receivers: [otlp, opencensus, zipkin, jaeger]
      processors: [batch] # Some other custom trace processors were here as well
      exporters: [otlp]
    metrics:
      receivers: [otlp, opencensus, prometheus]
      processors: [batch]
      exporters: [otlp, prometheus] # 2 exporters. Now testing with 1 - otlp - to see if it crashes
  extensions: [health_check, pprof, zpages]
- Use the code snippet below to set up and start a simple metrics client that records a single metric every 5 seconds. The OTLP metric packages are at v0.22.0.
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	//"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	"go.opentelemetry.io/otel/metric"
	otelmetricglobal "go.opentelemetry.io/otel/metric/global"
	controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
	processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"

	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	// Set up the gRPC client.
	metricsClient := otlpmetricgrpc.NewClient(
		otlpmetricgrpc.WithInsecure(),
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
	)

	// Set up the HTTP client instead. This one works.
	// Uncomment to switch.
	// metricsClient := otlpmetrichttp.NewClient(
	// 	otlpmetrichttp.WithInsecure(),
	// 	otlpmetrichttp.WithEndpoint("localhost:4318"),
	// )

	metricsExporter := otlpmetric.NewUnstarted(metricsClient)

	// Set up the controller.
	ctlr := controller.New(
		processor.New(
			simple.NewWithExactDistribution(),
			metricsExporter,
		),
		controller.WithExporter(metricsExporter),
		controller.WithCollectPeriod(2*time.Second),
	)

	// Set the global meter provider.
	otelmetricglobal.SetMeterProvider(ctlr.MeterProvider())

	// Start the controller and exporter.
	handleErr(ctlr.Start(context.Background()), "failed to start controller", logger)
	handleErr(metricsClient.Start(context.Background()), "failed to start exporter", logger)

	// Defer cleanup for when we are done.
	defer func() {
		// Push any last metric events to the exporter.
		handleErr(ctlr.Stop(context.Background()), "failed to stop controller", logger)
		handleErr(metricsExporter.Shutdown(context.Background()), "failed to stop metrics exporter", logger)
	}()

	// Start a ticker to record a heartbeat metric every 5 seconds
	// (or record some other metric of your choice).
	ticker := time.NewTicker(5 * time.Second)
	heartBeatTickerDone := make(chan struct{}, 1)
	defer ticker.Stop()

	metricMeter := createMeter()
	valueRecorder := metric.Must(metricMeter).NewInt64ValueRecorder("heartbeat")

	for {
		select {
		case <-heartBeatTickerDone:
			return
		case <-ticker.C:
			logger.Info("Creating metric")
			valueRecorder.Record(context.Background(), 1)
		}
	}
}

func handleErr(err error, message string, logger *zap.Logger) {
	if err != nil {
		logger.Error(message, zap.Error(err))
	}
}

func createMeter() metric.Meter {
	return otelmetricglobal.Meter("test-grpc-metric-expoter")
}
- Observe the errors above after running for about 2.5 hours.
Expected behavior
It should continue to happily report metrics to the otel collector for however long it runs.
By the way, this happens when running the container on GKE Kubernetes, but I cannot reproduce it on docker-desktop Kubernetes, which is quite strange.
This is also reproducible by using otlptracegrpc.New(ctx, otlptracegrpc.WithTimeout(some short timeout)) without the collector running. After a span ends, the exporter tries to connect, fails, and the log line is printed; see the sketch below.
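A minimal sketch of that trace-side reproduction, assuming no collector is listening on localhost:4317 and using an arbitrary, deliberately short 2-second timeout:

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// No collector is running on this endpoint; exports are expected to fail.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithInsecure(),
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithTimeout(2*time.Second), // deliberately short timeout
	)
	if err != nil {
		panic(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	tracer := otel.Tracer("repro")
	for i := 0; i < 10; i++ {
		_, span := tracer.Start(ctx, "heartbeat")
		time.Sleep(100 * time.Millisecond)
		span.End()
	}

	// Give the batch span processor time to attempt exports and log
	// "context deadline exceeded".
	time.Sleep(15 * time.Second)
}
```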
The connection handling for the otlpmetricgrpc client was recently refactored (https://github.com/open-telemetry/opentelemetry-go/pull/2425) to use the gRPC ClientConn to manage connections. This bug needs to be verified to still be relevant after those changes.
This is still a bug for otlpmetricgrpc too. The reason is that the current implementation of exporter.Export / client.UploadMetrics by default sends all metrics in a collection / snapshot in one ExportMetricsServiceRequest. If the amount of metrics is too large to finish sending within the pushTimeout, or if the request size exceeds the server's maximum receive message size (grpc.MaxRecvMsgSize; max_recv_msg_size_mib in the collector config), the request fails with a timeout error. The solution may be batching large payloads into multiple requests or using gRPC streaming; a possible client-side mitigation is sketched below.
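Until a proper fix lands, one mitigation is to loosen the per-push deadline and the message-size limits so a large snapshot can fit in a single request. A minimal sketch building on the reproduction snippet above, assuming otlpmetricgrpc.WithTimeout / WithDialOption and controller.WithPushTimeout as of v0.22.0; the 30s and 16 MiB values are arbitrary, and the collector's max_recv_msg_size_mib would have to be raised to match:

```go
// Additional import assumed: "google.golang.org/grpc"
metricsClient := otlpmetricgrpc.NewClient(
	otlpmetricgrpc.WithInsecure(),
	otlpmetricgrpc.WithEndpoint("localhost:4317"),
	// Give each UploadMetrics call a longer deadline before its context expires.
	otlpmetricgrpc.WithTimeout(30*time.Second),
	// Explicitly allow large outgoing requests; the collector's grpc receiver
	// max_recv_msg_size_mib must be raised to accept them.
	otlpmetricgrpc.WithDialOption(
		grpc.WithDefaultCallOptions(grpc.MaxCallSendMsgSize(16*1024*1024)),
	),
)

metricsExporter := otlpmetric.NewUnstarted(metricsClient)

ctlr := controller.New(
	processor.New(simple.NewWithExactDistribution(), metricsExporter),
	controller.WithExporter(metricsExporter),
	controller.WithCollectPeriod(2*time.Second),
	// Match the longer deadline in the controller's push cycle.
	controller.WithPushTimeout(30*time.Second),
)
```

This only masks very large or slow exports; it does not address the failure to reconnect described in the original report.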