GRPC metric exporter doesn't reconnect
Describe your environment
OS: Ubuntu
Python version: Python 3.8
SDK version: 1.27.0
API version: 1.27.0
Exporter: 1.27.0
Endpoint: Telegraf (Docker, 1.28), OpenTelemetry input (https://github.com/influxdata/telegraf/tree/master/plugins/inputs/opentelemetry)
What happened?
If the metric endpoint is not available when a "PeriodicExportingMetricReader" with an "OTLPMetricExporter" starts, the exporter cannot connect to it, even after the endpoint comes up. It keeps retrying the export, but without success:
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to localhost:4317, retrying in 1s. ...
Steps to Reproduce
- Start a "PeriodicExportingMetricReader" with an "OTLPMetricExporter"
import time

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    exporter=OTLPMetricExporter(
        endpoint="http://localhost:4317/v1/metrics",
        timeout=60,
    ),
    export_interval_millis=5000,
)
metrics.set_meter_provider(
    MeterProvider(
        metric_readers=[reader],
    )
)
meter = metrics.get_meter(name='test')
inst = meter.create_counter('counter')
inst.add(1)
inst.add(1)
# Keep the process alive so the reader keeps exporting while Telegraf is started.
time.sleep(120)
- Start Telegraf with the OpenTelemetry input enabled. Config file (/tmp/tele.conf):
[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = ""
  omit_hostname = false

[[inputs.opentelemetry]]

[[outputs.file]]
  files = ["stdout"]
  data_format = "influx"
Start Telegraf:
docker run --rm -it -p 4317:4317 -v /tmp/tele.conf:/etc/telegraf/telegraf.conf telegraf:1.28
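When reproducing this, it can help to confirm that the endpoint's port really is accepting connections before concluding that the exporter is stuck. The following is a minimal sketch using only the standard library. The helper name `wait_for_port` and its parameters are made up for illustration and are not part of OpenTelemetry or Telegraf. Note that a successful TCP connect only shows that something is listening on the port, not that the gRPC service behind it is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for the repro above; not an OpenTelemetry API.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A plain TCP connect; it is closed again immediately.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused, timed out, or host unreachable: retry below.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Port 4317 is the one published by the docker command above.
    print("endpoint reachable:", wait_for_port("localhost", 4317, timeout_s=5.0))
```

If this reports the port as reachable while the exporter still logs StatusCode.UNAVAILABLE, the problem is on the exporter side rather than with the container startup.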
Expected Result
The exporter should try to rebuild the connection to the endpoint in case of "StatusCode.UNAVAILABLE".
Actual Result
The exporter gets stuck in "StatusCode.UNAVAILABLE" status.
Additional context
No response
Would you like to implement a fix?
None
I have the same problem with the gRPC exporter and reconnecting. The HTTP exporter works fine and survives the OTel Collector going up and down.
There is a repro repo for a similar case.
I think this might be related to #4429.
I am facing the same issue. I do not think that fixing #4429 solves it, because I am able to reproduce my issues even when running with opentelemetry-exporter-otlp==1.31.1, which contains the fixes for #4429.
I am also able to reproduce the same symptoms if the metrics endpoint (in this case the Telegraf agent's OpenTelemetry input plugin) is available at the beginning of the test, but is then shut down for a few seconds during the test. After the metrics endpoint is restarted, the OpenTelemetry exporter continues to fail with "Transient error StatusCode.UNAVAILABLE encountered while exporting metrics" messages.
I suspect that something is not right with the retry logic or the usage of gRPC channels in opentelemetry.exporter.otlp.proto.grpc.exporter.OTLPExporterMixin._export. I monitored the network traffic between my Python code and the metrics endpoint with Wireshark while the exporter was failing, and the only network traffic I saw was TCP keep-alives. I did not see any attempt to actually send metrics.
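To make the behavior requested under "Expected Result" concrete, here is a rough sketch of a retry loop that rebuilds its connection after every unavailable error instead of reusing the old one. It is not the exporter's actual code. `Unavailable`, `export_with_reconnect`, and `make_client` are hypothetical names, and the client is modeled as a plain callable so the example runs without gRPC installed.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Unavailable(Exception):
    """Hypothetical stand-in for a StatusCode.UNAVAILABLE failure."""


def export_with_reconnect(
    make_client: Callable[[], Callable[[bytes], T]],
    payload: bytes,
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Send payload, creating a fresh client after each Unavailable error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    client = make_client()
    for attempt in range(max_attempts):
        try:
            return client(payload)
        except Unavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff: base, 2*base, 4*base, ...
            sleep(base_delay_s * 2 ** attempt)
            # Assumption being illustrated: drop the stale connection and
            # build a new one rather than retrying on the same object.
            client = make_client()
```

The `sleep` parameter is injectable only so the loop can be exercised quickly in a test. In a real exporter the factory would create the gRPC channel and stub.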
I'm also seeing this when using auto instrumentation with FastAPI. My application seems unable to reconnect to the local gRPC sink as described by @rmelick-muon.
I see the same issue with opentelemetry-exporter-otlp==1.31.0.
https://github.com/open-telemetry/opentelemetry-python/issues/4517
I believe the issue may be due to: https://github.com/grpc/grpc/issues/38290
This is a big issue. I cannot afford losing all metrics of running pods just because of a Grafana Alloy upgrade.
I agree with @lambdal-dean, I've been pinning grpcio==1.67.1 for months and following that issue. I tried the latest 1.71.0 and it's still not reconnecting.
I believe this issue is now fixed in v1.35.0, which contains this fix https://github.com/open-telemetry/opentelemetry-python/pull/4564
I gave it a try with v1.35.0 and it's now working for me as well.
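For anyone checking whether their environment already has the release mentioned above, here is a small sketch that reports the installed versions using only the standard library. The helper names `installed_versions` and `at_least` are made up for this example. Checking the `opentelemetry-exporter-otlp-proto-grpc` distribution (the gRPC exporter package) is my assumption about which package is relevant, and the 1.35.0 threshold is simply the version reported as working in this thread.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple


def installed_versions(
    packages: Tuple[str, ...] = (
        "grpcio",
        "opentelemetry-exporter-otlp-proto-grpc",
    ),
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def at_least(installed: str, minimum: str) -> bool:
    """Compare the first three numeric components of two version strings."""
    def parse(text: str) -> Tuple[int, ...]:
        return tuple(int(part) for part in re.findall(r"\d+", text)[:3])
    return parse(installed) >= parse(minimum)


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
    exporter = installed_versions()["opentelemetry-exporter-otlp-proto-grpc"]
    if exporter is not None:
        # 1.35.0 is the release reported as fixed in this thread.
        print("exporter >= 1.35.0:", at_least(exporter, "1.35.0"))
```

The comparison is deliberately simple and ignores pre-release suffixes, so treat it as a quick check rather than a full version parser.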
Can someone please confirm if this is fixed following #4564 and https://github.com/grpc/grpc/issues/38290?
Works for me. Thanks!
Seems like this was a gRPC issue, so I'm going to close this out. Please re-open or create a new issue if needed.