opentelemetry-python

GRPC metric exporter doesn't reconnect

Open BalazsBago opened this issue 9 months ago • 9 comments

Describe your environment

OS: Ubuntu
Python version: Python 3.8
SDK version: 1.27.0
API version: 1.27.0
Exporter: 1.27.0

Endpoint: Telegraf, docker, 1.28, OpenTelemetry input (https://github.com/influxdata/telegraf/tree/master/plugins/inputs/opentelemetry)

What happened?

If the metric endpoint is not running when a "PeriodicExportingMetricReader" with an "OTLPMetricExporter" starts, the exporter cannot connect to it, even after the endpoint comes up. It keeps retrying the export, but without success:

Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to localhost:4317, retrying in 1s. ...

Steps to Reproduce

  1. Start a "PeriodicExportingMetricReader" with an "OTLPMetricExporter":
import time

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader


# Export every 5 seconds over OTLP/gRPC to an endpoint that is not running yet.
reader = PeriodicExportingMetricReader(
    exporter=OTLPMetricExporter(
        endpoint="http://localhost:4317/v1/metrics",
        timeout=60,
    ),
    export_interval_millis=5000,
)

metrics.set_meter_provider(
    MeterProvider(
        metric_readers=[reader],
    )
)

meter = metrics.get_meter(name='test')

# Record some data so the reader has something to export.
inst = meter.create_counter('counter')
inst.add(1)
inst.add(1)

# Keep the process alive so the reader keeps exporting while the endpoint is started.
time.sleep(120)
  2. Start Telegraf with the OpenTelemetry input. Config file (/tmp/tele.conf):
[global_tags]
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = ""
  omit_hostname = false
[[inputs.opentelemetry]]
[[outputs.file]]
  files = ["stdout"]
  data_format = "influx"

Start telegraf:

docker run --rm -it -p 4317:4317 -v /tmp/tele.conf:/etc/telegraf/telegraf.conf telegraf:1.28

Expected Result

The exporter should try to rebuild the connection to the endpoint in case of "StatusCode.UNAVAILABLE".

Actual Result

The exporter gets stuck in "StatusCode.UNAVAILABLE" status.

Additional context

No response

Would you like to implement a fix?

None

BalazsBago avatar Feb 17 '25 10:02 BalazsBago

I have the same problem with gRPC and reconnecting. The HTTP exporter works fine and survives the Otel-collector going up and down.
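For readers who want to try the HTTP exporter as a workaround, here is a minimal sketch. It assumes the `opentelemetry-exporter-otlp-proto-http` package is installed and that the receiver accepts OTLP over HTTP at `http://localhost:4318/v1/metrics`. The endpoint, the 10-second timeout, and the helper name `build_http_metric_reader` are illustrative assumptions, not taken from this thread.

```python
def build_http_metric_reader(endpoint="http://localhost:4318/v1/metrics",
                             export_interval_millis=5000):
    """Build a periodic reader that exports over OTLP/HTTP instead of gRPC.

    The endpoint default is an assumption: it only works if the receiver
    exposes an OTLP/HTTP listener on that address.
    """
    # Imported lazily so this module can be loaded (and the helper inspected)
    # even when the HTTP exporter package is not installed.
    from opentelemetry.exporter.otlp.proto.http.metric_exporter import (
        OTLPMetricExporter,
    )
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

    exporter = OTLPMetricExporter(endpoint=endpoint, timeout=10)
    return PeriodicExportingMetricReader(
        exporter=exporter,
        export_interval_millis=export_interval_millis,
    )
```

The returned reader can be passed to `MeterProvider(metric_readers=[...])` in place of the gRPC-backed reader from the reproduction script. Whether this helps depends on the receiver offering an HTTP listener at all.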

OndraVoves avatar Feb 17 '25 12:02 OndraVoves

There is a repro repo for a similar case.

OndraVoves avatar Feb 17 '25 13:02 OndraVoves

I think this might be related to #4429.

rjduffner avatar Feb 18 '25 23:02 rjduffner

I am facing the same issue. I do not think that fixing #4429 solves it, because I am able to reproduce my issues even when running with opentelemetry-exporter-otlp==1.31.1, which contains the fixes for #4429.

I am also able to reproduce the same symptoms if the metrics endpoint (in this case the Telegraf agent's OpenTelemetry input plugin) is available at the beginning of the test, but then shut down for a few seconds during the test. After the metric endpoint is restarted, the OpenTelemetry exporter continues to fail with "Transient error StatusCode.UNAVAILABLE encountered while exporting metrics" messages.

I suspect that something is not right with the retry logic or usage of gRPC channels in opentelemetry.exporter.otlp.proto.grpc.exporter.OTLPExporterMixin._export. I monitored the network traffic between my Python code and the metrics endpoint with Wireshark while the exporter was failing, and the only network traffic I saw was TCP keep-alives. I did not see any attempt to send actual metrics.
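As a lighter-weight complement to a packet capture, a plain TCP probe can confirm that the endpoint is accepting connections again while the exporter still logs UNAVAILABLE. This is only a sketch: the helper name `endpoint_accepts_tcp` is hypothetical, and a successful TCP connect shows that something is listening on the port, not that it speaks OTLP/gRPC.

```python
import socket


def endpoint_accepts_tcp(host="localhost", port=4317, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and resolution failures.
        return False


if __name__ == "__main__":
    print("localhost:4317 accepts TCP:", endpoint_accepts_tcp())
```

If this reports that the port is reachable while the exporter keeps failing, the problem is more likely in the client-side channel than in the endpoint.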

rmelick-muon avatar Mar 28 '25 13:03 rmelick-muon

I'm also seeing this when using auto instrumentation with FastAPI. My application seems unable to reconnect to the local gRPC sink as described by @rmelick-muon.

nat45928 avatar Mar 28 '25 15:03 nat45928

I see the same issue with opentelemetry-exporter-otlp==1.31.0.

https://github.com/open-telemetry/opentelemetry-python/issues/4517

caydenwei avatar Mar 31 '25 09:03 caydenwei

I believe the issue may be due to: https://github.com/grpc/grpc/issues/38290

lambdal-dean avatar Apr 02 '25 15:04 lambdal-dean

This is a big issue. I cannot afford losing all metrics of running pods just because of a Grafana Alloy upgrade.

david-gang avatar Apr 06 '25 10:04 david-gang

I believe the issue may be due to: grpc/grpc#38290

I agree with @lambdal-dean. I've been pinning grpcio==1.67.1 for months and following that issue. I tried the latest 1.71.0 and it's still not reconnecting.

gg-kialo avatar Apr 08 '25 13:04 gg-kialo

I believe this issue is now fixed in v1.35.0, which contains this fix https://github.com/open-telemetry/opentelemetry-python/pull/4564

I gave it a try with v1.35.0 and it's now working for me as well.
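To check whether an environment already has the versions discussed here, a small sketch using only the standard library is shown below. The helper names are hypothetical, and the version parsing is deliberately naive: it compares numeric components only, so a pre-release such as `1.35.0rc1` is treated like `1.35.0`.

```python
from importlib import metadata  # available in Python 3.8+


def parse_version(text):
    """Turn '1.35.0' into (1, 35, 0), padding short versions with zeros."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as '0rc1'
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_at_least(installed, required):
    """Return True if the installed version is >= the required version."""
    return parse_version(installed) >= parse_version(required)


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for dist in ("opentelemetry-exporter-otlp", "grpcio"):
        found = installed_version(dist)
        print(dist, found or "not installed")
        if dist == "opentelemetry-exporter-otlp" and found:
            print("  at least 1.35.0:", is_at_least(found, "1.35.0"))
```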

emdneto avatar Jul 17 '25 01:07 emdneto

Can someone please confirm if this is fixed following #4564 and https://github.com/grpc/grpc/issues/38290?

aabmass avatar Jul 17 '25 19:07 aabmass

Works for me. Thanks!

osminogin avatar Jul 18 '25 07:07 osminogin

Seems like this was a gRPC issue, so I'm going to close this out. Please re-open or create a new issue if needed.

aabmass avatar Aug 13 '25 20:08 aabmass