opentelemetry-python

Transient error StatusCode.UNAVAILABLE

Open caydenwei opened this issue 8 months ago • 3 comments

Describe your environment

OS: (e.g, Ubuntu)
Python version: 3.10.9
SDK version: 1.31.0
API version: 1.31.0
OpenTelemetry Collector: 0.115.1

Our application runs as a Kubernetes StatefulSet with 200 replicas and uses PeriodicExportingMetricReader to export metrics. During OpenTelemetry Collector redeployments, a subset of replicas persistently log:

Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s.

These replicas fail to re-establish the connection after the collector recovers and stay in a permanent retry state even though the collector service is available again. Restarting the application instance restores the export.

import os

import psutil
from opentelemetry import trace, metrics
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.tornado import TornadoInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.environment_variables import OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, \
    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.semconv.resource import ResourceAttributes

otel_metrics_exporter = ConsoleMetricExporter(out=open(os.devnull, 'w'), formatter=lambda metrics_data: "")
if os.getenv(OTEL_EXPORTER_OTLP_METRICS_ENDPOINT, None):
    otel_metrics_exporter = OTLPMetricExporter(
        insecure=True,
        max_export_batch_size=512
    )

otel_metrics_reader = PeriodicExportingMetricReader(otel_metrics_exporter, export_interval_millis=15000)
metrics.set_meter_provider(
    MeterProvider(
        resource=Resource.create(attributes={
            ResourceAttributes.SERVICE_NAME: SERVICE_NAME,
            ResourceAttributes.SERVICE_INSTANCE_ID: EG_REPLICA_ID,
            ResourceAttributes.SERVICE_NAMESPACE: DEPLOYMENT_ENV
        }),
        metric_readers=[otel_metrics_reader]
    )
)

otel_meter = metrics.get_meter(__name__)


def _net_connections_established(options: CallbackOptions):
    connections = psutil.net_connections(kind='inet')
    established = sum(1 for conn in connections if conn.status == 'ESTABLISHED')
    yield Observation(int(established), {})


NET_CONNECTIONS_ESTABLISHED = otel_meter.create_observable_gauge(
    'net_connections_established',
    unit='1',
    callbacks=[_net_connections_established],
    description='Current established connections count',
)

What happened?

The warning "Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s" keeps recurring and never clears unless I restart the instance.

Steps to Reproduce

Not reliably reproducible. It happens occasionally during collector redeployments.

Expected Result

The exporter recovers automatically once the collector is available again.

Actual Result

The warning "Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s" keeps recurring and never clears unless I restart the instance (the application instance, not the OpenTelemetry Collector instance).
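Because affected replicas only show the problem in their logs, one way to spot them is to count these warnings inside the process. The following is a minimal sketch, not part of the SDK: it assumes the warning is emitted through a standard-library logger named after the exporter module (`opentelemetry.exporter.otlp.proto.grpc.exporter`), and `TransientErrorCounter` and `install_counter` are hypothetical names.

```python
import logging

# Assumption: the exporter logs through a logger named after its module.
EXPORTER_LOGGER = "opentelemetry.exporter.otlp.proto.grpc.exporter"


class TransientErrorCounter(logging.Handler):
    """Count 'Transient error' warnings seen on the exporter logger."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        # Only count the retry warning quoted in this issue.
        if "Transient error" in record.getMessage():
            self.count += 1


def install_counter() -> TransientErrorCounter:
    """Attach a counter to the exporter logger and return it."""
    handler = TransientErrorCounter()
    logging.getLogger(EXPORTER_LOGGER).addHandler(handler)
    return handler
```

A count that keeps growing long after the collector is back would mark a replica as stuck, for example by surfacing `counter.count` in a health check. This only detects the state; it does not reconnect the exporter.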

Additional context

No response

Would you like to implement a fix?

None

caydenwei commented on Mar 31 '25 08:03

None of the three signals (metrics, logs, traces) recovered in this scenario on my end either. I had to restart my app.

[2025-06-05 17:44:39,072] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140154689799744]: Transient error StatusCode.UNAVAILABLE encountered while exporting logs to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:39,322] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231058 140154689799744]: Transient error StatusCode.UNAVAILABLE encountered while exporting logs to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:43,214] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231107 140154438149696]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 16s.
[2025-06-05 17:44:49,306] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231107 140154421364288]: Transient error StatusCode.UNAVAILABLE encountered while exporting logs to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:50,389] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231058 140154698192448]: Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:50,905] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140155283281472]: Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:50,925] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231107 140154429756992]: Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:59,244] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231107 140154438149696]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:45:05,977] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140154698192448]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 1s.
[2025-06-05 17:45:06,486] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231058 140154706585152]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 1s.
[2025-06-05 17:45:06,984] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140154698192448]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 2s.
[2025-06-05 17:45:07,490] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231058 140154706585152]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 2s.
[2025-06-05 17:45:08,993] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140154698192448]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 4s.
[2025-06-05 17:45:11,107] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140154689799744]: Transient error StatusCode.UNAVAILABLE encountered while exporting logs to 192.168.66.178:4317, retrying in 1s.
[2025-06-05 17:45:11,324] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231058 140154689799744]: Transient error StatusCode.UNAVAILABLE encountered while exporting logs to 192.168.66.178:4317, retrying in 1s.

(I included a larger block so that the retry intervals and the exponential backoff resetting are easier to see.)
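The 1s, 2s, 4s ... 32s delays in the log, followed by a drop back to 1s on the same thread, are consistent with each export call running its own doubling backoff and the next call starting over from the beginning. A minimal sketch of that pattern, assuming a 32s cap (the largest delay that appears in the log); `backoff_delays` and `simulate_export` are hypothetical names for illustration, not the exporter's actual code:

```python
from typing import Callable, Iterator


def backoff_delays(max_delay: int = 32) -> Iterator[int]:
    """Yield 1, 2, 4, ... seconds, stopping after max_delay is yielded."""
    delay = 1
    while delay <= max_delay:
        yield delay
        delay *= 2


def simulate_export(attempt_succeeds: Callable[[], bool],
                    max_delay: int = 32) -> list[str]:
    """Simulate one export call and return the retry messages it would log.

    attempt_succeeds is a stand-in for the real gRPC export attempt.
    """
    messages = []
    for delay in backoff_delays(max_delay):
        if attempt_succeeds():
            return messages
        messages.append(f"retrying in {delay}s")
        # A real exporter would sleep for `delay` seconds here.
    # Delays exhausted: this call gives up; the next call starts at 1s again.
    return messages
```

With an endpoint that never answers, `simulate_export(lambda: False)` produces the full 1s to 32s sequence, and calling it again restarts at 1s, which matches the reset visible in the log.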

Python 3.10.12

grpcio==1.71.0
opentelemetry-api==1.33.0
opentelemetry-distro==0.54b0
opentelemetry-exporter-otlp==1.33.0
opentelemetry-exporter-otlp-proto-common==1.33.0
opentelemetry-exporter-otlp-proto-grpc==1.33.0
opentelemetry-exporter-otlp-proto-http==1.33.0
opentelemetry-instrumentation==0.54b0
opentelemetry-instrumentation-aiohttp-server==0.54b0
opentelemetry-instrumentation-django==0.54b0
opentelemetry-instrumentation-wsgi==0.54b0
opentelemetry-proto==1.33.0
opentelemetry-sdk==1.33.0
opentelemetry-semantic-conventions==0.54b0
opentelemetry-util-http==0.54b0

devmonkey22 commented on Jun 05 '25 18:06

As a workaround until the underlying issue, https://github.com/grpc/grpc/issues/38290, is resolved, running pip install 'grpcio<1.68' fixed the problem for me for now.
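To check whether a running environment already satisfies that pin, the installed grpcio version can be read with the standard library. This is a rough sketch: the 1.68 boundary is taken only from the pip command above, not from a confirmed list of affected releases, and the function names are hypothetical.

```python
from importlib import metadata
from typing import Optional


def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '1.71.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def satisfies_workaround_pin(version: str) -> bool:
    """True if the version falls under the 'grpcio<1.68' pin."""
    return parse_major_minor(version) < (1, 68)


def installed_grpcio_version() -> Optional[str]:
    """Return the installed grpcio version, or None if it is not installed."""
    try:
        return metadata.version("grpcio")
    except metadata.PackageNotFoundError:
        return None
```

For the versions mentioned in this thread, `satisfies_workaround_pin("1.71.0")` is False, so that environment would need the downgrade for the workaround to apply.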

devmonkey22 commented on Jun 06 '25 15:06

Looks like that underlying issue is now resolved FYI

chrisjbremner commented on Oct 21 '25 23:10