Transient error StatusCode.UNAVAILABLE
Describe your environment
OS: (e.g, Ubuntu)
Python version: 3.10.9
SDK version: 1.31.0
API version: 1.31.0
OpenTelemetry Collector: 0.115.1
Our application runs as a Kubernetes StatefulSet with 200 replicas using PeriodicExportingMetricReader for metrics export. During OpenTelemetry Collector redeployments, a subset of replicas persistently log:
Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s.
These replicas never re-establish the connection after the collector recovers; they stay in a permanent retry state even though the collector service is back. Restarting the affected application instance makes it recover.
import os

import psutil
from opentelemetry import trace, metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.tornado import TornadoInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.metrics import CallbackOptions, Observation
from opentelemetry.sdk.environment_variables import OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, \
    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.semconv.resource import ResourceAttributes

# Default to a silent console exporter; use OTLP over gRPC when a metrics endpoint is configured.
otel_metrics_exporter = ConsoleMetricExporter(out=open(os.devnull, 'w'), formatter=lambda metrics_data: "")
if os.getenv(OTEL_EXPORTER_OTLP_METRICS_ENDPOINT, None):
    otel_metrics_exporter = OTLPMetricExporter(
        insecure=True,
        max_export_batch_size=512
    )
# Collect and export every 15 seconds.
otel_metrics_reader = PeriodicExportingMetricReader(otel_metrics_exporter, export_interval_millis=15000)
metrics.set_meter_provider(
    MeterProvider(
        resource=Resource.create(attributes={
            ResourceAttributes.SERVICE_NAME: SERVICE_NAME,
            ResourceAttributes.SERVICE_INSTANCE_ID: EG_REPLICA_ID,
            ResourceAttributes.SERVICE_NAMESPACE: DEPLOYMENT_ENV
        }),
        metric_readers=[otel_metrics_reader]
    )
)
otel_meter = metrics.get_meter(__name__)

def _net_connections_established(options: CallbackOptions):
    # Count inet sockets currently in the ESTABLISHED state.
    connections = psutil.net_connections(kind='inet')
    established = sum(1 for conn in connections if conn.status == 'ESTABLISHED')
    yield Observation(int(established), {})


NET_CONNECTIONS_ESTABLISHED = otel_meter.create_observable_gauge(
    'net_connections_established',
    unit='1',
    callbacks=[_net_connections_established],
    description='Current established connections count',
)
What happened?
The warning "Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s" keeps repeating, and the exporter does not recover unless I restart the instance.
Steps to Reproduce
It happens occasionally, during OpenTelemetry Collector redeployments.
Expected Result
The exporter recovers automatically once the collector is available again.
Actual Result
The warning "Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to opentelemetry-collector.monitor.svc.cluster.local:4317, retrying in 8s" keeps repeating, and the exporter does not recover unless I restart the instance (the application instance, not the OpenTelemetry Collector).
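One way to narrow this down from inside an affected replica is to check whether the collector port accepts a plain TCP connection at all. The sketch below uses only the standard library; the helper name collector_reachable and the 3-second timeout are illustrative assumptions, not part of the OpenTelemetry SDK. If this check succeeds while the exporter still reports UNAVAILABLE, that would suggest the problem sits in the client's existing gRPC channel rather than in the collector or the network, though it does not prove it.

```python
import socket


def collector_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    Hypothetical diagnostic helper: it only tests TCP reachability and
    says nothing about gRPC or OTLP health.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call, using the endpoint from the log message in this report:
# collector_reachable("opentelemetry-collector.monitor.svc.cluster.local", 4317)
```

Running the call from a stuck replica and from a freshly restarted one and comparing the results may help show whether both see the same network path.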
Additional context
No response
Would you like to implement a fix?
None
Same on my end: in this scenario none of the three signals (metrics, logs, traces) ever recovered. I had to restart my app.
[2025-06-05 17:44:39,072] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140154689799744]: Transient error StatusCode.UNAVAILABLE encountered while exporting logs to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:39,322] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231058 140154689799744]: Transient error StatusCode.UNAVAILABLE encountered while exporting logs to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:43,214] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231107 140154438149696]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 16s.
[2025-06-05 17:44:49,306] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231107 140154421364288]: Transient error StatusCode.UNAVAILABLE encountered while exporting logs to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:50,389] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231058 140154698192448]: Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:50,905] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140155283281472]: Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:50,925] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231107 140154429756992]: Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:44:59,244] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231107 140154438149696]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 32s.
[2025-06-05 17:45:05,977] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140154698192448]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 1s.
[2025-06-05 17:45:06,486] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231058 140154706585152]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 1s.
[2025-06-05 17:45:06,984] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140154698192448]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 2s.
[2025-06-05 17:45:07,490] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231058 140154706585152]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 2s.
[2025-06-05 17:45:08,993] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140154698192448]: Transient error StatusCode.UNAVAILABLE encountered while exporting traces to 192.168.66.178:4317, retrying in 4s.
[2025-06-05 17:45:11,107] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231039 140154689799744]: Transient error StatusCode.UNAVAILABLE encountered while exporting logs to 192.168.66.178:4317, retrying in 1s.
[2025-06-05 17:45:11,324] [WARNING] in [opentelemetry.exporter.otlp.proto.grpc.exporter 231058 140154689799744]: Transient error StatusCode.UNAVAILABLE encountered while exporting logs to 192.168.66.178:4317, retrying in 1s.
(I included a larger block of logs to make the retry intervals and the exponential backoff resetting more visible.)
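The delays in the log (1s, 2s, 4s, ... 32s, then back to 1s) look like a doubling backoff that starts over for each new export attempt. The generator below is a minimal sketch of that pattern, not the exporter's actual implementation; the name exp_backoff and the 64-second cap are assumptions made for illustration.

```python
from itertools import islice
from typing import Iterator


def exp_backoff(max_delay_s: int = 64) -> Iterator[int]:
    """Yield 1, 2, 4, ... doubling until max_delay_s, then keep yielding max_delay_s."""
    delay = 1
    while True:
        yield min(delay, max_delay_s)
        if delay < max_delay_s:
            delay *= 2


# Delays for the first six retries of one export attempt.
first_six = list(islice(exp_backoff(), 6))
print(first_six)  # [1, 2, 4, 8, 16, 32]

# A new export attempt creates a fresh generator, so the delay starts at 1 again.
next_attempt = exp_backoff()
print(next(next_attempt))  # 1
```

Under this model, a drop from "retrying in 32s" to "retrying in 1s" means the previous export gave up and a new one began, not that the connection was re-established.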
Python 3.10.12
grpcio==1.71.0
opentelemetry-api==1.33.0
opentelemetry-distro==0.54b0
opentelemetry-exporter-otlp==1.33.0
opentelemetry-exporter-otlp-proto-common==1.33.0
opentelemetry-exporter-otlp-proto-grpc==1.33.0
opentelemetry-exporter-otlp-proto-http==1.33.0
opentelemetry-instrumentation==0.54b0
opentelemetry-instrumentation-aiohttp-server==0.54b0
opentelemetry-instrumentation-django==0.54b0
opentelemetry-instrumentation-wsgi==0.54b0
opentelemetry-proto==1.33.0
opentelemetry-sdk==1.33.0
opentelemetry-semantic-conventions==0.54b0
opentelemetry-util-http==0.54b0
As a workaround until the underlying issue https://github.com/grpc/grpc/issues/38290 is resolved, running pip install 'grpcio<1.68' fixed the problem for me.
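For anyone applying the same pin across many services, a small startup check can flag environments that still run an unpinned grpcio. This is a sketch under one assumption: the 1.68 boundary is taken only from the pip pin above, not from a confirmed list of affected releases, so releases that already contain the upstream fix would be flagged unnecessarily. The function names are hypothetical.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.71.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_grpcio_pin(installed: str, first_affected: Tuple[int, int] = (1, 68)) -> bool:
    """Return True if the installed version is at or above the assumed boundary."""
    return parse_major_minor(installed) >= first_affected


# Typical use at application startup (requires grpcio to be installed):
# from importlib.metadata import version
# if needs_grpcio_pin(version("grpcio")):
#     print("grpcio is not pinned below 1.68; see grpc/grpc issue 38290")
```

With the versions listed earlier in this thread, needs_grpcio_pin("1.71.0") returns True, while a pinned install such as "1.67.1" returns False.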
FYI, it looks like that underlying issue is now resolved.