opentelemetry-python-contrib

Random connection reset errors affecting Celery

Open danw-mpl opened this issue 1 year ago • 1 comment

Describe your environment

  • Centralised AWS OTEL Collector (latest, although same issue with older versions)
  • AWS Application Load Balancer fronting the collector
  • The collector debug logs contain lines such as "loopyWriter exiting with error: transport closed by client"
  • Only affects Celery workers in my stack, Gunicorn and others are unaffected
  • Python 3.9
  • Current versions of OpenTelemetry:
opentelemetry-api==1.24.0
opentelemetry-distro==0.45b0
opentelemetry-exporter-otlp==1.24.0
opentelemetry-exporter-otlp-proto-common==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
opentelemetry-instrumentation==0.45b0
opentelemetry-instrumentation-botocore==0.45b0
opentelemetry-instrumentation-celery==0.45b0
opentelemetry-instrumentation-dbapi==0.45b0
opentelemetry-instrumentation-django==0.45b0
opentelemetry-instrumentation-logging==0.45b0
opentelemetry-instrumentation-psycopg2==0.45b0
opentelemetry-instrumentation-redis==0.45b0
opentelemetry-instrumentation-requests==0.45b0
opentelemetry-instrumentation-wsgi==0.45b0
opentelemetry-propagator-aws-xray==1.0.1
opentelemetry-proto==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-sdk-extension-aws==2.0.1
opentelemetry-semantic-conventions==0.45b0
opentelemetry-util-http==0.45b0

Steps to reproduce

Run a task on a Celery worker with opentelemetry-instrument.

What is the expected behavior?

No errors reported.

What is the actual behavior?

Any task a Celery worker executes results in an HTTP connection reset error (or the gRPC equivalent), but the traces are still sent successfully.

Additional context

I'm not getting these errors on non-Celery processes such as Gunicorn.

It's incredibly challenging to diagnose this issue, so I'm not certain whether the problem lies in my stack or in how Celery handles auto-instrumentation.
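One way to narrow this down might be to record which process emits the exporter warnings, and whether that process is a forked child of the one where logging was set up. The following is only a diagnostic sketch using the standard library. The class and function names are hypothetical, and the logger name "opentelemetry" and the "Transient error" message prefix are assumptions based on the client log line, not confirmed exporter internals.

```python
import logging
import os
import sys


class ForkAwareExportErrorHandler(logging.Handler):
    """Hypothetical helper: records exporter warnings with process information."""

    def __init__(self, needle="Transient error"):
        super().__init__(level=logging.WARNING)
        self.needle = needle
        # PID of the process that installed the handler (before any fork).
        self.installed_pid = os.getpid()
        # Each entry is (pid, emitted_in_forked_child, message).
        self.events = []

    def emit(self, record):
        message = record.getMessage()
        if self.needle not in message:
            return
        pid = os.getpid()
        in_child = pid != self.installed_pid
        self.events.append((pid, in_child, message))
        # The events list is per-process, so also write to stderr
        # where output from forked workers remains visible.
        print(f"[pid={pid} forked_child={in_child}] {message}", file=sys.stderr)


def install_handler(logger_name="opentelemetry"):
    """Attach the handler to a parent logger (the name is an assumption)."""
    handler = ForkAwareExportErrorHandler()
    logging.getLogger(logger_name).addHandler(handler)
    return handler
```

If the warnings only ever show forked_child=True, that would point towards state inherited across the fork rather than the load balancer or collector, although it would not prove it.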

Anyone else seen this issue?

danw-mpl, Apr 03 '24 15:04

Sadly, this is still ongoing. The client-side Python logs contain lines such as "Transient error StatusCode.UNAVAILABLE encountered while exporting traces to ...".
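A "transient error" message alongside traces that still arrive would be consistent with the exporter retrying a failed export and succeeding on a later attempt. The sketch below illustrates that general retry-with-exponential-backoff pattern. It is not the exporter's actual code, and every name in it (TransientExportError, export_with_retry, send) is hypothetical.

```python
import time


class TransientExportError(Exception):
    """Stand-in for a retryable failure such as gRPC UNAVAILABLE."""


def export_with_retry(send, payload, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(payload), retrying transient failures with doubling delays."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except TransientExportError:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2
```

Under this pattern, a connection that was closed while idle fails on the first attempt, the failure is logged, and the retry over a fresh connection delivers the data.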

Any ideas would be greatly appreciated!

danw-mpl, Jul 26 '24 10:07