Fix: Reinitialize gRPC channel on UNAVAILABLE error (Fixes #4517) (Fixes #4529)

Open dheeraj-vanamala opened this issue 2 weeks ago • 2 comments

Description

This PR fixes #4517, where the OTLP gRPC exporter fails to reconnect after the collector restarts and exports keep failing with StatusCode.UNAVAILABLE.

Changes:

Added detection of StatusCode.UNAVAILABLE in the export loop.
Added logic to close the existing channel and re-initialize it before retrying.
Added a regression test, test_unavailable_reconnects, to verify the reconnection behavior.
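
For readers who have not opened the diff, the close-and-reinitialize step can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `StatusCode` and `RpcError` are stdlib stand-ins for `grpc.StatusCode` and `grpc.RpcError`, and `ReconnectingExporter`, `FlakyChannel`, `channel_factory`, `max_attempts`, and `backoff_s` are hypothetical names chosen for the example.

```python
import enum
import time


class StatusCode(enum.Enum):
    """Stand-in for grpc.StatusCode; only the members this sketch needs."""
    OK = 0
    UNAVAILABLE = 14


class RpcError(Exception):
    """Stand-in for grpc.RpcError that carries a status code."""

    def __init__(self, code):
        super().__init__(code.name)
        self._code = code

    def code(self):
        return self._code


class ReconnectingExporter:
    """Hypothetical exporter that rebuilds its channel on UNAVAILABLE."""

    def __init__(self, channel_factory, max_attempts=3, backoff_s=0.0):
        self._channel_factory = channel_factory
        self._max_attempts = max_attempts
        self._backoff_s = backoff_s
        self._channel = channel_factory()

    def export(self, payload):
        for attempt in range(self._max_attempts):
            try:
                self._channel.send(payload)
                return True
            except RpcError as error:
                if error.code() is not StatusCode.UNAVAILABLE:
                    return False  # treated as non-retryable in this sketch
                # Drop the possibly stale channel and build a fresh one.
                self._channel.close()
                self._channel = self._channel_factory()
                if attempt + 1 < self._max_attempts:
                    time.sleep(self._backoff_s * (2 ** attempt))
        return False


class FlakyChannel:
    """Fake channel (assumption): raises UNAVAILABLE unless healthy."""

    def __init__(self, healthy):
        self.healthy = healthy
        self.closed = False

    def send(self, payload):
        if not self.healthy:
            raise RpcError(StatusCode.UNAVAILABLE)

    def close(self):
        self.closed = True
```

In this sketch the first channel stays broken, as it would after a collector restart, and only a newly created channel succeeds. The exporter therefore recovers on its second attempt instead of retrying on the stale channel until it gives up.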

Fixes #4517

Type of change

[x] Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I added a new regression test case test_unavailable_reconnects in exporter/opentelemetry-exporter-otlp-proto-grpc/tests/test_otlp_exporter_mixin.py.

[x] test_unavailable_reconnects: Verifies that the exporter closes and re-initializes the gRPC channel when the server returns StatusCode.UNAVAILABLE.
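
The shape of such a regression test can be sketched with `unittest.mock`. This is an illustration under stated assumptions, not the test added in this PR: `Unavailable` stands in for a `grpc.RpcError` whose `code()` is `StatusCode.UNAVAILABLE`, and `export_with_reconnect` is a hypothetical unit under test, not the exporter mixin.

```python
import unittest
from unittest import mock


class Unavailable(Exception):
    """Stand-in for a grpc.RpcError reporting StatusCode.UNAVAILABLE."""


def export_with_reconnect(make_channel, payload, max_attempts=2):
    """Hypothetical unit under test: rebuild the channel after Unavailable."""
    channel = make_channel()
    for _ in range(max_attempts):
        try:
            channel.send(payload)
            return True
        except Unavailable:
            # Close the stale channel and retry on a fresh one.
            channel.close()
            channel = make_channel()
    return False


class TestUnavailableReconnects(unittest.TestCase):
    def test_unavailable_reconnects(self):
        stale = mock.Mock()
        stale.send.side_effect = Unavailable()
        fresh = mock.Mock()
        make_channel = mock.Mock(side_effect=[stale, fresh])

        self.assertTrue(export_with_reconnect(make_channel, b"span"))

        # The stale channel was closed and a second channel was created.
        stale.close.assert_called_once_with()
        self.assertEqual(make_channel.call_count, 2)
        fresh.send.assert_called_once_with(b"span")
```

The assertions check the two observable effects named above: the old channel is closed, and the retry goes through a newly created channel.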

Does This PR Require a Contrib Repo Change?

[x] No.

Checklist:

[x] Followed the style guidelines of this project
[ ] Changelogs have been updated
[x] Unit tests have been added
[ ] Documentation has been updated

Nov 30 '25 15:11 dheeraj-vanamala

The committers listed above are authorized under a signed CLA.

:white_check_mark: login: dheeraj-vanamala / name: Dheeraj Vanamala (2c848f44662785ce7c41216150c84f8df3433e63, 31594a3c25ebdfe73752f23303b234240dbf36c5, 436ecc98da12c702dc9579a3792b9e601b32f528, 8b397a775fb6dfced456d2ef6599eff79dc751c9)

Nov 30 '25 15:11 linux-foundation-easycla[bot]

I understand this issue is related to the upstream gRPC bug (grpc/grpc#38290).

I've analyzed that issue in depth, and the root cause appears to be a regression, present in grpcio>=1.68.0, in the gRPC 'backup poller', which fails to recover connections when the primary EventEngine is disabled (common in Python for fork safety).

While upstream fixes are being explored (e.g., grpc/grpc#38480), the issue has persisted for months, leaving exporters stuck in an UNAVAILABLE state indefinitely after collector restarts.

This PR implements a mitigation: when the exporter detects the persistent UNAVAILABLE state, it closes the gRPC channel and re-initializes it before retrying. Recreating the channel resets the underlying poller state, so the exporter can recover immediately without a full application restart. Users get a stable exporter while the more complex upstream fix is finalized.

Nov 30 '25 16:11 dheeraj-vanamala