Fix: Reinitialize gRPC channel on UNAVAILABLE error (Fixes #4517) (Fixes #4529)
Description
This PR fixes issue #4517, in which the OTLP gRPC exporter fails to reconnect to the collector after the collector restarts and exports keep returning UNAVAILABLE.
Changes:
- Detected StatusCode.UNAVAILABLE in the export loop.
- Added logic to close the existing channel and re-initialize it before retrying.
- Added a regression test test_unavailable_reconnects to verify the reconnection behavior.
Fixes #4517
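For reviewers, the shape of the change can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: ReconnectingExporter, channel_factory, stub_factory, max_attempts and backoff_s are hypothetical names, and StatusCode and RpcError are local stand-ins for grpc.StatusCode and grpc.RpcError so the sketch runs without grpcio installed.

```python
import enum
import time


class StatusCode(enum.Enum):
    # Stand-in for grpc.StatusCode (assumption: only the codes used here).
    OK = 0
    UNAVAILABLE = 14


class RpcError(Exception):
    # Stand-in for grpc.RpcError; exposes code() like real call errors do.
    def __init__(self, code):
        super().__init__(code.name)
        self._code = code

    def code(self):
        return self._code


class ReconnectingExporter:
    """Hypothetical exporter that rebuilds its channel on UNAVAILABLE."""

    def __init__(self, channel_factory, stub_factory, max_attempts=3, backoff_s=0.0):
        self._channel_factory = channel_factory
        self._stub_factory = stub_factory
        self._max_attempts = max_attempts
        self._backoff_s = backoff_s
        self._init_channel()

    def _init_channel(self):
        # Create a fresh channel and bind a new stub to it.
        self._channel = self._channel_factory()
        self._stub = self._stub_factory(self._channel)

    def export(self, request):
        for attempt in range(self._max_attempts):
            try:
                self._stub.Export(request)
                return True
            except RpcError as error:
                if error.code() is not StatusCode.UNAVAILABLE:
                    return False  # treated as non-retryable in this sketch
                # Close the possibly wedged channel and build a new one
                # before the next attempt.
                self._channel.close()
                self._init_channel()
                time.sleep(self._backoff_s * (2 ** attempt))
        return False
```

With real gRPC, channel_factory would wrap something like grpc.insecure_channel(endpoint) or grpc.secure_channel(endpoint, credentials), and stub_factory would be the generated service stub class.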
Type of change
- [x] Bug fix (non-breaking change which fixes an issue)
How Has This Been Tested?
I added a new regression test case test_unavailable_reconnects in exporter/opentelemetry-exporter-otlp-proto-grpc/tests/test_otlp_exporter_mixin.py.
- [x] test_unavailable_reconnects: Verifies that the exporter closes and re-initializes the gRPC channel when the server returns StatusCode.UNAVAILABLE.
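The idea behind the test can be sketched with unittest.mock, roughly as below. This is an illustrative stand-alone version, not the test added in this PR: export_once_with_reconnect and FakeRpcError are hypothetical stand-ins for the exporter's retry path and for grpc.RpcError, and the status code is modeled as a plain string.

```python
import unittest
from unittest import mock


class FakeRpcError(Exception):
    """Stand-in for grpc.RpcError carrying a status code name."""

    def __init__(self, code):
        super().__init__(code)
        self._code = code

    def code(self):
        return self._code


def export_once_with_reconnect(state, make_channel, make_stub, request):
    """Hypothetical helper: one export attempt, plus one retry on a fresh channel."""
    try:
        return state["stub"].Export(request)
    except FakeRpcError as error:
        if error.code() != "UNAVAILABLE":
            raise
        # Close the old channel, then rebuild channel and stub before retrying.
        state["channel"].close()
        state["channel"] = make_channel()
        state["stub"] = make_stub(state["channel"])
        return state["stub"].Export(request)


class TestUnavailableReconnects(unittest.TestCase):
    def test_unavailable_reconnects(self):
        old_channel, new_channel = mock.Mock(), mock.Mock()
        old_stub, new_stub = mock.Mock(), mock.Mock()
        # The first stub always fails with UNAVAILABLE; the new one succeeds.
        old_stub.Export.side_effect = FakeRpcError("UNAVAILABLE")
        new_stub.Export.return_value = "ok"
        make_channel = mock.Mock(return_value=new_channel)
        make_stub = mock.Mock(return_value=new_stub)
        state = {"channel": old_channel, "stub": old_stub}

        result = export_once_with_reconnect(state, make_channel, make_stub, "req")

        self.assertEqual(result, "ok")
        old_channel.close.assert_called_once_with()
        make_channel.assert_called_once_with()
        make_stub.assert_called_once_with(new_channel)
        new_channel.close.assert_not_called()
```

The assertions check the two properties that matter for the regression: the stale channel is closed exactly once, and the retry goes through a stub bound to the newly created channel.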
Does This PR Require a Contrib Repo Change?
- [x] No.
Checklist:
- [x] Followed the style guidelines of this project
- [ ] Changelogs have been updated
- [x] Unit tests have been added
- [ ] Documentation has been updated
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: dheeraj-vanamala / name: Dheeraj Vanamala (2c848f44662785ce7c41216150c84f8df3433e63, 31594a3c25ebdfe73752f23303b234240dbf36c5, 436ecc98da12c702dc9579a3792b9e601b32f528, 8b397a775fb6dfced456d2ef6599eff79dc751c9)
I understand this issue is related to the upstream gRPC bug (grpc/grpc#38290).
I've analyzed that issue in depth, and the root cause appears to be a regression in the gRPC 'backup poller' (introduced in grpcio>=1.68.0) which fails to recover connections when the primary EventEngine is disabled (common in Python for fork safety).
While upstream fixes are being explored (e.g., grpc/grpc#38480), the issue has persisted for months, leaving exporters stuck in an UNAVAILABLE state indefinitely after collector restarts.
This PR implements a mitigation: it detects the persistent UNAVAILABLE state and forces a channel re-initialization. Recreating the channel resets the underlying poller state, so the exporter can recover on retry without a full application restart. This keeps the exporter stable for users while the more complex upstream fix is finalized.