feat(internal): implement Bigtable-specific channel pool optimizations
- The hot path (selecting a connection for an RPC) is highly optimized: the list of connections is stored in an atomic.Value, and per-connection load counters are managed with atomic operations (a selection sketch follows this list).
- The pool automatically detects and evicts the single worst-performing unhealthy connection at a regular interval.
- If the percentage of unhealthy connections exceeds a high-water mark (PoolwideBadThreshPercent), all evictions are suspended to avoid overwhelming the system during a wider service degradation.
- Evictions are rate-limited by a minimum interval (MinEvictionInterval) to ensure stability (an eviction-gating sketch follows this list).
- A ChannelHealthMonitor runs in the background, periodically probing each connection in the pool.
- Probes are performed by sending a PingAndWarm RPC to the Bigtable backend, verifying end-to-end connectivity.
- Connection health is evaluated based on the percentage of failed probes over a sliding time window (WindowDuration). A connection is marked unhealthy if its failure rate exceeds a configurable threshold (FailurePercentThresh); a sliding-window sketch follows this list.
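
To make the hot-path claim concrete, here is a minimal sketch of lock-free selection over an atomic.Value snapshot. The identifiers (`connEntry`, `pool`, `pick`, `healthy`, `draining`) and the least-loaded policy are illustrative assumptions, not necessarily what this PR uses:

```go
package btpool

import (
	"sync/atomic"

	"google.golang.org/grpc"
)

// connEntry is a sketch of a pooled connection; field names are assumptions.
type connEntry struct {
	conn     *grpc.ClientConn // the PR wraps this as BigtableConn
	load     atomic.Int64     // in-flight RPCs on this connection
	healthy  atomic.Bool      // updated by the health monitor
	draining atomic.Bool      // set while the connection is being removed
}

// pool keeps its entries behind an atomic.Value so readers never lock.
type pool struct {
	entries atomic.Value // holds []*connEntry; replaced wholesale on scale/evict
}

// pick returns the least-loaded healthy, non-draining connection. It touches
// only the snapshot and per-entry atomics; the caller decrements load when
// its RPC completes.
func (p *pool) pick() *connEntry {
	entries, _ := p.entries.Load().([]*connEntry)
	var best *connEntry
	for _, e := range entries {
		if e.draining.Load() || !e.healthy.Load() {
			continue
		}
		if best == nil || e.load.Load() < best.load.Load() {
			best = e
		}
	}
	if best != nil {
		best.load.Add(1)
	}
	return best
}
```

The point of the sketch is the read path: selection never takes a mutex, only loads the current slice and per-entry counters.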
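The sliding-window evaluation can be sketched roughly as below. Only `WindowDuration` and `FailurePercentThresh` are named in the description above; the concrete values, `healthState`, and `probeResult` are assumptions for illustration:

```go
package btpool

import (
	"sync"
	"time"
)

// Illustrative values; the real thresholds are configurable on the pool.
const (
	WindowDuration       = time.Minute
	FailurePercentThresh = 50.0
)

// healthState tracks probe results for one connection over a sliding window.
type healthState struct {
	mu      sync.Mutex
	results []probeResult
}

type probeResult struct {
	at     time.Time
	failed bool
}

// record appends the outcome of one probe.
func (h *healthState) record(failed bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.results = append(h.results, probeResult{at: time.Now(), failed: failed})
}

// isUnhealthy reports whether the failure rate within WindowDuration exceeds
// FailurePercentThresh; results older than the window are trimmed in place.
func (h *healthState) isUnhealthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	cutoff := time.Now().Add(-WindowDuration)
	kept, failures := h.results[:0], 0
	for _, r := range h.results {
		if r.at.Before(cutoff) {
			continue
		}
		kept = append(kept, r)
		if r.failed {
			failures++
		}
	}
	h.results = kept
	if len(kept) == 0 {
		return false
	}
	return float64(failures)/float64(len(kept))*100 > FailurePercentThresh
}
```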
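And the eviction gating described in the earlier bullets might look like the following. `PoolwideBadThreshPercent` and `MinEvictionInterval` come from the description; the surrounding types, the default values, and the index-based return are simplifications (per the review thread below, the actual code passes the connection itself rather than an index):

```go
package btpool

import "time"

// Illustrative defaults; the real values come from the pool configuration.
const (
	PoolwideBadThreshPercent = 50.0
	MinEvictionInterval      = 30 * time.Second
)

// evictionCandidate is a minimal stand-in for one connection's health summary.
type evictionCandidate struct {
	unhealthy   bool
	failureRate float64
}

// chooseEviction returns the index of the single worst unhealthy connection,
// or -1 when eviction should be skipped: the pool is empty, the last eviction
// was too recent, or so much of the pool is unhealthy that evicting would
// likely worsen a wider degradation (the circuit-breaker case).
func chooseEviction(conns []evictionCandidate, lastEviction, now time.Time) int {
	if len(conns) == 0 || now.Sub(lastEviction) < MinEvictionInterval {
		return -1
	}
	worst, unhealthy := -1, 0
	for i, c := range conns {
		if !c.unhealthy {
			continue
		}
		unhealthy++
		if worst == -1 || c.failureRate > conns[worst].failureRate {
			worst = i
		}
	}
	if float64(unhealthy)/float64(len(conns))*100 >= PoolwideBadThreshPercent {
		return -1 // circuit breaker: suspend evictions pool-wide
	}
	return worst
}
```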
Summary of Changes
Hello @sushanb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a comprehensive health checking and self-healing mechanism for the Bigtable gRPC connection pool. The primary goal is to enhance the reliability and performance of the client by proactively identifying and replacing unhealthy connections. This ensures that requests are consistently routed to functional channels, improving overall service stability and responsiveness.
Highlights
- Enhanced gRPC Channel Health Checking: Implemented a robust health monitoring system for Bigtable gRPC connections, utilizing periodic Prime() RPCs (which internally call PingAndWarm) to verify end-to-end connectivity (a probe-loop sketch follows this list).
- Intelligent Unhealthy Connection Eviction: The connection pool now automatically detects and evicts the single worst-performing unhealthy connection at regular intervals, based on a configurable failure rate over a sliding time window.
- Circuit Breaker for Pool Stability: A circuit breaker mechanism is introduced to suspend evictions if a high percentage of connections are unhealthy, preventing cascading failures during widespread service degradation.
- Improved Connection Management: The internal BigtableChannelPool now uses atomic.Value to store connection entries (connEntry), which encapsulate the BigtableConn (a wrapper around grpc.ClientConn), its load, and its health state, allowing for efficient and thread-safe updates.
- Configurable Logging: Added a Logger option to the Bigtable client and connection pool, along with debug logging utilities, to provide better visibility into connection health and pool operations.
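
For readers skimming the highlights, the background probe loop has roughly this shape. The `probe` callback stands in for the Prime()/PingAndWarm call; its signature and the struct fields here are assumptions, not the PR's actual API:

```go
package btpool

import (
	"context"
	"time"
)

// probeFunc issues one end-to-end probe (e.g. a PingAndWarm RPC) against the
// connection at index idx and reports any error. The signature is assumed.
type probeFunc func(ctx context.Context, idx int) error

// channelHealthMonitor is an illustrative stand-in for the PR's
// ChannelHealthMonitor: it probes every connection on a fixed interval until
// the context is cancelled.
type channelHealthMonitor struct {
	interval time.Duration
	numConns func() int
	probe    probeFunc
	onResult func(idx int, failed bool)
}

func (m *channelHealthMonitor) run(ctx context.Context) {
	ticker := time.NewTicker(m.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for i := 0; i < m.numConns(); i++ {
				err := m.probe(ctx, i)
				m.onResult(i, err != nil) // feeds the sliding-window health state
			}
		}
	}
}
```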
FYI, I have a TODO to make DynamicChannelScaleup, MetricsExporter, and HealthChecker all part of a BackgroundProcess interface.
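
A possible shape for that interface, purely as a sketch since the TODO is not implemented in this PR:

```go
package btpool

import "context"

// BackgroundProcess is one possible shape for the interface mentioned in the
// TODO above; the method set is an assumption, not something in this PR.
type BackgroundProcess interface {
	Start(ctx context.Context) // launch the background goroutine(s)
	Stop()                     // signal shutdown and wait for them to exit
}
```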
Few high-level comments from the first pass:
- Concurrent scaling down and picking a connection. There is a chance that while `selectFunc` picks a connection to send an RPC to, the connection may be closed by `removeConnections`. The RPC will be refused immediately on a closed connection. We should have some protection or workaround here.
- Concurrent scaling and eviction. While this doesn't crash anything, it can still bring surprising edge cases. For example, there is a small chance that the pool can scale up/down while `detectAndEvictUnhealthy` is picking an index of a channel to evict. As `detectAndEvictUnhealthy` works with a snapshot of channels taken prior to the scaling event, the index it passes to `replaceConnection` may point to a "good" channel. Maybe we should pass a pointer to a channel instead, or restrict running scaling and eviction at the same time.
- It feels like `ChannelHealthMonitor` and `DynamicScaleMonitor` should live in their own files (maybe the same package though).
> Few high-level comments from the first pass:
>
> - Concurrent scaling down and picking a connection. There is a chance that while `selectFunc` picks a connection to send an RPC to, the connection may be closed by `removeConnections`. The RPC will be refused immediately on a closed connection. We should have some protection or workaround here.

I added a `drainingState` atomic.Bool to `connEntry`. We avoid choosing a conn that is in the draining state, and actively wait for a period of time for the load on the conn being evicted to reach zero before calling `Close()` on it. `Close()` is thread-safe, so it can be called multiple times on a conn.
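
A rough sketch of the draining flow described above, with field names and the polling wait guessed from this comment rather than taken from the code:

```go
package btpool

import (
	"sync/atomic"
	"time"

	"google.golang.org/grpc"
)

// connEntry here carries only the fields relevant to draining; names are
// guesses based on the comment above.
type connEntry struct {
	conn     *grpc.ClientConn // the PR wraps this as BigtableConn
	load     atomic.Int64     // in-flight RPCs
	draining atomic.Bool      // once set, selectFunc skips this entry
}

// drainAndClose marks the entry as draining so it is no longer selectable,
// waits up to maxWait for in-flight RPCs to finish, then closes the conn.
// Per the comment above, Close is thread-safe and may be called repeatedly.
func drainAndClose(e *connEntry, maxWait, pollEvery time.Duration) error {
	e.draining.Store(true)
	deadline := time.Now().Add(maxWait)
	for e.load.Load() > 0 && time.Now().Before(deadline) {
		time.Sleep(pollEvery)
	}
	return e.conn.Close()
}
```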
> - Concurrent scaling and eviction. While this doesn't crash anything, it can still bring surprising edge cases. For example, there is a small chance that the pool can scale up/down while `detectAndEvictUnhealthy` is picking an index of a channel to evict. As `detectAndEvictUnhealthy` works with a snapshot of channels taken prior to the scaling event, the index it passes to `replaceConnection` may point to a "good" channel.

`replaceConnection` now takes the actual conn pointer rather than an index, which avoids this bug. Thanks for finding it.
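
For illustration, replacement keyed on the entry pointer (rather than an index into a possibly stale snapshot) might look like the copy-on-write sketch below; the exact types and locking here are assumptions:

```go
package btpool

import (
	"sync"
	"sync/atomic"
)

type connEntry struct{ /* conn, load, health, draining ... */ }

type pool struct {
	mu      sync.Mutex   // serializes writers (scaling, eviction)
	entries atomic.Value // holds []*connEntry; readers use this snapshot lock-free
}

// replaceConnection swaps target for fresh using copy-on-write. Because the
// victim is identified by pointer rather than by index, a concurrent scale
// up/down cannot redirect the eviction onto a healthy channel: if target is
// no longer present, the call is simply a no-op.
func (p *pool) replaceConnection(target, fresh *connEntry) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	old, _ := p.entries.Load().([]*connEntry)
	next := make([]*connEntry, len(old))
	copy(next, old)
	for i, e := range next {
		if e == target {
			next[i] = fresh
			p.entries.Store(next)
			return true
		}
	}
	return false // target was already removed by a concurrent scaling event
}
```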
> Maybe we should pass a pointer to a channel instead, or restrict running scaling and eviction at the same time.
>
> - It feels like `ChannelHealthMonitor` and `DynamicScaleMonitor` should live in their own files (maybe the same package though).
Will refactor.