feat(internal): implement Bigtable-specific channel pool optimizations
- The hot path (selecting a connection for an RPC) is highly optimized: the list of connections is stored in an atomic.Value, and per-connection load counters are managed with atomic operations (a selection sketch follows this list).
- The pool automatically detects and evicts the single worst-performing unhealthy connection at a regular interval.
- If the percentage of unhealthy connections exceeds a high-water mark (PoolwideBadThreshPercent), all evictions are suspended to avoid overwhelming the system during a wider service degradation.
- Evictions are rate-limited by a minimum interval (MinEvictionInterval) to ensure stability (an eviction-gating sketch follows this list).
- A ChannelHealthMonitor runs in the background, periodically probing each connection in the pool.
- Probes are performed by sending a PingAndWarm RPC to the Bigtable backend, verifying end-to-end connectivity.
- Connection health is evaluated based on the percentage of failed probes over a sliding time window (WindowDuration). A connection is marked unhealthy if its failure rate exceeds a configurable threshold (FailurePercentThresh); a sliding-window sketch follows this list.
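
To make the hot-path claim concrete, here is a minimal sketch of lock-free selection over an atomic.Value snapshot. The identifiers (`connEntry`, `pool`, `pick`, `healthy`, `draining`) and the least-loaded policy are illustrative assumptions, not necessarily what this PR uses:

```go
package btpool

import (
	"sync/atomic"

	"google.golang.org/grpc"
)

// connEntry is a sketch of a pooled connection; field names are assumptions.
type connEntry struct {
	conn     *grpc.ClientConn // the PR wraps this as BigtableConn
	load     atomic.Int64     // in-flight RPCs on this connection
	healthy  atomic.Bool      // updated by the health monitor
	draining atomic.Bool      // set while the connection is being removed
}

// pool keeps its entries behind an atomic.Value so readers never lock.
type pool struct {
	entries atomic.Value // holds []*connEntry; replaced wholesale on scale/evict
}

// pick returns the least-loaded healthy, non-draining connection. It touches
// only the snapshot and per-entry atomics; the caller decrements load when
// its RPC completes.
func (p *pool) pick() *connEntry {
	entries, _ := p.entries.Load().([]*connEntry)
	var best *connEntry
	for _, e := range entries {
		if e.draining.Load() || !e.healthy.Load() {
			continue
		}
		if best == nil || e.load.Load() < best.load.Load() {
			best = e
		}
	}
	if best != nil {
		best.load.Add(1)
	}
	return best
}
```

The point of the sketch is the read path: selection never takes a mutex, only loads the current slice and per-entry counters.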
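The sliding-window evaluation can be sketched roughly as below. Only `WindowDuration` and `FailurePercentThresh` are named in the description above; the concrete values, `healthState`, and `probeResult` are assumptions for illustration:

```go
package btpool

import (
	"sync"
	"time"
)

// Illustrative values; the real thresholds are configurable on the pool.
const (
	WindowDuration       = time.Minute
	FailurePercentThresh = 50.0
)

// healthState tracks probe results for one connection over a sliding window.
type healthState struct {
	mu      sync.Mutex
	results []probeResult
}

type probeResult struct {
	at     time.Time
	failed bool
}

// record appends the outcome of one probe.
func (h *healthState) record(failed bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.results = append(h.results, probeResult{at: time.Now(), failed: failed})
}

// isUnhealthy reports whether the failure rate within WindowDuration exceeds
// FailurePercentThresh; results older than the window are trimmed in place.
func (h *healthState) isUnhealthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	cutoff := time.Now().Add(-WindowDuration)
	kept, failures := h.results[:0], 0
	for _, r := range h.results {
		if r.at.Before(cutoff) {
			continue
		}
		kept = append(kept, r)
		if r.failed {
			failures++
		}
	}
	h.results = kept
	if len(kept) == 0 {
		return false
	}
	return float64(failures)/float64(len(kept))*100 > FailurePercentThresh
}
```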
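And the eviction gating described in the earlier bullets might look like the following. `PoolwideBadThreshPercent` and `MinEvictionInterval` come from the description; the surrounding types, the default values, and the index-based return are simplifications (per the review thread below, the actual code passes the connection itself rather than an index):

```go
package btpool

import "time"

// Illustrative defaults; the real values come from the pool configuration.
const (
	PoolwideBadThreshPercent = 50.0
	MinEvictionInterval      = 30 * time.Second
)

// evictionCandidate is a minimal stand-in for one connection's health summary.
type evictionCandidate struct {
	unhealthy   bool
	failureRate float64
}

// chooseEviction returns the index of the single worst unhealthy connection,
// or -1 when eviction should be skipped: the pool is empty, the last eviction
// was too recent, or so much of the pool is unhealthy that evicting would
// likely worsen a wider degradation (the circuit-breaker case).
func chooseEviction(conns []evictionCandidate, lastEviction, now time.Time) int {
	if len(conns) == 0 || now.Sub(lastEviction) < MinEvictionInterval {
		return -1
	}
	worst, unhealthy := -1, 0
	for i, c := range conns {
		if !c.unhealthy {
			continue
		}
		unhealthy++
		if worst == -1 || c.failureRate > conns[worst].failureRate {
			worst = i
		}
	}
	if float64(unhealthy)/float64(len(conns))*100 >= PoolwideBadThreshPercent {
		return -1 // circuit breaker: suspend evictions pool-wide
	}
	return worst
}
```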
Summary of Changes
Hello @sushanb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a comprehensive health checking and self-healing mechanism for the Bigtable gRPC connection pool. The primary goal is to enhance the reliability and performance of the client by proactively identifying and replacing unhealthy connections. This ensures that requests are consistently routed to functional channels, improving overall service stability and responsiveness.
Highlights
- Enhanced gRPC Channel Health Checking: Implemented a robust health monitoring system for Bigtable gRPC connections, utilizing periodic Prime() RPCs (which internally call PingAndWarm) to verify end-to-end connectivity (a probe-loop sketch follows this list).
- Intelligent Unhealthy Connection Eviction: The connection pool now automatically detects and evicts the single worst-performing unhealthy connection at regular intervals, based on a configurable failure rate over a sliding time window.
- Circuit Breaker for Pool Stability: A circuit breaker mechanism is introduced to suspend evictions if a high percentage of connections are unhealthy, preventing cascading failures during widespread service degradation.
- Improved Connection Management: The internal BigtableChannelPool now uses atomic.Value to store connection entries (connEntry), which encapsulate the BigtableConn (a wrapper around grpc.ClientConn), its load, and its health state, allowing for efficient and thread-safe updates.
- Configurable Logging: Added a Logger option to the Bigtable client and connection pool, along with debug logging utilities, to provide better visibility into connection health and pool operations.
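
For readers skimming the highlights, the background probe loop has roughly this shape. The `probe` callback stands in for the Prime()/PingAndWarm call; its signature and the struct fields here are assumptions, not the PR's actual API:

```go
package btpool

import (
	"context"
	"time"
)

// probeFunc issues one end-to-end probe (e.g. a PingAndWarm RPC) against the
// connection at index idx and reports any error. The signature is assumed.
type probeFunc func(ctx context.Context, idx int) error

// channelHealthMonitor is an illustrative stand-in for the PR's
// ChannelHealthMonitor: it probes every connection on a fixed interval until
// the context is cancelled.
type channelHealthMonitor struct {
	interval time.Duration
	numConns func() int
	probe    probeFunc
	onResult func(idx int, failed bool)
}

func (m *channelHealthMonitor) run(ctx context.Context) {
	ticker := time.NewTicker(m.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for i := 0; i < m.numConns(); i++ {
				err := m.probe(ctx, i)
				m.onResult(i, err != nil) // feeds the sliding-window health state
			}
		}
	}
}
```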
FYI, I have a TODO to make DynamicChannelScaleup, MetricsExporter, and HealthChecker all part of a BackgroundProcess interface.
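
A possible shape for that interface, purely as a sketch since the TODO is not implemented in this PR:

```go
package btpool

import "context"

// BackgroundProcess is one possible shape for the interface mentioned in the
// TODO above; the method set is an assumption, not something in this PR.
type BackgroundProcess interface {
	Start(ctx context.Context) // launch the background goroutine(s)
	Stop()                     // signal shutdown and wait for them to exit
}
```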
Few high-level comments from the first pass:
- Concurrent scaling down and picking a connection. There is a chance that while `selectFunc` picks a connection to send an RPC to, the connection may be closed by `removeConnections`. The RPC will be refused immediately on a closed connection. We should have some protection or workaround here.
- Concurrent scaling and eviction. While this doesn't crash anything, it can still bring surprising edge cases. For example, there is a small chance that the pool can scale up/down while `detectAndEvictUnhealthy` is picking an index of a channel to evict. As `detectAndEvictUnhealthy` works with a snapshot of channels taken prior to the scaling event, the index it passes to `replaceConnection` may point to a "good" channel. Maybe we should pass a pointer to a channel instead, or restrict running scaling and eviction at the same time.
- It feels like `ChannelHealthMonitor` and `DynamicScaleMonitor` should live in their own files (maybe the same package though).
> Few high-level comments from the first pass:
>
> - Concurrent scaling down and picking a connection. There is a chance that while `selectFunc` picks a connection to send an RPC to, the connection may be closed by `removeConnections`. The RPC will be refused immediately on a closed connection. We should have some protection or workaround here.

I added a `drainingState` atomic.Bool to `connEntry`. We avoid choosing a conn that is in the draining state, and actively wait for a period of time for the load on the conn being evicted to reach zero before calling `Close()` on it. `Close()` is thread-safe, so it can be called multiple times on a conn.
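
A rough sketch of the draining flow described above, with field names and the polling wait guessed from this comment rather than taken from the code:

```go
package btpool

import (
	"sync/atomic"
	"time"

	"google.golang.org/grpc"
)

// connEntry here carries only the fields relevant to draining; names are
// guesses based on the comment above.
type connEntry struct {
	conn     *grpc.ClientConn // the PR wraps this as BigtableConn
	load     atomic.Int64     // in-flight RPCs
	draining atomic.Bool      // once set, selectFunc skips this entry
}

// drainAndClose marks the entry as draining so it is no longer selectable,
// waits up to maxWait for in-flight RPCs to finish, then closes the conn.
// Per the comment above, Close is thread-safe and may be called repeatedly.
func drainAndClose(e *connEntry, maxWait, pollEvery time.Duration) error {
	e.draining.Store(true)
	deadline := time.Now().Add(maxWait)
	for e.load.Load() > 0 && time.Now().Before(deadline) {
		time.Sleep(pollEvery)
	}
	return e.conn.Close()
}
```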
> - Concurrent scaling and eviction. While this doesn't crash anything, it can still bring surprising edge cases. For example, there is a small chance that the pool can scale up/down while `detectAndEvictUnhealthy` is picking an index of a channel to evict. As `detectAndEvictUnhealthy` works with a snapshot of channels taken prior to the scaling event, the index it passes to `replaceConnection` may point to a "good" channel.

`replaceConnection` now takes the actual conn pointer rather than an index, which avoids this bug. Thanks for finding it.
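
For illustration, replacement keyed on the entry pointer (rather than an index into a possibly stale snapshot) might look like the copy-on-write sketch below; the exact types and locking here are assumptions:

```go
package btpool

import (
	"sync"
	"sync/atomic"
)

type connEntry struct{ /* conn, load, health, draining ... */ }

type pool struct {
	mu      sync.Mutex   // serializes writers (scaling, eviction)
	entries atomic.Value // holds []*connEntry; readers use this snapshot lock-free
}

// replaceConnection swaps target for fresh using copy-on-write. Because the
// victim is identified by pointer rather than by index, a concurrent scale
// up/down cannot redirect the eviction onto a healthy channel: if target is
// no longer present, the call is simply a no-op.
func (p *pool) replaceConnection(target, fresh *connEntry) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	old, _ := p.entries.Load().([]*connEntry)
	next := make([]*connEntry, len(old))
	copy(next, old)
	for i, e := range next {
		if e == target {
			next[i] = fresh
			p.entries.Store(next)
			return true
		}
	}
	return false // target was already removed by a concurrent scaling event
}
```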
> Maybe we should pass a pointer to a channel instead, or restrict running scaling and eviction at the same time.
>
> - It feels like `ChannelHealthMonitor` and `DynamicScaleMonitor` should live in their own files (maybe the same package though).
Will refactor.