trino-gateway
Introduce a "Paused" State for Improved Backend Health Check Management
Current Behavior
Trino Gateway currently deactivates a backend if it fails a health check. Once deactivated, the backend remains inactive until manually reset to active, as automatic recovery is not supported.
Proposed Improvement
Introduce a new state, "Paused," to distinguish between:
- Intentionally paused backends: Marked as "Paused" by users so that no traffic is routed to them.
- Unhealthy backends: Automatically deactivated by the gateway but continuously monitored via health checks.
When a backend becomes healthy after previously failing, the gateway would automatically transition it back to an active state, resuming request handling without manual intervention.
Benefits
- Enhanced automation: Reduces the need for manual intervention when backends recover from transient issues.
- Operational clarity: Clearly separates user-initiated pauses from automatic deactivations caused by health check failures.
- Improved availability: Ensures backends are reintroduced to the pool as soon as they recover, minimizing downtime.
Implementation Details
- State Management:
  - Add a "Paused" state to the gateway.
  - Backends marked as "Paused" by users would not be eligible for automatic reactivation.
  - Backends that fail health checks would transition to an "inactive" state and remain eligible for automatic reactivation.
- Health Check Monitoring:
  - Continue health checks for "inactive" backends.
  - Automatically transition them to "active" once they pass the health checks.
- User Interaction:
  - Users can manually set a backend to "Paused."
  - The gateway should provide clear feedback about the reason for a backend's current state (e.g., paused by the user or deactivated due to health check failure).
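The state rules above could be sketched roughly as follows. This is only an illustration in Python, not Trino Gateway code: `BackendState` and `next_state` are hypothetical names, and the sketch assumes a single health check result is enough to trigger a transition.

```python
from enum import Enum


class BackendState(Enum):
    """Hypothetical states for a backend, as proposed in this issue."""
    ACTIVE = "active"
    INACTIVE = "inactive"  # deactivated automatically by a failed health check
    PAUSED = "paused"      # set manually by a user


def next_state(current: BackendState, healthy: bool) -> BackendState:
    """Return the state a backend should have after one health check result."""
    if current is BackendState.PAUSED:
        # A user-initiated pause is never overridden by health checks.
        return BackendState.PAUSED
    # ACTIVE and INACTIVE backends simply follow the latest health check.
    return BackendState.ACTIVE if healthy else BackendState.INACTIVE
```

The point of the sketch is that "Paused" is the only state health checks cannot leave, which is what separates it from "inactive". A real implementation might also require several consecutive passes or failures before switching state.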
Example Workflow
- A backend fails a health check and is marked "inactive."
- The gateway continues monitoring the backend.
- Once the backend is healthy, it automatically transitions to "active."
- A user intentionally pauses a backend, marking it as "Paused." This backend will not be reactivated automatically, even if healthy.
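To make the workflow concrete, here is a small simulation of those steps. It is a sketch under assumptions, not the gateway's implementation: the `Backend` class, its `reason` field, and the backend name `trino-1` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical backend record that also tracks why it is in its state."""
    name: str
    state: str = "active"
    reason: str = "initial state"

    def record_health_check(self, healthy: bool) -> None:
        if self.state == "paused":
            return  # paused backends are never reactivated automatically
        if healthy and self.state == "inactive":
            self.state, self.reason = "active", "health check passed"
        elif not healthy and self.state == "active":
            self.state, self.reason = "inactive", "health check failed"

    def pause(self) -> None:
        self.state, self.reason = "paused", "paused by user"


backend = Backend("trino-1")
history = []

backend.record_health_check(False)  # step 1: the backend fails a health check
history.append(backend.state)
backend.record_health_check(False)  # step 2: monitoring continues, still unhealthy
history.append(backend.state)
backend.record_health_check(True)   # step 3: the backend recovers
history.append(backend.state)
backend.pause()                     # step 4: a user pauses the backend
backend.record_health_check(True)   # healthy, but the pause is respected
history.append(backend.state)

print(history)
print(backend.reason)
```

Storing a `reason` next to the state is one way the gateway could report why a backend is not receiving traffic.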
Is this a duplicate of https://github.com/trinodb/trino-gateway/issues/80?
I wouldn't call it a duplicate, but the ideas are closely related and overlapping.