
Introduce a "Paused" State for Improved Backend Health Check Management

Open shohamyamin opened this issue 1 year ago • 2 comments

Current Behavior

Trino Gateway currently deactivates a backend if it fails a health check. Once deactivated, the backend remains inactive until manually reset to active, as automatic recovery is not supported.

Proposed Improvement

Introduce a new state, "Paused," to distinguish between:

  1. Intentionally paused backends: Marked as "Paused" by users so that no traffic is routed to them.
  2. Unhealthy backends: Automatically deactivated by the gateway but continuously monitored via health checks.

When a backend becomes healthy after previously failing, the gateway would automatically transition it back to an active state, resuming request handling without manual intervention.

Benefits

  • Enhanced automation: Reduces the need for manual intervention when backends recover from transient issues.
  • Operational clarity: Clearly separates user-initiated pauses from automatic deactivations caused by health check failures.
  • Improved availability: Ensures backends are reintroduced to the pool as soon as they recover, minimizing downtime.

Implementation Details

  1. State Management:

    • Add a "Paused" state to the gateway.
    • Backends marked as "Paused" by users would not be eligible for automatic reactivation.
    • Backends that fail health checks would transition to an "inactive" state and remain eligible for automatic reactivation.
  2. Health Check Monitoring:

    • Continue health checks for "inactive" backends.
    • Automatically transition them to "active" once they pass the health checks.
  3. User Interaction:

    • Users can manually set a backend to "Paused."
    • The gateway should provide clear feedback about the reason for a backend's current state (e.g., paused by the user or deactivated due to health check failure).
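The state rules above can be sketched as a small transition function. This is only an illustration under stated assumptions: `BackendStateSketch`, `afterHealthCheck`, and `reason` are hypothetical names, not part of the Trino Gateway codebase, and the three-value enum is one possible way to model the proposal.

```java
// Hypothetical sketch; none of these names come from Trino Gateway itself.
final class BackendStateSketch {
    // ACTIVE: receives traffic. INACTIVE: failed a health check, still monitored.
    // PAUSED: taken out of rotation by a user, never reactivated automatically.
    enum State { ACTIVE, INACTIVE, PAUSED }

    private BackendStateSketch() {}

    /** Returns the state a backend should have after one health check result. */
    static State afterHealthCheck(State current, boolean healthy) {
        if (current == State.PAUSED) {
            return State.PAUSED; // a user pause always wins over health results
        }
        return healthy ? State.ACTIVE : State.INACTIVE;
    }

    /** Human-readable explanation of why a backend is in its current state. */
    static String reason(State state) {
        switch (state) {
            case PAUSED:
                return "paused by user";
            case INACTIVE:
                return "deactivated due to health check failure";
            default:
                return "active";
        }
    }
}
```

Keeping the pause check first means a healthy result can never override a user's decision, which seems to be the key invariant the proposal asks for.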

Example Workflow

  1. A backend fails a health check and is marked "inactive."
  2. The gateway continues monitoring the backend.
  3. Once the backend is healthy, it automatically transitions to "active."
  4. A user intentionally pauses a backend, marking it as "Paused." This backend will not be reactivated automatically, even if healthy.
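To make the workflow concrete, here is a minimal in-memory sketch of a backend pool that follows the four steps. `BackendPoolSketch` and all of its methods are assumptions made for illustration; the real gateway persists backend state and runs health checks on its own schedule, neither of which is modeled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; this is not the Trino Gateway API.
final class BackendPoolSketch {
    enum State { ACTIVE, INACTIVE, PAUSED }

    private final Map<String, State> states = new LinkedHashMap<>();

    /** Adds a backend; in this sketch new backends start out active. */
    void register(String name) {
        states.put(name, State.ACTIVE);
    }

    /** User action: take a backend out of rotation until further notice. */
    void pause(String name) {
        requireKnown(name);
        states.put(name, State.PAUSED);
    }

    /** Monitor callback: applies one health check result to a backend. */
    void recordHealthCheck(String name, boolean healthy) {
        requireKnown(name);
        if (states.get(name) == State.PAUSED) {
            return; // paused backends are never reactivated automatically
        }
        states.put(name, healthy ? State.ACTIVE : State.INACTIVE);
    }

    State stateOf(String name) {
        requireKnown(name);
        return states.get(name);
    }

    /** Names of backends that may receive traffic, in registration order. */
    List<String> routable() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, State> entry : states.entrySet()) {
            if (entry.getValue() == State.ACTIVE) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    private void requireKnown(String name) {
        if (!states.containsKey(name)) {
            throw new IllegalArgumentException("unknown backend: " + name);
        }
    }
}
```

Walking the workflow through this sketch: `recordHealthCheck("trino-1", false)` moves the backend to `INACTIVE`, a later `recordHealthCheck("trino-1", true)` returns it to `ACTIVE`, and after `pause("trino-2")` any number of healthy results leave that backend `PAUSED`.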

shohamyamin, Nov 29 '24 12:11

Is this a duplicate of https://github.com/trinodb/trino-gateway/issues/80?

rdsarvar, Dec 02 '24 04:12

I wouldn't call it a duplicate, but the ideas are closely related and overlap.

mosabua, Dec 02 '24 04:12