Add health monitoring and circuit breaker to vMCP server

Open JAORMX opened this issue 3 weeks ago • 0 comments

Description

Implement health monitoring and circuit breaker patterns to detect and handle backend failures gracefully.

Scope:

  • Periodic health checks for backend MCP servers
  • Backend health status tracking and reporting
  • Circuit breaker implementation for failing backends
  • Automatic backend removal/restoration based on health
  • Health status reflected in vMCP status and capabilities

Key Components

1. Backend Health Checks

  • Periodic health check requests to each backend
  • Configurable check interval (default: 30s)
  • Track consecutive failures
  • Mark backend unhealthy after a threshold of consecutive failures (default: 3 failures)
  • Health check endpoint: MCP ping or tools/list
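
A minimal sketch of the periodic check loop described above, not a proposed final design. The names `Checker`, `PingFunc`, `Record`, and `alwaysFail` are hypothetical; a real probe would issue an MCP ping or tools/list request. The sketch also assumes a backend counts as healthy until the threshold is reached.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// PingFunc is a hypothetical probe. A real one would send an MCP ping or
// tools/list request to the backend and return any error.
type PingFunc func(ctx context.Context) error

var errDown = errors.New("backend down")

// alwaysFail stands in for a backend that never answers (demo only).
func alwaysFail(context.Context) error { return errDown }

// Checker counts consecutive probe failures for a single backend.
type Checker struct {
	mu        sync.Mutex
	ping      PingFunc
	threshold int // consecutive failures before the backend is unhealthy
	failures  int
	healthy   bool
}

// NewChecker assumes a backend is healthy until proven otherwise.
func NewChecker(ping PingFunc, threshold int) *Checker {
	return &Checker{ping: ping, threshold: threshold, healthy: true}
}

// Record folds one probe result into the consecutive-failure count.
func (c *Checker) Record(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err == nil {
		c.failures, c.healthy = 0, true
		return
	}
	c.failures++
	if c.failures >= c.threshold {
		c.healthy = false
	}
}

// Healthy reports the last known state without probing the backend.
func (c *Checker) Healthy() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.healthy
}

// Run probes on every tick and returns when ctx is cancelled.
func (c *Checker) Run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			c.Record(c.ping(ctx))
		}
	}
}

func main() {
	c := NewChecker(alwaysFail, 3)
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	// 10ms keeps the demo short; the issue proposes a 30s default.
	c.Run(ctx, 10*time.Millisecond)
	fmt.Println("healthy:", c.Healthy())
}
```

In a server, `Run` would be started in its own goroutine per backend, with the shutdown context passed in so the loop stops cleanly.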

2. Health Status Tracking

  • Per-backend health state: healthy, unhealthy, unknown
  • Last successful health check timestamp
  • Failure count and error messages
  • Health status exposed in vMCP status/metrics
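
One possible shape for the per-backend record, sketched under assumptions: the type and field names (`Tracker`, `Status`, `Report`) are illustrative and not taken from the ToolHive codebase. A backend stays `unknown` until a check succeeds or the failure threshold is reached.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// State is the per-backend health state named in the issue.
type State int

const (
	Unknown State = iota // no conclusive check result yet
	Healthy
	Unhealthy
)

func (s State) String() string {
	return [...]string{"unknown", "healthy", "unhealthy"}[s]
}

// Status is the record kept for each backend.
type Status struct {
	State        State
	LastSuccess  time.Time // zero until the first successful check
	FailureCount int       // consecutive failures
	LastError    string
}

// Tracker stores the latest Status per backend name.
type Tracker struct {
	mu        sync.RWMutex
	threshold int
	backends  map[string]Status
}

func NewTracker(threshold int) *Tracker {
	return &Tracker{threshold: threshold, backends: make(map[string]Status)}
}

// Report records the outcome of one health check for a backend.
func (t *Tracker) Report(backend string, err error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s := t.backends[backend]
	if err == nil {
		s.State, s.FailureCount, s.LastError = Healthy, 0, ""
		s.LastSuccess = time.Now()
	} else {
		s.FailureCount++
		s.LastError = err.Error()
		if s.FailureCount >= t.threshold {
			s.State = Unhealthy
		}
	}
	t.backends[backend] = s
}

// Get returns a copy of the status; unseen backends report Unknown.
func (t *Tracker) Get(backend string) Status {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.backends[backend]
}

var errTimeout = errors.New("health check timed out")

func main() {
	t := NewTracker(3)
	t.Report("backend-a", nil)
	for i := 0; i < 3; i++ {
		t.Report("backend-b", errTimeout)
	}
	for _, name := range []string{"backend-a", "backend-b", "backend-c"} {
		s := t.Get(name)
		fmt.Printf("%s: %s (failures=%d, last error=%q)\n",
			name, s.State, s.FailureCount, s.LastError)
	}
}
```

Because `Get` returns a copy, a status or metrics endpoint can read it without holding the lock while it serializes the result.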

3. Circuit Breaker

  • Three states: closed (normal), open (failing), half-open (testing recovery)
  • Configurable failure threshold (default: 5 failures)
  • Configurable timeout for open state (default: 60s)
  • Automatic transition to half-open for recovery testing
  • Track circuit breaker state per backend
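
The three-state machine can be sketched as follows. This is a hand-rolled illustration with hypothetical names (`Breaker`, `Allow`, `Record`, `Call`), not the API of any existing library. For brevity, the half-open state here lets every call through, whereas a production breaker would typically admit only a limited number of trial calls.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type BreakerState int

const (
	Closed   BreakerState = iota // normal operation
	Open                         // failing: calls are rejected
	HalfOpen                     // testing recovery
)

var (
	ErrOpen        = errors.New("circuit breaker is open")
	errBackendDown = errors.New("backend down") // demo error
)

// Breaker tracks circuit state for one backend.
type Breaker struct {
	mu          sync.Mutex
	state       BreakerState
	failures    int
	threshold   int
	openTimeout time.Duration
	openedAt    time.Time
}

func NewBreaker(threshold int, openTimeout time.Duration) *Breaker {
	return &Breaker{threshold: threshold, openTimeout: openTimeout}
}

// Allow reports whether a call may proceed. Once the open timeout has
// elapsed it moves the breaker to half-open so a trial call can go through.
func (b *Breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == Open && time.Since(b.openedAt) >= b.openTimeout {
		b.state = HalfOpen
	}
	return b.state != Open
}

// Record applies a call result: success closes the breaker; a failure
// opens it when half-open or when the failure threshold is reached.
func (b *Breaker) Record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.state, b.failures = Closed, 0
		return
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.threshold {
		b.state, b.openedAt = Open, time.Now()
	}
}

// State returns the stored state without triggering a transition.
func (b *Breaker) State() BreakerState {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.state
}

// Call runs fn through the breaker and returns ErrOpen when rejected.
func (b *Breaker) Call(fn func() error) error {
	if !b.Allow() {
		return ErrOpen
	}
	err := fn()
	b.Record(err)
	return err
}

func main() {
	b := NewBreaker(5, 60*time.Second) // defaults proposed in the issue
	fail := func() error { return errBackendDown }
	for i := 0; i < 6; i++ {
		fmt.Println(b.Call(fail)) // the sixth call is rejected with ErrOpen
	}
}
```

Keeping one `Breaker` per backend, for example in a map keyed by backend name, matches the per-backend tracking requirement.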

4. Backend Availability Management

  • Remove unhealthy backend tools from aggregated capabilities
  • Return error when routing to unavailable backend
  • Automatically restore backend when health recovers
  • Log backend state transitions (healthy ↔ unhealthy)
  • Emit metrics for monitoring systems
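
The removal and restoration behavior might look like the sketch below. Everything here is an assumption for illustration: the `Registry` type, the backend and tool names, the use of the standard `log` package for transition logging, and the simplification that tool names are unique across backends. The sketch is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sort"
)

var ErrBackendUnavailable = errors.New("backend unavailable")

// Registry maps hypothetical backend names to the tools they expose.
type Registry struct {
	tools   map[string][]string
	healthy map[string]bool
}

// NewRegistry starts with every backend marked healthy.
func NewRegistry(tools map[string][]string) *Registry {
	r := &Registry{tools: tools, healthy: make(map[string]bool)}
	for backend := range tools {
		r.healthy[backend] = true
	}
	return r
}

// SetHealth records a health change, logs the transition, and reports
// whether the state actually changed.
func (r *Registry) SetHealth(backend string, healthy bool) bool {
	if r.healthy[backend] == healthy {
		return false
	}
	r.healthy[backend] = healthy
	log.Printf("backend %q transitioned: healthy=%t", backend, healthy)
	return true
}

// Capabilities returns the sorted tool names of healthy backends only.
func (r *Registry) Capabilities() []string {
	var out []string
	for backend, tools := range r.tools {
		if r.healthy[backend] {
			out = append(out, tools...)
		}
	}
	sort.Strings(out)
	return out
}

// Route returns the backend that owns tool, or an error wrapping
// ErrBackendUnavailable when that backend is currently unhealthy.
func (r *Registry) Route(tool string) (string, error) {
	for backend, tools := range r.tools {
		for _, t := range tools {
			if t != tool {
				continue
			}
			if !r.healthy[backend] {
				return "", fmt.Errorf("%w: %s", ErrBackendUnavailable, backend)
			}
			return backend, nil
		}
	}
	return "", fmt.Errorf("unknown tool %q", tool)
}

func main() {
	r := NewRegistry(map[string][]string{
		"github": {"create_issue", "search_code"},
		"files":  {"read_file"},
	})
	r.SetHealth("github", false)
	fmt.Println(r.Capabilities())
	_, err := r.Route("search_code")
	fmt.Println(err, errors.Is(err, ErrBackendUnavailable))
	r.SetHealth("github", true) // tools reappear once health recovers
	fmt.Println(r.Capabilities())
}
```

Returning a boolean from `SetHealth` gives callers a single place to emit a metric or refresh aggregated capabilities only when the state really changes.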

5. Failure Modes

  • fail mode (default): Fail entire request if backend unavailable
  • best_effort mode: Return partial results, include errors for failed backends
  • Configurable per vMCP instance
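
To make the difference between the two modes concrete, here is a small sketch of an aggregation step. The `Aggregate` function, the `Result` shape, and the `ListFunc` callback are hypothetical; only the mode names `fail` and `best_effort` come from the issue.

```go
package main

import (
	"errors"
	"fmt"
)

// FailureMode selects how backend errors affect an aggregated request.
type FailureMode string

const (
	ModeFail       FailureMode = "fail"        // default: any error fails the request
	ModeBestEffort FailureMode = "best_effort" // partial results plus per-backend errors
)

// Result carries aggregated tools and, in best_effort mode, the errors
// of the backends that failed, keyed by backend name.
type Result struct {
	Tools  []string
	Errors map[string]string
}

// ListFunc is a hypothetical per-backend call, such as tools/list.
type ListFunc func(backend string) ([]string, error)

// Aggregate queries each backend in order and applies the failure mode.
func Aggregate(mode FailureMode, backends []string, list ListFunc) (Result, error) {
	res := Result{Errors: make(map[string]string)}
	for _, b := range backends {
		tools, err := list(b)
		if err == nil {
			res.Tools = append(res.Tools, tools...)
			continue
		}
		if mode != ModeBestEffort {
			// fail mode: abort the entire request on the first error.
			return Result{}, fmt.Errorf("backend %s: %w", b, err)
		}
		res.Errors[b] = err.Error()
	}
	return res, nil
}

var errDown = errors.New("backend down")

func main() {
	list := func(b string) ([]string, error) {
		if b == "flaky" {
			return nil, errDown
		}
		return []string{b + ".tool"}, nil
	}
	backends := []string{"alpha", "flaky", "beta"}

	_, err := Aggregate(ModeFail, backends, list)
	fmt.Println("fail:", err)

	res, _ := Aggregate(ModeBestEffort, backends, list)
	fmt.Println("best_effort:", res.Tools, res.Errors)
}
```

Treating any unrecognized mode as `fail` keeps the stricter behavior as the default, in line with the issue.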

6. Integration Points

  • Update aggregated capabilities when backend health changes
  • Routing layer checks health before forwarding
  • Status reporting includes backend health summary
  • Metrics exported for Prometheus/observability

Implementation Notes

  • Health checks run in background goroutines
  • Use context for cancellation on shutdown
  • Circuit breaker prevents cascading failures
  • Health status should be cached so the routing layer reads the last known state instead of probing the backend on every request
  • Transition events should be logged and emitted as metrics
  • Follow existing ToolHive observability patterns in pkg/telemetry/
  • Consider using sony/gobreaker or similar library

Reference

Acceptance Criteria

  • [ ] Periodic health check implementation
  • [ ] Configurable health check interval
  • [ ] Backend health state tracking (healthy/unhealthy/unknown)
  • [ ] Consecutive failure counting
  • [ ] Unhealthy threshold configuration
  • [ ] Circuit breaker implementation (closed/open/half-open states)
  • [ ] Configurable failure threshold for circuit breaker
  • [ ] Configurable timeout for open state
  • [ ] Automatic transition to half-open for recovery testing
  • [ ] Remove unhealthy backend tools from capabilities
  • [ ] Error response when routing to unavailable backend
  • [ ] Automatic restoration when backend recovers
  • [ ] Health state transition logging
  • [ ] Partial failure mode support (fail vs best_effort)
  • [ ] Metrics emission for observability
  • [ ] Unit tests for health check logic
  • [ ] Unit tests for circuit breaker state machine
  • [ ] Integration tests with flaky mock backends
  • [ ] E2E tests with backend failure and recovery scenarios

JAORMX, Dec 02 '25 20:12