toolhive
Add health monitoring and circuit breaker to vMCP server
Description
Implement health monitoring and circuit breaker patterns to detect and handle backend failures gracefully.
Scope:
- Periodic health checks for backend MCP servers
- Backend health status tracking and reporting
- Circuit breaker implementation for failing backends
- Automatic backend removal/restoration based on health
- Health status reflected in vMCP status and capabilities
Key Components
1. Backend Health Checks
- Periodic health check requests to each backend
- Configurable check interval (default: 30s)
- Track consecutive failures
- Mark backend unhealthy after threshold (default: 3 failures)
- Health check endpoint: MCP `ping` or `tools/list`
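
The consecutive-failure rule above can be sketched as follows. This is a minimal illustration, not ToolHive code: `Checker`, `Observe`, and `errDown` are hypothetical names, and the probe itself (an MCP `ping` or `tools/list` call) is assumed to happen elsewhere and hand its error to `Observe`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Defaults taken from the issue text.
const (
	defaultInterval  = 30 * time.Second
	defaultThreshold = 3
)

// errDown stands in for whatever error a failed probe returns (assumption).
var errDown = errors.New("backend did not answer ping")

// Checker (hypothetical name) counts consecutive probe failures for one backend.
type Checker struct {
	Interval  time.Duration // how often the caller should probe
	Threshold int           // consecutive failures before "unhealthy"
	failures  int
}

// Observe records one probe result and reports whether the backend
// still counts as healthy. Any success resets the failure streak.
func (c *Checker) Observe(err error) bool {
	if err == nil {
		c.failures = 0
		return true
	}
	c.failures++
	return c.failures < c.Threshold
}

func main() {
	c := &Checker{Interval: defaultInterval, Threshold: defaultThreshold}
	for i, err := range []error{errDown, errDown, errDown, nil} {
		fmt.Printf("probe %d: healthy=%v\n", i+1, c.Observe(err))
	}
}
```

With the default threshold of 3, the third consecutive failure flips the backend to unhealthy and the next success restores it.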
2. Health Status Tracking
- Per-backend health state: `healthy`, `unhealthy`, `unknown`
- Last successful health check timestamp
- Failure count and error messages
- Health status exposed in vMCP status/metrics
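
One possible shape for the per-backend record is sketched below. The type and field names (`Status`, `Tracker`, `Record`, `Get`) are assumptions made for illustration, as is the choice to keep a never-probed backend in `unknown` until it either succeeds once or reaches the failure threshold.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type State string

const (
	StateUnknown   State = "unknown"
	StateHealthy   State = "healthy"
	StateUnhealthy State = "unhealthy"
)

// errTimeout is a placeholder probe error (assumption).
var errTimeout = errors.New("timeout")

// Status holds the fields listed in the issue for one backend.
type Status struct {
	State       State
	LastSuccess time.Time
	Failures    int
	LastError   string
}

// Tracker (hypothetical) stores the latest Status per backend.
type Tracker struct {
	mu        sync.RWMutex
	threshold int
	backends  map[string]Status
}

func NewTracker(threshold int) *Tracker {
	return &Tracker{threshold: threshold, backends: make(map[string]Status)}
}

// Get returns the stored status, or StateUnknown for a backend never probed.
func (t *Tracker) Get(name string) Status {
	t.mu.RLock()
	defer t.mu.RUnlock()
	if s, ok := t.backends[name]; ok {
		return s
	}
	return Status{State: StateUnknown}
}

// Record folds one health check result into the backend's status.
func (t *Tracker) Record(name string, err error) Status {
	t.mu.Lock()
	defer t.mu.Unlock()
	s, ok := t.backends[name]
	if !ok {
		s.State = StateUnknown
	}
	if err == nil {
		// Success clears the failure count and the last error.
		s = Status{State: StateHealthy, LastSuccess: time.Now()}
	} else {
		s.Failures++
		s.LastError = err.Error()
		if s.Failures >= t.threshold {
			s.State = StateUnhealthy
		}
	}
	t.backends[name] = s
	return s
}

func main() {
	tr := NewTracker(3)
	for _, err := range []error{errTimeout, errTimeout, errTimeout, nil} {
		s := tr.Record("backend-a", err)
		fmt.Printf("state=%s failures=%d lastError=%q\n", s.State, s.Failures, s.LastError)
	}
}
```

A status or metrics endpoint could then read `Get` for each backend without triggering a new probe.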
3. Circuit Breaker
- Three states: `closed` (normal), `open` (failing), `half-open` (testing recovery)
- Configurable failure threshold (default: 5 failures)
- Configurable timeout for open state (default: 60s)
- Automatic transition to half-open for recovery testing
- Track circuit breaker state per backend
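
The state machine described above could look roughly like this hand-written sketch. It is not the `sony/gobreaker` API; `Breaker`, `Allow`, `Report`, and `manualClock` are hypothetical names, the clock is injected so the example runs without sleeping, and (as a simplification) the half-open state admits every request rather than a single trial.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type BreakerState string

const (
	Closed   BreakerState = "closed"
	Open     BreakerState = "open"
	HalfOpen BreakerState = "half-open"
)

// Defaults taken from the issue text.
const (
	defaultFailureThreshold = 5
	defaultOpenTimeout      = 60 * time.Second
)

var errBackend = errors.New("backend call failed")

// manualClock lets the example move time forward without sleeping.
type manualClock struct{ t time.Time }

func (c *manualClock) Now() time.Time          { return c.t }
func (c *manualClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// Breaker (hypothetical) is a per-backend circuit breaker.
type Breaker struct {
	mu        sync.Mutex
	threshold int
	timeout   time.Duration
	now       func() time.Time
	state     BreakerState
	failures  int
	openedAt  time.Time
}

func NewBreaker(threshold int, timeout time.Duration, now func() time.Time) *Breaker {
	return &Breaker{threshold: threshold, timeout: timeout, now: now, state: Closed}
}

// Allow reports whether a request may be forwarded. Once the open
// timeout has elapsed, the breaker moves to half-open.
func (b *Breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == Open && b.now().Sub(b.openedAt) >= b.timeout {
		b.state = HalfOpen
	}
	return b.state != Open
}

// Report records the outcome of a forwarded request.
func (b *Breaker) Report(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.state, b.failures = Closed, 0
		return
	}
	b.failures++
	// A failed half-open trial reopens immediately.
	if b.state == HalfOpen || b.failures >= b.threshold {
		b.state, b.openedAt = Open, b.now()
	}
}

func (b *Breaker) State() BreakerState {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.state
}

func main() {
	clk := &manualClock{}
	b := NewBreaker(defaultFailureThreshold, defaultOpenTimeout, clk.Now)
	for i := 0; i < defaultFailureThreshold; i++ {
		b.Report(errBackend)
	}
	fmt.Println(b.Allow(), b.State()) // breaker has tripped
	clk.Advance(defaultOpenTimeout)
	fmt.Println(b.Allow(), b.State()) // recovery trial permitted
	b.Report(nil)
	fmt.Println(b.Allow(), b.State()) // trial succeeded
}
```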
4. Backend Availability Management
- Remove unhealthy backend tools from aggregated capabilities
- Return error when routing to unavailable backend
- Automatically restore backend when health recovers
- Log backend state transitions (healthy ↔ unhealthy)
- Emit metrics for monitoring systems
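
To make the capability filtering and routing behaviour concrete, here is a small sketch. `Router`, `Register`, `SetHealth`, and `ErrBackendUnavailable` are invented for this example, the backend and tool names are placeholders, and backends are assumed to start out healthy when registered. The sketch is not safe for concurrent use; a real implementation would need locking.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sort"
)

// ErrBackendUnavailable is returned when a tool's backend is unhealthy.
var ErrBackendUnavailable = errors.New("backend unavailable")

// Router (hypothetical) maps tools to backends and tracks backend health.
type Router struct {
	tools   map[string]string // tool name -> owning backend
	healthy map[string]bool   // backend -> last known health
}

func NewRouter() *Router {
	return &Router{tools: map[string]string{}, healthy: map[string]bool{}}
}

// Register adds a backend's tools. Assumption: a backend starts healthy.
func (r *Router) Register(backend string, tools ...string) {
	r.healthy[backend] = true
	for _, t := range tools {
		r.tools[t] = backend
	}
}

// SetHealth updates a backend's health and logs only real transitions.
func (r *Router) SetHealth(backend string, healthy bool) {
	if r.healthy[backend] == healthy {
		return
	}
	r.healthy[backend] = healthy
	log.Printf("backend %q: healthy %v -> %v", backend, !healthy, healthy)
}

// Capabilities lists the tools of healthy backends only, sorted by name.
func (r *Router) Capabilities() []string {
	var out []string
	for tool, backend := range r.tools {
		if r.healthy[backend] {
			out = append(out, tool)
		}
	}
	sort.Strings(out)
	return out
}

// Route returns the backend for a tool, or an error if it is unavailable.
func (r *Router) Route(tool string) (string, error) {
	backend, ok := r.tools[tool]
	if !ok {
		return "", fmt.Errorf("unknown tool %q", tool)
	}
	if !r.healthy[backend] {
		return "", fmt.Errorf("%w: %s", ErrBackendUnavailable, backend)
	}
	return backend, nil
}

func main() {
	r := NewRouter()
	r.Register("backend-a", "search", "fetch")
	r.Register("backend-b", "notify")
	r.SetHealth("backend-b", false)
	fmt.Println(r.Capabilities())
	_, err := r.Route("notify")
	fmt.Println(errors.Is(err, ErrBackendUnavailable))
	r.SetHealth("backend-b", true)
	fmt.Println(r.Capabilities())
}
```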
5. Failure Modes
- `fail` mode (default): fail the entire request if a backend is unavailable
- `best_effort` mode: return partial results and include errors for failed backends
- Configurable per vMCP instance
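
The difference between the two modes can be illustrated with a small aggregation helper. This is a sketch under assumptions: `Aggregate`, `Call`, and `Result` are hypothetical, backends are called sequentially for simplicity, and each backend's response is reduced to a string.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

type FailureMode string

const (
	ModeFail       FailureMode = "fail"
	ModeBestEffort FailureMode = "best_effort"
)

// errDown is a placeholder backend error (assumption).
var errDown = errors.New("backend unavailable")

// Call stands in for one request to one backend.
type Call func() (string, error)

// Result carries per-backend payloads and, in best_effort mode, per-backend errors.
type Result struct {
	Values map[string]string
	Errors map[string]error
}

// Aggregate calls every backend in name order and applies the failure mode.
func Aggregate(mode FailureMode, calls map[string]Call) (Result, error) {
	names := make([]string, 0, len(calls))
	for n := range calls {
		names = append(names, n)
	}
	sort.Strings(names)

	res := Result{Values: map[string]string{}, Errors: map[string]error{}}
	for _, n := range names {
		v, err := calls[n]()
		if err == nil {
			res.Values[n] = v
			continue
		}
		if mode != ModeBestEffort {
			// fail mode: the first backend error fails the whole request.
			return Result{}, fmt.Errorf("backend %s: %w", n, err)
		}
		res.Errors[n] = err
	}
	return res, nil
}

func main() {
	calls := map[string]Call{
		"backend-a": func() (string, error) { return "tools-a", nil },
		"backend-b": func() (string, error) { return "", errDown },
	}
	_, err := Aggregate(ModeFail, calls)
	fmt.Println("fail:", err)
	res, _ := Aggregate(ModeBestEffort, calls)
	fmt.Println("best_effort:", res.Values, res.Errors)
}
```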
6. Integration Points
- Update aggregated capabilities when backend health changes
- Routing layer checks health before forwarding
- Status reporting includes backend health summary
- Metrics exported for Prometheus/observability
Implementation Notes
- Health checks run in background goroutines
- Use context for cancellation on shutdown
- Circuit breaker prevents cascading failures
- Cache health status so that callers read the last known state instead of triggering a new check per request
- Transition events should be logged and emitted as metrics
- Follow existing ToolHive observability patterns in `pkg/telemetry/`
- Consider using `sony/gobreaker` or a similar library
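
The goroutine, cancellation, and caching notes above might combine as in the sketch below. `Monitor`, `Start`, `Stop`, and the two stub probes are hypothetical; each backend gets one goroutine that probes immediately and then on every tick, and readers consult the cached result rather than probing themselves. Telemetry and the circuit breaker are left out to keep the example short.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

const defaultInterval = 30 * time.Second

// Probe stands in for one health check request to a backend.
type Probe func(ctx context.Context) error

// Stub probes used by the example (assumptions, not real MCP calls).
func okProbe(context.Context) error   { return nil }
func failProbe(context.Context) error { return errors.New("ping failed") }

// Monitor (hypothetical) runs one background goroutine per backend.
type Monitor struct {
	mu     sync.RWMutex
	cache  map[string]bool // backend -> result of the latest probe
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

// Start launches the health check goroutines.
func Start(interval time.Duration, probes map[string]Probe) *Monitor {
	ctx, cancel := context.WithCancel(context.Background())
	m := &Monitor{cache: make(map[string]bool), cancel: cancel}
	for name, probe := range probes {
		m.wg.Add(1)
		go m.loop(ctx, name, probe, interval)
	}
	return m
}

// loop probes once right away, then on every tick until ctx is cancelled.
func (m *Monitor) loop(ctx context.Context, name string, probe Probe, interval time.Duration) {
	defer m.wg.Done()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		healthy := probe(ctx) == nil
		m.mu.Lock()
		m.cache[name] = healthy
		m.mu.Unlock()
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// Healthy returns the cached result; known is false before the first probe.
func (m *Monitor) Healthy(name string) (healthy, known bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	healthy, known = m.cache[name]
	return healthy, known
}

// Stop cancels all goroutines and waits for them to exit.
func (m *Monitor) Stop() {
	m.cancel()
	m.wg.Wait()
}

func main() {
	m := Start(defaultInterval, map[string]Probe{
		"backend-a": okProbe,
		"backend-b": failProbe,
	})
	m.Stop() // every goroutine has completed at least one probe by now
	for _, name := range []string{"backend-a", "backend-b"} {
		h, known := m.Healthy(name)
		fmt.Printf("%s healthy=%v known=%v\n", name, h, known)
	}
}
```

Because each goroutine probes before it first checks for cancellation, `Stop` returning guarantees that the cache holds one result per backend, which keeps the example deterministic.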
Reference
- vMCP Design Proposal - Backend Unavailability, Partial Failures, Circuit Breaker sections
Acceptance Criteria
- [ ] Periodic health check implementation
- [ ] Configurable health check interval
- [ ] Backend health state tracking (healthy/unhealthy/unknown)
- [ ] Consecutive failure counting
- [ ] Unhealthy threshold configuration
- [ ] Circuit breaker implementation (closed/open/half-open states)
- [ ] Configurable failure threshold for circuit breaker
- [ ] Configurable timeout for open state
- [ ] Automatic transition to half-open for recovery testing
- [ ] Remove unhealthy backend tools from capabilities
- [ ] Error response when routing to unavailable backend
- [ ] Automatic restoration when backend recovers
- [ ] Health state transition logging
- [ ] Partial failure mode support (fail vs best_effort)
- [ ] Metrics emission for observability
- [ ] Unit tests for health check logic
- [ ] Unit tests for circuit breaker state machine
- [ ] Integration tests with flaky mock backends
- [ ] E2E tests with backend failure and recovery scenarios