toolhive

toolhive copied to clipboard

Published 3 weeks ago •

Reame
Issues

Add health monitoring and circuit breaker to vMCP server

Open JAORMX opened this issue 3 weeks ago • 0 comments

Description

Implement health monitoring and circuit breaker patterns to detect and handle backend failures gracefully.

Scope:

Periodic health checks for backend MCP servers
Backend health status tracking and reporting
Circuit breaker implementation for failing backends
Automatic backend removal/restoration based on health
Health status reflected in vMCP status and capabilities

Key Components

1. Backend Health Checks

Periodic health check requests to each backend
Configurable check interval (default: 30s)
Track consecutive failures
Mark backend unhealthy after threshold (default: 3 failures)
Health check endpoint: MCP ping or tools/list

2. Health Status Tracking

Per-backend health state: healthy, unhealthy, unknown
Last successful health check timestamp
Failure count and error messages
Health status exposed in vMCP status/metrics

3. Circuit Breaker

Three states: closed (normal), open (failing), half-open (testing recovery)
Configurable failure threshold (default: 5 failures)
Configurable timeout for open state (default: 60s)
Automatic transition to half-open for recovery testing
Track circuit breaker state per backend

4. Backend Availability Management

Remove unhealthy backend tools from aggregated capabilities
Return error when routing to unavailable backend
Automatically restore backend when health recovers
Log backend state transitions (healthy ↔ unhealthy)
Emit metrics for monitoring systems

5. Failure Modes

fail mode (default): Fail entire request if backend unavailable
best_effort mode: Return partial results, include errors for failed backends
Configurable per vMCP instance

6. Integration Points

Update aggregated capabilities when backend health changes
Routing layer checks health before forwarding
Status reporting includes backend health summary
Metrics exported for Prometheus/observability

Implementation Notes

Health checks run in background goroutines
Use context for cancellation on shutdown
Circuit breaker prevents cascading failures
Health status should be cached to avoid repeated checks
Transition events should be logged and emitted as metrics
Follow existing ToolHive observability patterns in pkg/telemetry/
Consider using sony/gobreaker or similar library

Reference

vMCP Design Proposal - Backend Unavailability, Partial Failures, Circuit Breaker sections

Acceptance Criteria

[ ] Periodic health check implementation
[ ] Configurable health check interval
[ ] Backend health state tracking (healthy/unhealthy/unknown)
[ ] Consecutive failure counting
[ ] Unhealthy threshold configuration
[ ] Circuit breaker implementation (closed/open/half-open states)
[ ] Configurable failure threshold for circuit breaker
[ ] Configurable timeout for open state
[ ] Automatic transition to half-open for recovery testing
[ ] Remove unhealthy backend tools from capabilities
[ ] Error response when routing to unavailable backend
[ ] Automatic restoration when backend recovers
[ ] Health state transition logging
[ ] Partial failure mode support (fail vs best_effort)
[ ] Metrics emission for observability
[ ] Unit tests for health check logic
[ ] Unit tests for circuit breaker state machine
[ ] Integration tests with flaky mock backends
[ ] E2E tests with backend failure and recovery scenarios

Dec 02 '25 20:12 JAORMX

Labels

enhancement

go

api

telemetry

vmcp

Owner

Other Repo Issues