
[SaaS] Increase Health Check Timeout to Prevent Cascade Failures

Open gagantrivedi opened this issue 4 months ago • 0 comments

Problem

During the outage on November 5th, 2025 (12:01 PM - 12:06 PM IST), all tasks were marked unhealthy because the /health endpoint started timing out. The current 5-second timeout is too aggressive: it caused a cascade failure that turned a 5-minute spike into a complete outage.

Current Configuration

Timeout: 5 seconds
Interval: 15 seconds
Unhealthy threshold: 2 consecutive failures
Healthy threshold: 2 consecutive successes
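
As a rough worked example of what these settings imply, the sketch below estimates how long a task that stops answering /health takes to be marked unhealthy. It assumes checks start on a fixed interval and that a timed-out check is recorded as failed once its timeout elapses. The real load balancer's timing may differ, and the function name is illustrative only.

```python
def seconds_until_unhealthy(timeout_s, interval_s, unhealthy_threshold):
    """Estimate time from the first failing check starting to the task being marked unhealthy.

    Assumption: checks start every `interval_s` seconds, and a timed-out check
    is recorded as failed `timeout_s` seconds after it starts.
    """
    if unhealthy_threshold < 1:
        raise ValueError("unhealthy_threshold must be at least 1")
    # The last of the consecutive failing checks starts (threshold - 1) intervals
    # after the first one, and is recorded as failed once its timeout elapses.
    return (unhealthy_threshold - 1) * interval_s + timeout_s


# Values from the configuration listed above.
current = seconds_until_unhealthy(timeout_s=5, interval_s=15, unhealthy_threshold=2)
print(current)
```

Under these assumptions, the current settings remove a task roughly 20 seconds after /health first starts timing out, which is well inside a 5-minute spike.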

Issue

When the API experienced a request spike, /health responses exceeded 5 seconds. This set off a cascade: health checks failed → tasks were marked unhealthy → the load balancer removed those tasks → the remaining tasks were overloaded → more health checks failed.
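
The feedback loop can be illustrated with a toy model. This is only a sketch under stated assumptions: each task has a fixed capacity (the numbers are made up, not taken from the incident), load is spread evenly across healthy tasks, and any task whose share exceeds its capacity fails its health check and is removed.

```python
def simulate_cascade(capacities, total_load):
    """Return (tasks_left, history) for a toy load-balancer ejection loop.

    capacities: hypothetical per-task capacity in requests/second.
    total_load: total incoming requests/second, split evenly across healthy tasks.
    history: list of (healthy_task_count, load_per_task), one entry per round.
    """
    healthy = list(capacities)
    history = []
    while healthy:
        per_task = total_load / len(healthy)
        history.append((len(healthy), per_task))
        # A task whose share exceeds its capacity times out on /health and is removed.
        survivors = [c for c in healthy if c >= per_task]
        if len(survivors) == len(healthy):
            break  # stable: no task was removed this round
        healthy = survivors
    return len(healthy), history


# A spike just above the weakest task's capacity ends with every task removed.
tasks_left, history = simulate_cascade([100, 110, 120, 130], total_load=440)
print(tasks_left, history)
```

In this model only one task is overloaded at first, but removing it pushes the per-task load past every remaining task's capacity, so the pool empties in the next round.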

Proposed Change

Option 1: Increase timeout to [10? 15?] seconds to tolerate brief slowdowns without marking tasks unhealthy.

Option 2: Rethink health checks entirely:

  • Use passive health checks based on actual request success rates
  • Implement graceful degradation instead of binary healthy/unhealthy
  • Add circuit breaker logic to prevent cascade failures
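
One way Option 2 could look is sketched below. It is a minimal illustration and not a proposal for the actual implementation: the class name, thresholds, and task IDs are all hypothetical. It tracks the success rate of real requests per task (passive checking) and caps the fraction of the pool that may be ejected at once, a simple fail-open safeguard rather than a full circuit breaker.

```python
from collections import defaultdict, deque


class PassiveHealthTracker:
    """Tracks recent request outcomes per task and limits how many tasks may be ejected."""

    def __init__(self, window=20, min_success_rate=0.5, max_eject_fraction=0.5):
        self.min_success_rate = min_success_rate
        self.max_eject_fraction = max_eject_fraction
        # Keep only the most recent `window` outcomes for each task.
        self.outcomes = defaultdict(lambda: deque(maxlen=window))

    def record(self, task_id, success):
        """Record the outcome of one real request served by task_id."""
        self.outcomes[task_id].append(bool(success))

    def success_rate(self, task_id):
        results = self.outcomes[task_id]
        # A task with no recorded requests is treated as healthy.
        return sum(results) / len(results) if results else 1.0

    def tasks_to_eject(self, task_ids):
        """Return the worst tasks below the threshold, capped at max_eject_fraction of the pool."""
        failing = [t for t in task_ids if self.success_rate(t) < self.min_success_rate]
        failing.sort(key=self.success_rate)  # worst first
        limit = int(len(task_ids) * self.max_eject_fraction)
        return failing[:limit]


tracker = PassiveHealthTracker(window=10, min_success_rate=0.5, max_eject_fraction=0.5)
tasks = ["task-a", "task-b", "task-c", "task-d"]
for task in tasks:
    for _ in range(10):
        tracker.record(task, success=False)  # every task is failing during the spike
ejected = tracker.tasks_to_eject(tasks)
print(len(ejected))
```

Even when every task looks unhealthy, at most half of the pool is ejected in this sketch, so the remaining tasks keep serving slow requests instead of the whole service going dark.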

Rationale:

  • Better to serve slow requests than mark everything dead
  • 5s is too tight for real-world load spikes
  • Current approach creates cascading failures instead of preventing them

Questions

  • [ ] What's acceptable /health response time under load?
  • [ ] Should we separate health check endpoint from main API?
  • [ ] Should we adjust failure/success thresholds too?
  • [ ] Can we use passive health monitoring instead?

gagantrivedi • Nov 06 '25 03:11