[SaaS] Increase Health Check Timeout to Prevent Cascade Failures

Open gagantrivedi opened this issue 4 months ago • 0 comments

Problem

During the November 5th, 2025 outage (12:01PM - 12:06PM IST), all tasks were marked unhealthy because /health endpoint started timing out. The current 5-second timeout is too aggressive and caused a cascade failure that made a 5-minute spike into a complete outage.

Current Configuration

Timeout: 5 seconds
Interval: 15 seconds
Unhealthy threshold: 2 consecutive failures
Healthy threshold: 2 consecutive successes

Issue

When the API experienced a request spike, /health responses exceeded 5 seconds. Health checks failed → tasks marked unhealthy → load balancer removed tasks → remaining tasks overloaded → more failures. Cascade effect.

Proposed Change

Option 1: Increase timeout to [10? 15?] seconds to tolerate brief slowdowns without marking tasks unhealthy.

Option 2: Rethink health checks entirely:

Use passive health checks based on actual request success rates
Implement graceful degradation instead of binary healthy/unhealthy
Add circuit breaker logic to prevent cascade failures

Rationale:

Better to serve slow requests than mark everything dead
5s is too tight for real-world load spikes
Current approach creates cascading failures instead of preventing them

Questions

[ ] What's acceptable /health response time under load?
[ ] Should we separate health check endpoint from main API?
[ ] Should we adjust failure/success thresholds too?
[ ] Can we use passive health monitoring instead?

Nov 06 '25 03:11 gagantrivedi