Health check should fail when internal errors are present

Open sergey-safarov opened this issue 4 months ago • 1 comments

Summary

It would be useful to be able to define settings such as:

internal_error_max_rate - for example 0.001, which means that if 0.1% of requests fail with a 5xx status code, healthcheck_fail is triggered.

healthcheck_retry_timeout - for example 600 seconds: how long to wait after the last internal error before the health check returns "200 OK" again.

This would allow load balancers such as AWS ALB to remove a failed CouchDB node from request distribution.
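To make the intended behaviour of the two settings concrete, here is a minimal sketch in Python. It is not CouchDB code. The class name `HealthTracker`, the use of 503 as the failing status, and the choice to compute the rate over all requests since the last recovery are assumptions for illustration only; the proposal above does not specify a measurement window.

```python
import time
from typing import Callable, Optional


class HealthTracker:
    """Hypothetical tracker illustrating the two proposed settings."""

    def __init__(
        self,
        internal_error_max_rate: float = 0.001,
        healthcheck_retry_timeout: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_rate = internal_error_max_rate
        self.retry_timeout = healthcheck_retry_timeout
        self.clock = clock  # injectable so the logic can be tested
        self.total = 0
        self.errors = 0
        self.last_error_at: Optional[float] = None
        self.failing = False

    def record(self, status_code: int) -> None:
        """Count one finished request and update the failing flag."""
        self.total += 1
        if 500 <= status_code <= 599:
            self.errors += 1
            self.last_error_at = self.clock()
            # Assumption: rate is measured since the last recovery.
            if self.errors / self.total >= self.max_rate:
                self.failing = True

    def healthcheck_status(self) -> int:
        """Return the HTTP status the health check would answer with."""
        if self.failing and self.last_error_at is not None:
            quiet_for = self.clock() - self.last_error_at
            if quiet_for < self.retry_timeout:
                return 503  # assumed failing status, not from the issue
            # Quiet long enough: start counting from scratch.
            self.total = 0
            self.errors = 0
            self.failing = False
        return 200
```

With the example values, one 5xx response among 100 requests is enough to exceed the 0.1% rate, and the health check keeps failing until 600 seconds pass without another internal error.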

An example where the CouchDB health check returns "200 OK" even though the node cannot pull a shard on the local filesystem: https://github.com/apache/couchdb/issues/4790

Aug 11 '25 20:08 sergey-safarov

We don't currently have an internal health check or alarm system. It's not a bad idea to have one, in principle, but since there is no such application or API currently, we'd have to design it. In the meantime it might be practical to drive it from metrics (_stats or _prometheus) or maybe a log parser.
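A metrics-driven check along these lines could be sketched as follows. This is only an illustration: it assumes an external script has already scraped per-status-code request counters from _stats or _prometheus into plain dictionaries, and the scraping itself is left out because the exact field layout should be checked against the CouchDB version in use. The function names and the `max_rate` parameter are hypothetical.

```python
from typing import Mapping


def five_xx_rate(prev: Mapping[str, int], curr: Mapping[str, int]) -> float:
    """Share of 5xx responses between two snapshots of cumulative counters.

    Keys are HTTP status codes as strings (e.g. "200", "500"); values are
    cumulative request counts, as an external scraper might collect them.
    """
    total = 0
    errors = 0
    for code, count in curr.items():
        delta = count - prev.get(code, 0)
        if delta < 0:
            # Counter went backwards, e.g. after a node restart:
            # treat the current value as the whole delta.
            delta = count
        total += delta
        if code.startswith("5"):
            errors += delta
    return errors / total if total else 0.0


def should_fail_healthcheck(
    prev: Mapping[str, int],
    curr: Mapping[str, int],
    max_rate: float = 0.001,
) -> bool:
    """True when the 5xx share in the interval reaches max_rate."""
    return five_xx_rate(prev, curr) >= max_rate
```

An external agent could call `should_fail_healthcheck` on each scrape interval and deregister the node from the load balancer, or answer the balancer's probe itself, when it returns `True`.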

Aug 13 '25 05:08 nickva