couchdb
Health check should report an error when internal errors are present
Summary
It would be useful to be able to define settings such as:
- internal_error_max_rate - for example 0.001, which means that if 0.1% of requests fail with a 5xx error code, the health check starts failing.
- healthcheck_retry_timeout - for example 600 seconds, how long to wait after the last internal error before the health check returns "200 OK" again.
This would make it possible to remove a failed CouchDB node from request distribution on load balancers such as AWS ALB.
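To make the intended behaviour concrete, here is a minimal Python sketch of how the two proposed settings could interact. It is not CouchDB code: the `HealthTracker` class, its method names, the use of cumulative counters, and the reset on recovery are all assumptions made for illustration. Only the two setting names and their example values come from this request.

```python
import time


class HealthTracker:
    """Hypothetical tracker illustrating the two proposed settings."""

    def __init__(self, internal_error_max_rate=0.001,
                 healthcheck_retry_timeout=600.0, clock=time.monotonic):
        self.max_rate = internal_error_max_rate
        self.retry_timeout = healthcheck_retry_timeout
        self.clock = clock  # injectable so the logic can be tested
        self.total = 0
        self.errors = 0
        self.last_error_at = None
        self.failing = False

    def record(self, status_code):
        """Count one finished request and update the failing flag."""
        self.total += 1
        if 500 <= status_code <= 599:
            self.errors += 1
            self.last_error_at = self.clock()
            # Assumption: the rate is computed over cumulative counters.
            if self.errors / self.total >= self.max_rate:
                self.failing = True

    def status(self):
        """Return the HTTP status the health check would answer with."""
        if self.failing:
            quiet_for = self.clock() - self.last_error_at
            if quiet_for < self.retry_timeout:
                return 503
            # No internal error for the whole timeout: recover and
            # start counting again (an assumed policy).
            self.failing = False
            self.total = 0
            self.errors = 0
        return 200
```

A load balancer polling an endpoint backed by `status()` would stop routing to the node while it answers 503 and resume once the node has been free of internal errors for `healthcheck_retry_timeout` seconds. A real implementation would more likely use a sliding time window than cumulative counters.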
Example: the CouchDB health check returns "200 OK" even when the node cannot pull a shard on the local filesystem: https://github.com/apache/couchdb/issues/4790
We don't currently have an internal health check or alarm system. It's not a bad idea to have one, in principle, but since there is no such application or API currently we'd have to design it. In the meantime it might be practical to drive it from metrics (_stats or _prometheus) or maybe a log parser.
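As a rough sketch of the metrics-driven approach, an external checker could compare two snapshots of per-status-code request counters and compute the 5xx rate for the interval between them. The function name and the input shape (a mapping from status code to a cumulative count) are assumptions for illustration. Fetching and parsing the counters from _stats or _prometheus is not shown here.

```python
def internal_error_rate(previous, current):
    """Return the share of 5xx responses between two counter snapshots.

    Both arguments are assumed to map an HTTP status code (as a string)
    to a cumulative request count, e.g. {"200": 1000, "500": 1}.
    """
    delta_total = 0
    delta_errors = 0
    for code, count in current.items():
        delta = count - previous.get(code, 0)
        if delta < 0:
            # Counter went backwards, e.g. after a node restart:
            # treat the current value as the whole delta.
            delta = count
        delta_total += delta
        if str(code).startswith("5"):
            delta_errors += delta
    # No traffic in the interval means no evidence of errors.
    return delta_errors / delta_total if delta_total else 0.0


# Usage: compare the result with a threshold such as 0.001.
before = {"200": 1000, "500": 1}
after = {"200": 1990, "500": 11}
unhealthy = internal_error_rate(before, after) >= 0.001
```

The checker would then expose its own health endpoint for the load balancer, since the node's built-in health check does not take these counters into account.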