
reverseproxy: Health as a ratio of successful requests

Open francislavoie opened this issue 2 years ago • 5 comments

Closes #4949

This makes it possible to keep track of successful requests over time for each upstream. It uses the same mechanism as failure duration: the count is incremented right after the response is written by the proxy, and a goroutine is spawned to decrement the counter after a certain delay.

This also adds a minimum successful requests ratio, so if there are too many failures compared to successes, the upstream is marked unhealthy. This is paired with a minimum success count (defaulting to 5 as a reasonable lower bound) so that the ratio only applies once there are at least N successes in memory; otherwise a single failure at the start might take out the server.

Example config:

reverse_proxy :8002 {
	success_duration 10s
	fail_duration 10s
	max_fails 10
	unhealthy_status 500
	min_success_ratio 9/10
	min_successes 5
}

Essentially, the above means that within the past 10s, there must have been fewer than 10 fails or more than 90% success (unless there are fewer than 5 successes); otherwise the upstream is marked unhealthy. (For testing, I used unhealthy_status 500 as a reliable way to intentionally increment fails.)
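For a rough feel of the behaviour described above, here's a minimal Go sketch of the counter-plus-delayed-decrement mechanism and the gated ratio check. The type and field names here are my own invention for illustration, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

const (
	successDuration = 10 * time.Second // how long one success stays counted
	minSuccesses    = 5                // ratio check only applies past this floor
	minSuccessRatio = 0.9              // e.g. 9/10
)

// upstreamHealth tracks recent successes and failures for one upstream.
type upstreamHealth struct {
	successes atomic.Int64
	fails     atomic.Int64
}

// countSuccess records a success, then forgets it after successDuration,
// so the counter approximates a sliding window over recent requests.
func (u *upstreamHealth) countSuccess() {
	u.successes.Add(1)
	time.AfterFunc(successDuration, func() { u.successes.Add(-1) })
}

// healthy applies the ratio check, gated behind the minimum success count
// so a single early failure can't take the upstream out.
func (u *upstreamHealth) healthy() bool {
	s, f := u.successes.Load(), u.fails.Load()
	if s < minSuccesses {
		return true // not enough data yet to judge
	}
	return float64(s)/float64(s+f) >= minSuccessRatio
}

func main() {
	var u upstreamHealth
	for i := 0; i < 9; i++ {
		u.countSuccess()
	}
	u.fails.Add(1)
	fmt.Println(u.healthy()) // 9/10 meets the 0.9 ratio, so: true
}
```

Note this is exactly where the one-goroutine-per-success concern raised below comes from: every countSuccess call schedules its own timer.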

This is obviously more useful with multiple upstreams, but the above example uses just one to more quickly and reliably show the effect of the config.

francislavoie avatar Feb 26 '23 07:02 francislavoie

Thanks for working on this. My main concern after a first pass through the code (which I know is still a draft) is that it requires one goroutine per successful request. This is similar to failure counting, but if we assume that the majority of requests are successful, this will use significantly more resources than counting failures.

What I've typically leaned on instead is a ring buffer or a rolling/online algorithm, depending on what the needs are. For example, when I need to compute the standard deviation over a sliding window, I've used Welford's Online Algorithm, which means I don't have to iterate the whole data set each time a new point arrives.

I wonder if we could use something similar for this calculation, where we instead keep the last N data points (if needed) or, heck, maybe just a moving average? https://en.wikipedia.org/wiki/Moving_average
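One common O(1)-memory take on the moving-average idea is an exponentially weighted moving average, where old requests decay out automatically. A sketch (my own illustration, not anything proposed in the PR):

```go
package main

import "fmt"

// ewma keeps an exponentially weighted moving average of a 0/1 success
// signal. Old observations decay out on their own, so there is no
// per-request state, timer, or cleanup goroutine.
type ewma struct {
	alpha float64 // weight of the newest sample, e.g. 0.2
	value float64
	init  bool
}

// observe folds one request outcome into the average.
func (e *ewma) observe(success bool) {
	x := 0.0
	if success {
		x = 1.0
	}
	if !e.init {
		e.value, e.init = x, true
		return
	}
	e.value = e.alpha*x + (1-e.alpha)*e.value
}

func main() {
	e := ewma{alpha: 0.2}
	for i := 0; i < 20; i++ {
		e.observe(true)
	}
	e.observe(false) // one failure only nudges the ratio down
	fmt.Printf("%.2f\n", e.value)
}
```

This also answers the "forget old requests" question below for the EWMA case: forgetting is built into the decay factor.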

mholt avatar Apr 15 '23 14:04 mholt

Yeah, I'm definitely concerned about goroutine spam with this approach.

How would a rolling average type situation "forget" about old requests though? If there's no mechanism to remove old counts/entries, then once you fall below the ratio, there's no way for the upstream to become healthy again.

francislavoie avatar Apr 15 '23 16:04 francislavoie

How would a rolling average type situation "forget" about old requests though?

This is a great question. One way is to configure a window size, e.g. the last N requests. You could implement that with an array of size N that you iterate round-robin style, but then computing an average requires iterating the entire array on each request. An online average computation shouldn't require that extra pass through the data, IIRC.
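A sketch of the ring-buffer idea, but with a running count maintained alongside it so each update is O(1) and no full pass over the array is needed. Again, this is just an illustration under the "last N requests" assumption, not Caddy code:

```go
package main

import "fmt"

// window tracks the success ratio over the last N requests. A ring buffer
// holds the raw outcomes, and a running success count is updated
// incrementally: evict the oldest entry, add the newest, no iteration.
type window struct {
	buf       []bool
	pos       int // next slot to overwrite
	filled    int // how many slots hold real data
	successes int // running count over the buffer
}

func newWindow(n int) *window {
	return &window{buf: make([]bool, n)}
}

// record stores one outcome, forgetting the oldest once the window is full.
func (w *window) record(success bool) {
	if w.filled == len(w.buf) {
		if w.buf[w.pos] {
			w.successes-- // the evicted entry leaves the count
		}
	} else {
		w.filled++
	}
	w.buf[w.pos] = success
	if success {
		w.successes++
	}
	w.pos = (w.pos + 1) % len(w.buf)
}

// ratio is the success fraction over the current window contents.
func (w *window) ratio() float64 {
	if w.filled == 0 {
		return 1.0 // no data yet; don't mark unhealthy
	}
	return float64(w.successes) / float64(w.filled)
}

func main() {
	w := newWindow(4)
	for _, s := range []bool{true, true, false, true, true} {
		w.record(s)
	}
	// The first success has fallen out of the 4-slot window: 3/4 remain.
	fmt.Println(w.ratio()) // 0.75
}
```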

I need to find a little time to double-check / look this up, but I can probably recommend something along those lines.

mholt avatar Apr 19 '23 19:04 mholt