Proxy load balancing retry issues
There are a few different points to talk about here.
Firstly -- I've been playing around a bit with `lb_try_duration` and I think this doesn't necessarily always have the desired effect because of when the "duration" starts. The timer for `lb_try_duration` starts right when the `reverse_proxy` module gets the request.
This sounds fine in isolation, but if you consider that errors that might trigger retries might take a while to actually happen (e.g. dial timeout after Caddy's default of 10s -- more on that later) then it becomes hard to configure Caddy to only retry once or twice after a failure. It's probably undesirable to set an lb_try_duration of 15 seconds just to make sure the 10 second timeouts are caught and trigger retries.
For example, try a config like this:
```caddyfile
{
	debug
}

:7000 {
	reverse_proxy 10.0.0.255:80 :7001 {
		lb_policy first
		lb_try_duration 3s
		fail_duration 30s
	}
}

:7001 {
	respond "7001"
}
```
Assuming 10.0.0.255:80 does not exist, this should trigger a dial timeout (default 10s). Trying it with `curl localhost:7000`, you see it hang for 10 seconds, then the logs show `ERROR dial tcp 10.0.0.255:80: i/o timeout`, and no fallback to :7001 happens. If we increase `lb_try_duration` to 15s instead, then we get a hang for 10 seconds followed by a successful fallback to :7001, and no error appears in the logs (though a DEBUG-level log does appear for the initial error).
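For reference, the variant that does fall back successfully only differs in the try duration exceeding the dial timeout (same setup as above, values illustrative):

```caddyfile
:7000 {
	reverse_proxy 10.0.0.255:80 :7001 {
		lb_policy first
		# must exceed the 10s default dial timeout,
		# otherwise the retry window closes before the first failure
		lb_try_duration 15s
		fail_duration 30s
	}
}
```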
There are a few different ways I can think of that this could be made better:
- Make the try duration timer start from the time the first attempt fails, instead of the beginning of the request. This would mean that the number of retries would be easier to predict. This has some implications though, because not all errors take the same amount of time. If some errors happen fast, you don't want to cause a stampede of stacked retries when you have a busy server, so it would be better to be able to cap the number of retries (because a 15s try duration with a 250ms interval has a potential of 15s/0.25s = 60 retries).

- Implement an `lb_try_min_times <amount>` option which can complement or replace `lb_try_duration`; a minimum number of times to try to pick an upstream (default 1). If specified alongside `lb_try_duration`, then it should try again even if the duration has elapsed. This would make it possible for slow errors to at least attempt a retry, but still allow capping fast errors with the duration timer. This could still end up leading to long waits for clients though, if each retry takes a long time (e.g. 10s for a dial timeout, so `lb_try_min_times 3` would mean a 30s wait if all upstreams are down, at worst). So obviously in these situations the dial timeout is too long, so...

- Reduce the default `http` transport's dial timeout from 10s to maybe 5s or even 2s. I'm sure this might have implications for some potentially high-latency scenarios, but you'd hope not. FWIW, Envoy uses a 5s default, Traefik uses a 30s default, and HAProxy has no default but recommends >3s, with most examples in the docs at 5s.
I wouldn't call this a complete solution, but it's "a start".
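The third idea can already be approximated today, since the `http` transport's `dial_timeout` is configurable per site block; a sketch (values here are illustrative, not recommendations):

```caddyfile
:7000 {
	reverse_proxy 10.0.0.255:80 :7001 {
		lb_policy first
		lb_try_duration 5s
		transport http {
			# fail fast so the 5s try window can actually cover a retry
			dial_timeout 2s
		}
	}
}
```

Note the keepalive caveat mentioned later in this thread: configuring the transport at all currently loses the keepalive defaults.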
As a note, Caddy currently retries all `DialError`s, or any error for GET requests, by default (any error emitted by Caddy or the stdlib), and does not retry any status code errors received from upstream (e.g. if the upstream returned a 502 response of its own). There is the `retry_match` mechanism (which is lacking Caddyfile support -- that could probably be implemented easily) which allows for matching requests, but doesn't allow for retrying on specific kinds of responses unless `handle_response` is used, which is kinda janky. It would be nice to have support for matching responses that should be retried.
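Since `retry_match` lacks Caddyfile support, using it today means dropping to JSON config. A rough sketch of where it lives in the `reverse_proxy` handler (field placement per my reading of the JSON structure, so treat this as an assumption; addresses illustrative):

```json
{
	"handler": "reverse_proxy",
	"upstreams": [
		{ "dial": "10.0.0.255:80" },
		{ "dial": "localhost:7001" }
	],
	"load_balancing": {
		"try_duration": "5s",
		"retry_match": [
			{ "method": ["GET", "HEAD"] }
		]
	}
}
```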
HAProxy has some interesting stuff on the topic of retries that we could take inspiration from: https://cbonte.github.io/haproxy-dconv/2.4/configuration.html#4-retries
HAProxy also supports "connect-only" or "connect+tls" active checks -- this would be a nice-to-have for Caddy as well, since they're much lighter weight than actual HTTP health checks. HAProxy defaults to 2s interval for those.
As an aside, issues https://github.com/caddyserver/caddy/issues/1545 and https://github.com/caddyserver/caddy/issues/4174 are related, e.g. Caddy doesn't use more than the first SRV backend when retrying because there's no state about selection kept for the lifetime of a request which would make it possible to try "lower priority" results of a DNS lookup.
While investigating this, I noticed that keepalive defaults for the http transport only seem to apply if no transport was configured. This seems like a bug; the keepalives should have defaults regardless. This means that if you configure dial_timeout to a shorter value like 2s to work around the issues above, then you lose the keepalives altogether. This should be fixed, but didn't really warrant its own issue, so I'm mentioning it here for now.
Great points, thanks for bringing them up for discussion.
> Firstly -- I've been playing around a bit with `lb_try_duration` and I think this doesn't necessarily always have the desired effect because of when the "duration" starts. The timer for `lb_try_duration` starts right when the `reverse_proxy` module gets the request.
Hmm, yeah that is tricky because the idea is that you don't want the request to block for more than lb_try_duration in attempts to reach an available backend. It's probably not safe if the first backend tried takes an unbounded amount of time. This can, of course, be controlled by the dial timeout as you mention...
As for possible improvements:
> - Make the try duration timer start from the time the first attempt fails, instead of the beginning of the request.
:x: Let's try to avoid this, as the first backend that is tried still counts as a "try".
> - Implement an `lb_try_min_times` option which can complement or replace `lb_try_duration`; a minimum number of times to try to pick an upstream (default 1). If specified alongside `lb_try_duration`, then it should try again even if the duration has elapsed.
:x: I'd also like to avoid this, as it increases complexity; I'm also not sure this is ever what people actually want -- no one cares how many times backends are tried, they just want to establish the connection as quick as possible with some cap on latency (lb_try_duration).
> - Reduce the default `http` transport's dial timeout from 10s to maybe 5s or even 2s.
:heavy_check_mark: This is a good plan -- I'm actually surprised that other proxies' docs recommend > 3s, I would think that internal backends should be able to connect within even 1s -- 3 seems realllly slow, let alone 10. Let's definitely lower it. I'm not sure to what. 5s sounds good, but I'd probably be OK with going lower too. Who wants to wait even 2s just for a connection to be established to an internal backend??
(If only we had telemetry to know the percentile of backend connections that take longer than 2-3s. :roll_eyes: )
> As a note, Caddy currently retries all `DialError`s, or any error for GET requests, by default (any error emitted by Caddy or the stdlib), and does not retry any status code errors received from upstream (if the upstream returned a 502 response of its own or whatever). There is the `retry_match` mechanism (which is lacking Caddyfile support -- could probably be easily implemented) which allows for matching requests, but doesn't allow for retrying on specific kinds of responses unless `handle_response` is used which is kinda janky. It would be nice to have support for matching responses that should be retried.
Health checks absolutely allow for this very easily, you can expect a certain (class of) status code, as well as a body or timeout: https://caddyserver.com/docs/modules/http.handlers.reverse_proxy#health_checks/active
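A minimal active health check along those lines (subdirective names from the `reverse_proxy` Caddyfile docs; addresses and values illustrative):

```caddyfile
reverse_proxy backend1:9000 backend2:9000 {
	# mark an upstream unhealthy unless /healthz returns a 2xx
	# whose body contains "OK" within 2s
	health_uri /healthz
	health_interval 5s
	health_timeout 2s
	health_status 2xx
	health_body OK
}
```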
As for these things:
> As an aside, issues #1545 and #4174 are related, e.g. Caddy doesn't use more than the first SRV backend when retrying because there's no state about selection kept for the lifetime of a request which would make it possible to try "lower priority" results of a DNS lookup.
:heavy_check_mark: Would definitely accept a contribution here from someone.
> While investigating this, I noticed that keepalive defaults for the `http` transport only seem to apply if no transport was configured. This seems like a bug; the keepalives should have defaults regardless. This means that if you configure `dial_timeout` to a shorter value like 2s to work around the issues above, then you lose the keepalives altogether. This should be fixed, but didn't really warrant its own issue, so I'm mentioning it here for now.
:heavy_check_mark: Similarly, this would be a good fix I'd welcome from anyone who submits a PR. (I'm a little swamped right now with the website projects.)
> I'm also not sure this is ever what people actually want -- no one cares how many times backends are tried, they just want to establish the connection as quick as possible with some cap on latency (`lb_try_duration`).
I don't really agree, I'd expect it to be possible to easily configure Caddy to attempt at least one try on error if the error was in connecting or completing a handshake with the first upstream. I don't think time is necessarily an absolute here.
It's often more important for UX to make sure as many requests as possible actually give a positive result than unexpectedly having an error which may cause disruption for the user. If other upstreams are available, I'd rather it at least gave another a try before serving the error to the client.
> Health checks absolutely allow for this very easily, you can expect a certain (class of) status code, as well as a body or timeout: caddyserver.com/docs/modules/http.handlers.reverse_proxy#health_checks/active
Right, but that's only for active health checks, not for retries. I was talking about retries here. Those are different features altogether (but they complement each other).
Active health checks are useful to mark upstreams as unhealthy as soon as possible out of band from client requests, but they don't help with ensuring that retries happen to avoid responding to the client with errors when another upstream may successfully deal with the request.
> I don't really agree, I'd expect it to be possible to easily configure Caddy to attempt at least one try on error if the error was in connecting or completing a handshake with the first upstream.
It does try once, it just won't retry in that case. If it doesn't retry, then that's due to a misconfiguration regarding the try duration and timeouts. If the timeout is longer than the whole try duration, then it makes sense that it won't retry. I do agree we should adjust the default dial timeout.
> Right, but that's only for active health checks, not for retries. I was talking about retries here.
Oh, I see. Is there an actual use case that requires determining retry logic by reading the response body?
> If it doesn't retry, then that's due to a misconfiguration regarding the try duration and timeouts.
Again, I don't agree. Timeouts and try duration are not a good enough guarantee that at least one more retry will be attempted in all potentially recoverable error situations.
> Oh, I see. Is there an actual use case that requires determining retry logic by reading the response body?
Yes. Like I said, HAProxy allows for it -- it has different retry modes listed at the link I gave, including looking at the response code.
> Again, I don't agree. Timeouts and try duration are not a good enough guarantee that at least one more retry will be attempted in all potentially recoverable error situations.
Why?
Because timeouts aren't the only source of possible recoverable errors. See HAProxy's list here: https://cbonte.github.io/haproxy-dconv/2.4/configuration.html#4.2-retry-on
If high availability is the goal, a robust and flexible retry mechanism is pretty important.
Ok. Let's adjust the default timeout, and consider adding new optional retry logic; maybe similar to how active health checks work, as opposed to "number of times" though. Will have to give it more thought when I have a chance (am running out the door now).
@francislavoie Was work ever started regarding retries for a given response code? I'm receiving periodic 502s from Google Cloud Storage, and their docs indicate that one should retry with exponential backoff
Hmm, you're using Caddy to reverse proxy to GCS? That sounds strange.
But no, we haven't done any work on changing the conditions for retries yet. Some work may be done on that after #4470 is merged though.
We did adjust the dial_timeout defaults since, in #4436
So to recap here after #4756 is merged:

- We now have `lb_retries` for the maximum number of retries, which pairs with the existing `lb_try_duration` (total amount of time to retry) in a "whichever comes first" relationship.
- We now have `lb_retry_match` which matches requests on which it's okay to perform retries.

This is great, but there are still some areas where this is lacking:

- Retry policies, i.e. retrying even if the connection was successful, according to certain rules.
  - Right now, we do retry GET requests, or if `lb_retry_match` is configured and matches.
  - Maybe it should be possible to retry for some of these cases, on an opt-in basis (taken from https://cbonte.github.io/haproxy-dconv/2.4/configuration.html#4.2-retry-on). Some of these can be risky but we should let the user choose if they want it:
    - Empty response (no response body) or socket closed before receiving a response
    - On TLS handshake error (I'm not sure if we already do for these or not, but writing it down here for now)
    - Response timeout (i.e. the `response_header_timeout` HTTP transport option, or the `read_timeout` fastcgi transport option), i.e. the server took too long to respond
    - By response status code (e.g. the server actually responded with `503 Service Unavailable` from an upstream proxy)
  - The trouble with some of these though is that if we started copying the request body to an upstream at all, we can't rewind unless we buffered the request body (we support that with `buffer_requests` but I don't think we have anything in place to cause that to be reused). We will need to document that `lb_retry_match` should be used with care in case the request does have a body.
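Putting the two new knobs together, a Caddyfile sketch (upstream addresses and values are illustrative, and I'm hedging on the exact `lb_retry_match` matcher syntax):

```caddyfile
:7000 {
	reverse_proxy node1:8080 node2:8080 {
		# stop after 2 retries or 5s, whichever comes first
		lb_retries 2
		lb_try_duration 5s
		lb_try_interval 250ms
		# only retry requests that are safe to replay
		lb_retry_match {
			method GET HEAD
		}
	}
}
```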
I've given up on the idea of `lb_try_min_times`, because it doesn't play very nicely with `lb_try_duration`. Just don't set an `lb_try_duration`, and configure `lb_retries` for the number of times to retry. Someone might later ask for something like `lb_retry_timeout` as an upper bound to stop before `lb_retries` is reached after a certain time limit, but that use case is probably pretty unlikely to materialize.