Support falling back to cached upstream IPs when DNS lookups fail
We'd like to request support for temporarily falling back to cached upstream IP addresses in the event an upstream DNS query fails. We have a situation where an upstream service cluster can temporarily report as unhealthy, even though the vast majority of the upstream node IPs continue to function, and we'd prefer not to immediately fail all requests.
This situation looks like:
- We have a dynamic upstream service, load-balanced by an HTTP header, that resolves to multiple IPs
- This cluster becomes temporarily unavailable and SRV lookups return an NXDOMAIN from our service mesh
- Caddy fails the DNS lookup and returns a 502
We'd like to be able to tell Caddy to keep using the cached IPs for a specified period of time; this would help insulate against flaky services.
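For context, the relevant part of our proxy config looks roughly like the sketch below (the SRV name is a placeholder, not our real setup; in practice the name is chosen per request by an HTTP header, but that detail shouldn't matter here):

```
example.com {
	reverse_proxy {
		# Upstreams are resolved dynamically via SRV through the service mesh.
		dynamic srv _backend._tcp.service.internal {
			refresh 1m
		}
	}
}
```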
Let me know if I can provide more information, thank you!
Ahh, I didn't consider the DNS lookup to be a point of failure... :upside_down_face: (Insert "It's always DNS" meme here)
> We'd like to be able to tell Caddy to keep using the cached IPs for a specified period of time; this would help insulate against flaky services.
To be clear, this already happens (that's the "refresh" config parameter, default is 1 minute) -- but currently, after that period of time, it will try to get new ones, and if it can't it will error.
The simplest fix I can think of is to log the error and return the cached IPs. Would that work for you?
This may still perform a lot of failing lookups, though (although only ever one at a time), resulting in slight service degradation due to the higher latency.
So I guess that's why you're asking for a way to use the cached IPs "for a specified period of time" -- maybe we add a new config parameter like grace_period that specifies the length of time between lookup attempts. (Or a better name for this param, if you can think of one...)
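Very roughly, I'm picturing the refresh path changing along these lines (a sketch only, not the actual patch; locking and the real SRV/A upstream types are omitted, and names like gracePeriod are placeholders):

```go
// Sketch only: illustrates "log the error and return the cached IPs,
// but wait a grace period before retrying DNS". Not Caddy's actual code.
package dnsfallback

import (
	"context"
	"log"
	"net"
	"time"
)

type cachedUpstreams struct {
	name        string
	resolver    *net.Resolver
	refresh     time.Duration // how long a successful lookup stays fresh
	gracePeriod time.Duration // how long to wait between failed lookup attempts
	ips         []net.IP
	fetched     time.Time // last successful lookup
	lastTry     time.Time // last lookup attempt, success or failure
	lastFailed  bool
}

func (c *cachedUpstreams) get(ctx context.Context) ([]net.IP, error) {
	now := time.Now()

	// Cache is still fresh; no lookup needed.
	if now.Sub(c.fetched) < c.refresh {
		return c.ips, nil
	}

	// Cache is stale, but the last lookup failed recently: keep serving the
	// cached IPs until the grace period expires instead of hammering DNS.
	if c.lastFailed && now.Sub(c.lastTry) < c.gracePeriod && len(c.ips) > 0 {
		return c.ips, nil
	}

	c.lastTry = now
	addrs, err := c.resolver.LookupIP(ctx, "ip", c.name)
	if err != nil {
		c.lastFailed = true
		if len(c.ips) > 0 {
			// Lookup failed (e.g. NXDOMAIN from the service mesh):
			// log it and fall back to the previously cached IPs.
			log.Printf("[WARN] DNS refresh for %s failed; using %d cached IPs: %v",
				c.name, len(c.ips), err)
			return c.ips, nil
		}
		return nil, err // nothing cached to fall back to
	}

	c.lastFailed = false
	c.ips = addrs
	c.fetched = now
	return c.ips, nil
}
```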
Am I on the right track? I'll try to push a branch soon as a starting point.
@cds2-stripe @jjiang-stripe I've created PR #5832 that I believe will resolve the issue, if my intuition was right.
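Assuming it merges roughly as proposed, configuring it might look something like this (the grace_period name, the 5m value, and the exact Caddyfile placement are illustrative and could still change in review):

```
reverse_proxy {
	dynamic srv _backend._tcp.service.internal {
		refresh 1m
		# Keep serving the last cached IPs for up to 5 minutes of
		# failed lookups before erroring out.
		grace_period 5m
	}
}
```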
This looks great - thank you! I'll let you know when we're done testing.
Hopefully this fix works! I've seen this issue with A records as well. Thanks for reporting!