Support falling back to cached upstream IPs when DNS lookups fail
We'd like to request support for temporarily falling back to cached upstream IP addresses in the event an upstream DNS query fails. We have a situation where an upstream service cluster can temporarily report as unhealthy, even though the vast majority of the upstream node IPs continue to function, and we'd prefer not to immediately fail all requests.
This situation looks like:
- We have a dynamic upstream service, load-balanced by an HTTP header, that resolves to multiple IPs
- This cluster becomes temporarily unavailable and SRV lookups return an NXDOMAIN from our service mesh
- Caddy fails the DNS lookup and returns a 502
We'd like to be able to tell Caddy to keep using the cached IPs for a specified period of time; this would help insulate against flaky services.
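For context, the relevant part of our proxy config looks roughly like the sketch below (the SRV name is a placeholder, not our real setup; in practice the name is chosen per request by an HTTP header, but that detail shouldn't matter here):

```
example.com {
	reverse_proxy {
		# Upstreams are resolved dynamically via SRV through the service mesh.
		dynamic srv _backend._tcp.service.internal {
			refresh 1m
		}
	}
}
```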
Let me know if I can provide more information, thank you!
Ahh, I didn't consider the DNS lookup to be a point of failure... :upside_down_face: (Insert "It's always DNS" meme here)
> We'd like to be able to tell Caddy to keep using the cached IPs for a specified period of time; this would help insulate against flaky services.
To be clear, this already happens (that's the "refresh" config parameter, default is 1 minute) -- but currently, after that period of time, it will try to get new ones, and if it can't it will error.
The simplest fix I can think of is to log the error and return the cached IPs. Would that work for you?
This may still perform a lot of failing lookups, though (although only ever one at a time), resulting in slight service degradation due to the higher latency.
So I guess that's why you're asking for a way to use the cached IPs "for a specified period of time" -- maybe we add a new config parameter like grace_period that specifies the length of time between lookup attempts. (Or a better name for this param, if you can think of one...)
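Very roughly, I'm picturing the refresh path changing along these lines (a sketch only, not the actual patch; locking and the real SRV/A upstream types are omitted, and names like gracePeriod are placeholders):

```go
// Sketch only: illustrates "log the error and return the cached IPs,
// but wait a grace period before retrying DNS". Not Caddy's actual code.
package dnsfallback

import (
	"context"
	"log"
	"net"
	"time"
)

type cachedUpstreams struct {
	name        string
	resolver    *net.Resolver
	refresh     time.Duration // how long a successful lookup stays fresh
	gracePeriod time.Duration // how long to wait between failed lookup attempts
	ips         []net.IP
	fetched     time.Time // last successful lookup
	lastTry     time.Time // last lookup attempt, success or failure
	lastFailed  bool
}

func (c *cachedUpstreams) get(ctx context.Context) ([]net.IP, error) {
	now := time.Now()

	// Cache is still fresh; no lookup needed.
	if now.Sub(c.fetched) < c.refresh {
		return c.ips, nil
	}

	// Cache is stale, but the last lookup failed recently: keep serving the
	// cached IPs until the grace period expires instead of hammering DNS.
	if c.lastFailed && now.Sub(c.lastTry) < c.gracePeriod && len(c.ips) > 0 {
		return c.ips, nil
	}

	c.lastTry = now
	addrs, err := c.resolver.LookupIP(ctx, "ip", c.name)
	if err != nil {
		c.lastFailed = true
		if len(c.ips) > 0 {
			// Lookup failed (e.g. NXDOMAIN from the service mesh):
			// log it and fall back to the previously cached IPs.
			log.Printf("[WARN] DNS refresh for %s failed; using %d cached IPs: %v",
				c.name, len(c.ips), err)
			return c.ips, nil
		}
		return nil, err // nothing cached to fall back to
	}

	c.lastFailed = false
	c.ips = addrs
	c.fetched = now
	return c.ips, nil
}
```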
Am I on the right track? I'll try to push a branch soon as a starting point.
@cds2-stripe @jjiang-stripe I've created PR #5832 that I believe will resolve the issue, if my intuition was right.
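Assuming it merges roughly as proposed, configuring it might look something like this (the grace_period name, the 5m value, and the exact Caddyfile placement are illustrative and could still change in review):

```
reverse_proxy {
	dynamic srv _backend._tcp.service.internal {
		refresh 1m
		# Keep serving the last cached IPs for up to 5 minutes of
		# failed lookups before erroring out.
		grace_period 5m
	}
}
```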
This looks great - thank you! I'll let you know when we're done testing.
Hopefully this fix works! I've seen this issue with A records as well. Thanks for reporting!