coredns icon indicating copy to clipboard operation
coredns copied to clipboard

Configurable Timeout for `verify` Mode in `serve_stale` Cache Plugin

Open nitin-nizhawan opened this issue 6 months ago • 1 comments

What would you like to be added:

Please add support for an optional timeout parameter to the serve_stale 1h verify configuration of cache plugin. For example, the configuration

serve_stale 1h verify 100ms

should instruct CoreDNS to wait up to 100ms for the verify request to succeed. If upstream DNS does not respond within 100ms, If the upstream does not respond within this timeout, CoreDNS should immediately serve the stale cache entry, similar to the immediate mode.

Why is this needed: Currently, stale cache option in cache plugin provides two modes verify and immediate.

  • immediate: Immediately returns result from cache, in this case there is possibility of returning stale result even if the upstream DNS is healthy, this is something we would like to avoid.

  • verify: Tries to resolve query using upstream DNS before returning result from the stale_cache, but in this case if request takes a long time as upstream DNS in overloaded or unhealthy to resolve or the requests need to wait for full DNS timeout, the added latency in resolving the request is huge.

Both options are extreme: either no wait at all, or potentially a long wait for the upstream timeout. A configurable timeout would provide a balanced approach, allowing CoreDNS to serve fresh data when possible, but fall back to stale cache quickly if the upstream is slow or unavailable. This would improve both reliability and responsiveness for end users.

nitin-nizhawan avatar Jun 23 '25 11:06 nitin-nizhawan

I’ve started working on this.

✅ I reproduced the current serve_stale verify behavior using a minimal setup where the upstream DNS was unresponsive (127.0.0.1:5999).

With the following config:

.:1053 {
  forward . 127.0.0.1:5999
  cache 5 {
    serve_stale 1m verify
  }
  log
}

After the TTL expired, CoreDNS waited for the upstream to timeout (~1s) before responding (SERVFAIL or stale), confirming the latency issue.

I’ll be working on a patch to allow:

serve_stale 1m verify 100ms

This would cap the wait at 100ms for upstream verification before falling back to the stale cache, improving responsiveness under partial DNS outages.

I’ll follow up soon with a design note and open a clean PR once ready.

syedazeez337 avatar Aug 08 '25 05:08 syedazeez337