
Re-fetch almost-expired cache records

Open g0tar opened this issue 1 month ago • 3 comments

  • Program: dnsdist
  • Issue type: Feature request

Short description

Apparently dnsdist has no analogue of the recursor's recordcache.refresh_on_ttl_perc setting.

Usecase

Consider dnsdist used to load-balance traffic to non-authoritative forwarders. When a cached response's TTL is running low and a client asks for that record, I'd like the client to be answered from the cache and the query then (after the response) proactively forwarded to the backend, to keep the cache warm for the next client.

Description

When dnsdist is placed in front of an authoritative server, there is the dontAge cache setting, but it can't be used in a forwarder setup.

This is made worse by the fact that forwarders age their cached TTLs as well, so for domains with short TTLs one can get additional latency query after query. One can use SetReducedTTLResponseAction() or SetMinTTLResponseAction() to keep this saner, but eventually every cached entry expires and the next client runs into a latency spike. The same would happen with a packet cache initialized with a minTTL value set.
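To make the TTL-aging point concrete, here is a minimal sketch in Python (not dnsdist code). It assumes a forwarder that answers with the remaining TTL of its own cached entry, and a front cache that keeps the response for exactly the TTL it received; the function name and the numbers are illustrative only.

```python
def remaining_ttl(original_ttl: int, cached_at: int, now: int) -> int:
    """TTL an aging cache would put in its answer at time `now`."""
    return max(original_ttl - (now - cached_at), 0)


# The forwarder resolved the record at t=0 with an authoritative TTL of 30 s.
forwarder_cached_at = 0
original_ttl = 30
forwarder_expiry = forwarder_cached_at + original_ttl

# The front cache first asks the forwarder at t=20 and receives the aged TTL.
front_cached_at = 20
ttl_seen_by_front = remaining_ttl(original_ttl, forwarder_cached_at, front_cached_at)
front_expiry = front_cached_at + ttl_seen_by_front

# Both caches expire at the same moment, so the first client after that
# moment pays for a full resolution through both layers.
print(ttl_seen_by_front, front_expiry, forwarder_expiry)
```

Under these assumptions the front cache holds the entry for only 10 seconds, and it expires together with the forwarder's copy.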

The proposed solution makes every response available from the cache (even stale ones), while still keeping the cache warmer for subsequent queries. Having multiple forwarders should naturally help, as they wouldn't all be in sync, so even if one forwarder returns a short TTL, the next query might hit a second one with a longer TTL. When using forwarders, dnsdist doesn't know the original (authoritative) TTL for a domain, so this setting should be expressed in seconds, not as a percentage.

When set to a value longer than a domain's maximum TTL, this would forward each and every query to the forwarder, but more efficiently than disabling the cache entirely: the client would get the previously cached response first, and only then would the cache be refreshed.
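The requested behaviour can be sketched as a toy refresh-ahead cache in Python. This is only an illustration of the idea, not how dnsdist works or could be configured: the class, the `fetch` callback standing in for a forwarder, and the `refresh_below` parameter (a threshold in seconds) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Entry:
    answer: str
    expires_at: float  # absolute time at which the entry stops being valid


class RefreshAheadCache:
    """Toy model: answer from the cache, then refresh entries close to expiry."""

    def __init__(self, fetch: Callable[[str], Tuple[str, float]], refresh_below: float):
        self.fetch = fetch                  # hypothetical backend: name -> (answer, ttl)
        self.refresh_below = refresh_below  # threshold in seconds, not a percentage
        self.entries: Dict[str, Entry] = {}
        self.pending: List[str] = []        # names to refresh once the client is answered

    def lookup(self, name: str, now: float) -> Tuple[str, bool]:
        """Return (answer, served_from_cache)."""
        entry = self.entries.get(name)
        if entry is not None and entry.expires_at > now:
            if entry.expires_at - now < self.refresh_below and name not in self.pending:
                self.pending.append(name)   # answer first, refresh later
            return entry.answer, True
        return self._store(name, now), False  # cold or expired: the client waits

    def run_pending(self, now: float) -> None:
        """Refresh queued names; a real server would do this after replying."""
        while self.pending:
            self._store(self.pending.pop(0), now)

    def _store(self, name: str, now: float) -> str:
        answer, ttl = self.fetch(name)
        self.entries[name] = Entry(answer, now + ttl)
        return answer


calls: List[str] = []

def backend(name: str) -> Tuple[str, float]:
    calls.append(name)
    return f"answer-{len(calls)}", 30.0     # every answer carries a 30 s TTL

cache = RefreshAheadCache(backend, refresh_below=10.0)
first = cache.lookup("example.com", now=0.0)    # miss: goes to the backend
second = cache.lookup("example.com", now=25.0)  # hit with 5 s left: refresh queued
cache.run_pending(now=25.0)                     # refresh happens after the reply
third = cache.lookup("example.com", now=40.0)   # hit on the refreshed entry
```

With `refresh_below` set higher than the TTL of every entry, each hit queues a refresh, which corresponds to the "forward every query, but answer from the cache first" case described above.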

g0tar, Dec 02 '25 10:12

> Apparently dnsdist has no analogue of the recursor's recordcache.refresh_on_ttl_perc setting.

It doesn't, and it cannot: dnsdist has no record cache, and we don't keep the query packet in the cache, so even if we wanted to, we couldn't re-send a query to the backend later to get a fresh entry.

> When a cached response's TTL is running low and a client asks for that record, I'd like the client to be answered from the cache and the query then (after the response) proactively forwarded to the backend, to keep the cache warm for the next client.

The current code makes that difficult:

  • the buffer containing the query is reused to store the response on a cache-hit, to improve performance, so the query is gone by the time we could make a decision like that, and we don't want to allocate a new buffer for every cache-hit just to make this feature possible
  • dnsdist doesn't really know how to send a query that is not tied to a client's request. We could write code for that but that would be a decent amount of code which seems too much for this feature.

What we have been pondering lately is to add an option to skip a cache-hit 1 out of T times when the remaining validity of an entry is lower than a V value. This would allow some queries to go through to the backend, which either still has a valid cached entry for it, potentially triggering a refresh mechanism, or doesn't and refreshes it.
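The "skip 1 out of T" idea could look roughly like the following Python sketch. It is an assumption-laden illustration, not the planned dnsdist implementation: the class and parameter names are hypothetical, and a deterministic counter is used here where a real implementation might just as well pick skips at random.

```python
class SkippingCachePolicy:
    """Decide whether a cache hit may be used or must be treated as a miss."""

    def __init__(self, skip_every: int, min_remaining: float):
        if skip_every < 1:
            raise ValueError("skip_every must be at least 1")
        self.skip_every = skip_every        # T: skip one low-TTL hit out of T
        self.min_remaining = min_remaining  # V: remaining validity threshold, in seconds
        self._low_ttl_hits = 0              # counts hits on entries below V

    def use_cached(self, remaining: float) -> bool:
        if remaining <= 0:
            return False                    # already expired: normal miss
        if remaining >= self.min_remaining:
            return True                     # plenty of validity left: normal hit
        self._low_ttl_hits += 1
        # Every T-th low-TTL hit goes through to the backend instead.
        return self._low_ttl_hits % self.skip_every != 0


policy = SkippingCachePolicy(skip_every=3, min_remaining=10.0)
low_ttl_decisions = [policy.use_cached(5.0) for _ in range(6)]
fresh_decision = policy.use_cached(60.0)

always_skip = SkippingCachePolicy(skip_every=1, min_remaining=10.0)
always_skip_decisions = [always_skip.use_cached(5.0) for _ in range(3)]
```

In this sketch `skip_every=1` sends every query for a nearly expired entry to the backend, while larger values let most of them still be answered from the cache.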

rgacogne, Dec 02 '25 11:12

> What we have been pondering lately is to add an option to skip a cache-hit 1 out of T times when the remaining validity of an entry is lower than a V value.

While considering this FR I had a similar (much more trivial) idea: ignore entirely (drop?) cached entries whose TTL is below a specified threshold, and put a recursor with refresh_on_ttl_perc set behind dnsdist.

This way the recursor is mostly in sync with dnsdist, so both hold the same TTL value in their caches. By the time the recursor would refresh, at 90% of the (initial) TTL elapsed, dnsdist must already have dropped the entry from its cache so that it queries the backends again. Unfortunately I didn't find any way to do this, only LimitTTLResponseAction(), which gives a sane initial TTL to start with.

Another idea for dnsdist would be to return a different TTL than the one actually cached, but I guess this is not suitable for a packet cache either, so pre-expiring low-TTL entries looks better. I also considered running the recursor with a non-caching dnsdist, but I'd prefer to keep the possibility of a warm restart.

Your solution seems to fill the gap, provided I can set it to always skip the cache (1/1, so that the recursor always triggers a refresh instead of dnsdist hitting the about-to-expire entry again; or maybe 9/10, to keep some chance of serving stale records?). Is there a ticket open for this?

Actually, after more consideration: to keep the stale cache working, it could simply treat entries below some TTL value as already expired. This could be a property of the packet cache, minTTLtoRefresh. Instead of disabling the cache entirely, it would just make its contents expire more selectively (minTTL prevents packets from being saved into the cache, while my proposal makes them go away sooner than they otherwise would).
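The difference between the two knobs can be sketched in Python as follows. This is a toy model only: `min_ttl` mirrors the documented meaning of minTTL (do not store entries with a lower TTL), while `min_ttl_to_refresh` is the hypothetical setting proposed above and does not exist in dnsdist.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Cached:
    response: str
    expires_at: float


class ToyPacketCache:
    def __init__(self, min_ttl: float = 0.0, min_ttl_to_refresh: float = 0.0):
        self.min_ttl = min_ttl                        # checked when storing
        self.min_ttl_to_refresh = min_ttl_to_refresh  # hypothetical, checked when reading
        self.entries: Dict[str, Cached] = {}

    def insert(self, key: str, response: str, ttl: float, now: float) -> bool:
        if ttl < self.min_ttl:
            return False                              # never enters the cache
        self.entries[key] = Cached(response, now + ttl)
        return True

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self.entries.get(key)
        if entry is None:
            return None
        remaining = entry.expires_at - now
        if remaining <= 0 or remaining < self.min_ttl_to_refresh:
            return None                               # treated as already expired
        return entry.response


cache = ToyPacketCache(min_ttl=5.0, min_ttl_to_refresh=90.0)
stored_long = cache.insert("long", "resp", ttl=300.0, now=0.0)
stored_short = cache.insert("short", "resp", ttl=3.0, now=0.0)
early = cache.get("long", now=100.0)   # 200 s remaining: served from cache
late = cache.get("long", now=250.0)    # 50 s remaining: below threshold, miss
```

One consequence visible in this model: an entry whose initial TTL is already below `min_ttl_to_refresh` is stored but never served.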

g0tar, Dec 02 '25 11:12

I've just had, and verified, an idea: the in-cache duration is shortened by the maxTTL packet cache setting, i.e. it reduces the entry's lifespan but doesn't reduce the returned TTL value itself. Therefore my minTTLtoRefresh=90 is roughly similar to having a relative maxTTL=originalTTL-90. Only similar, as the difference seems to be "time to sit in the cache" (not the TTL value itself) vs "minimum TTL value of entries to be kept in the cache". Meanwhile, instead of disabling the cache entirely, I'll set maxTTL=90 just to keep it warmish.
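A small worked comparison of the two settings, as a Python sketch. It assumes that maxTTL caps the time an entry sits in the cache (as verified above) and that the hypothetical minTTLtoRefresh would drop an entry once its remaining TTL falls below the threshold; the function names are made up for the illustration.

```python
def seconds_in_cache_with_max_ttl(original_ttl: int, max_ttl: int) -> int:
    """Entry lifespan when the cache caps it at max_ttl."""
    return min(original_ttl, max_ttl)


def seconds_in_cache_with_min_ttl_to_refresh(original_ttl: int, threshold: int) -> int:
    """Entry lifespan when it is dropped once fewer than `threshold` seconds remain."""
    return max(original_ttl - threshold, 0)


for ttl in (60, 300, 3600):
    print(
        ttl,
        seconds_in_cache_with_max_ttl(ttl, 90),
        seconds_in_cache_with_min_ttl_to_refresh(ttl, 90),
    )
```

For an original TTL of 300 seconds, maxTTL=90 keeps the entry for 90 seconds, whereas minTTLtoRefresh=90 would keep it for 210 seconds, the same as a maxTTL of 300-90. For a 60-second TTL, maxTTL=90 keeps the entry for its full 60 seconds, while the threshold variant would never serve it from the cache.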

g0tar, Dec 03 '25 17:12