routedns Cache Prefetch Feature

To stay ahead of the curve, prefetching (refreshing) particular records in the cache before they expire can potentially keep things speedy.

Start prefetching once a particular percentage of the TTL has elapsed
- Default: 90%
Cached records should have at least a minimal number of hits to be considered for prefetching
- Either a static number, default 3, or hits per time-unit (3 per minute for example)

Optional:

Only prefetch when idle (like low number of queries, or CPU usage below percentage)

Additionally:

Keep stale records for a definable time-period when prefetching cannot fetch a fresh copy (internet-connection gone for example), by extending the TTL by a definable number
- Default: 120 seconds

Prefetching means getting a fresh copy before expiration and renewing the TTL.
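
A minimal sketch of the two core rules above (percentage of TTL elapsed, minimum hit count), written in Go since that is what routedns uses. The `entry` type, its fields, and `shouldPrefetch` are hypothetical names for illustration only, not routedns code.

```go
package main

import "fmt"

// entry is a hypothetical cache entry; the real routedns cache types may differ.
type entry struct {
	ttl  uint32 // TTL in seconds when the answer was cached
	age  uint32 // seconds elapsed since the answer was cached
	hits int    // cache hits since the answer was cached
}

// shouldPrefetch reports whether at least pct percent of the TTL has elapsed
// and the entry has been hit at least minHits times.
func shouldPrefetch(e entry, pct uint32, minHits int) bool {
	if e.ttl == 0 || e.hits < minHits {
		return false
	}
	// Compare age/ttl >= pct/100 without floating point.
	return uint64(e.age)*100 >= uint64(e.ttl)*uint64(pct)
}

func main() {
	// Proposed defaults from the list above: 90% of TTL, 3 hits.
	fmt.Println(shouldPrefetch(entry{ttl: 60, age: 50, hits: 3}, 90, 3)) // 83% elapsed
	fmt.Println(shouldPrefetch(entry{ttl: 60, age: 54, hits: 3}, 90, 3)) // 90% elapsed
	fmt.Println(shouldPrefetch(entry{ttl: 60, age: 59, hits: 2}, 90, 3)) // too few hits
}
```

With the proposed defaults, a record with a TTL of 60 seconds would become a prefetch candidate 54 seconds after it was cached, provided it was hit at least 3 times.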

Jun 04 '20 09:06 cbuijs

I remembered I saw this one before: https://coredns.io/plugins/cache/

Jun 04 '20 13:06 cbuijs

I like this feature. I think this should be next

Aug 15 '21 14:08 charlieporth1

Gentle nudge.

Jun 28 '22 10:06 cbuijs

@folbricht what would be the best way to implement this? The DNS resolving part is pretty simple because it's already in the code. My guess would be a polling interval in seconds, with each record's remaining TTL compared against an expiry percentage.

You could do it two ways. One is a dedicated TTL prefetch element, which could look like this:

[cache-prefetch]
cache-ttl-polling-check-interval=60 # in seconds, as TTLs are in seconds
min-record-time-remaining-percent=90 # percent based on the TTL time left; if the record would expire before the next polling interval, prefetch anyway
cache-resolver=<cache to perform prefetch on>
prefetch-resolver=<any cache resolver>
polling-record-size=1000 # optional limit on records checked per poll, for caches that are too large for the machine

Or we could add those parameters to the existing cache element, minus the resolvers.

Jun 30 '22 20:06 charlieporth1

@folbricht @cbuijs I've got most of the prefetching code written, but I can't get it to refetch the query, probably due to my inexperience with routedns and Go. If either of you wants to take a look, it's on the branch prefetch-feature. I get this error:

TRAC[0122] cache err prefetch                            err="query for 'short-ttl-record.dns.test.ctptech.dev.' timed out"

FYI, right now it's logging more than it should, for debugging purposes only. Documentation and comments will be added as part of the release. @folbricht Have a good Canada Day and don't work too hard ;)

Jul 01 '22 01:07 charlieporth1

@charlieporth1 No more Canada Day for me, different country.

So wrt prefetch, one thing we should make sure of is that only records that have recently been requested by a client should be considered for prefetch. If we refreshed everything in the cache, no records would ever be removed and the cache just keeps growing as it never forgets anything. What that means is that every record in the cache should have an additional attribute "this was recently fetched by a client". With that, prefetch can then decide what records to refresh (and reset the attribute).

With that, one could go further and perhaps count how often a record is requested by clients and refresh those more often, or implement a threshold like "only records that were requested N times in the last X seconds are prefetched".
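
A rough sketch of the "requested N times in the last X seconds" threshold, assuming each cached record carries a small list of recent request timestamps. `hitWindow` and its methods are made-up names for illustration; nothing here is taken from the routedns cache.

```go
package main

import "fmt"

// hitWindow records client request times (Unix seconds) for one cached record.
// Timestamps are assumed to be added in non-decreasing order.
type hitWindow struct {
	times []int64
}

// prune drops requests that are window seconds old or older.
func (h *hitWindow) prune(now, window int64) {
	i := 0
	for i < len(h.times) && h.times[i] <= now-window {
		i++
	}
	h.times = h.times[i:]
}

// record notes a client request at time now.
func (h *hitWindow) record(now, window int64) {
	h.times = append(h.times, now)
	h.prune(now, window)
}

// eligible reports whether the record was requested at least n times
// within the last window seconds.
func (h *hitWindow) eligible(now, window int64, n int) bool {
	h.prune(now, window)
	return len(h.times) >= n
}

func main() {
	var h hitWindow
	for _, t := range []int64{0, 10, 20} {
		h.record(t, 60)
	}
	fmt.Println(h.eligible(30, 60, 3)) // all three requests are still inside the window
	fmt.Println(h.eligible(75, 60, 3)) // only the request at t=20 is left
}
```

Because old requests fall out of the window, a record that clients stop asking for loses eligibility and can expire normally, which addresses the "cache never forgets" concern.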

It doesn't look like your branch has been pushed btw, there's only one commit and that doesn't yet touch the cache itself.

Jul 01 '22 09:07 folbricht

Oh, btw as for implementation, a simpler (but perhaps less flexible) way to achieve a prefetch would be a new element that we could put in front of the cache. Imagine we don't change anything about the cache itself, but we had a "cron" element that could be configured to issue a number of queries on a schedule. So you could configure a set of records and have them queried every X seconds. Then if this element is put in front of a cache, it'd keep those records refreshed since they get queried through the cache.
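
To make the "cron" idea concrete, here is a small sketch. It assumes the element only needs a list of names and a function that sends a query through the cache. `resolveFunc`, `refreshAll`, and `runCron` are hypothetical; a real element would use the routedns resolver interface and run until shut down rather than for a fixed number of rounds.

```go
package main

import (
	"fmt"
	"time"
)

// resolveFunc stands in for "send this query through the cache" (hypothetical).
type resolveFunc func(name string) error

// refreshAll queries every configured name once and returns how many failed.
func refreshAll(names []string, resolve resolveFunc) int {
	failed := 0
	for _, n := range names {
		if err := resolve(n); err != nil {
			failed++
		}
	}
	return failed
}

// runCron calls refreshAll once per interval, for the given number of rounds.
func runCron(names []string, resolve resolveFunc, interval time.Duration, rounds int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < rounds; i++ {
		<-ticker.C
		refreshAll(names, resolve)
	}
}

func main() {
	count := 0
	resolve := func(name string) error {
		count++
		fmt.Println("query", name)
		return nil
	}
	// A short interval keeps the demo fast; a real setup would use seconds or minutes.
	runCron([]string{"example.com.", "example.org."}, resolve, 10*time.Millisecond, 2)
	fmt.Println("queries sent:", count)
}
```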

Jul 01 '22 09:07 folbricht

How would it know what is in the cache? Or does it track via a separate table? Seems to "double up" maybe.

Jul 01 '22 10:07 cbuijs

@folbricht Yeah, for sure. Only records that are already cached will come into play, and they will expire if not queried within a particular time-frame. I think this is good reading to get some ideas:

https://coredns.io/plugins/cache/

Jul 01 '22 10:07 cbuijs

The "cron" element wouldn't know what is in the cache; it's just a way to ensure records are queried regularly and therefore kept fresh in the cache. It's a pretty static solution since it requires pre-defining which records need to be kept updated.

Jul 01 '22 10:07 folbricht

@folbricht I personally am a fan of dynamic prefetch because of the ability to load frequently used records. You could count the times a record is read from the cache or queried from a resolver, and prefetch at a user-defined number of x queries or more. I think you could avoid doubling up by adding a prefetch-resolver element to the cache group. We already count query hits in the cache.

[groups.cloudflare-cached-with-prefetch]
type = "cache"
prefetch-resolver = "catch-prefetch"

[groups.catch-prefetch]
type = "catch-prefetch"
cache-ttl-polling-check-interval = 60 # in seconds, as TTLs are in seconds
record-query-hits-min = 10
ttl-expiry-percent = 90

Jul 01 '22 16:07 charlieporth1

@folbricht I did something similar to what you said, but as a dynamic prefetch based on request hits. Let me know if this is ok; if not, np. I'm just excited to be learning Go. Same branch as before. It is currently missing TTL-based prefetch and fetching based on user settings, but I would like to test it out to make sure there are no wacky errors in my code.

Jul 01 '22 22:07 charlieporth1

I like the simplicity of this: Cache prefetch in BIND

It is also query triggered, which I think makes sense.

Jul 07 '22 09:07 cbuijs

I looked at this a bit more and I do like the simplicity of the prefetch in BIND. There's a fundamental issue though, not just with that solution but with all of them. Since routedns is forwarding to a recursive resolver, it doesn't normally get the real TTL, but the TTL from the upstream cache.

So for example, we send a query for example.com to upstream and get a TTL of 60 in the response. Now after let's say 50s, prefetch will re-query example.com to keep the record fresh in the cache. The issue, though, is that this new response will not have a TTL of 60 again, but instead just 10. So we didn't gain anything by querying again before the TTL expired upstream. We would always have to wait for records to expire out of the upstream cache before prefetching.

It may be possible to work around this using the BIND algorithm, but extending the TTL in the local cache by the "Trigger" amount. That way the record remains valid in the local cache, even though it has expired upstream.
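
The arithmetic can be sketched like this, under a simplifying assumption: the upstream cache fetched the record at t=0, counts the TTL down, and immediately refetches from the authoritative server on expiry. `upstreamTTL` and `localTTL` are illustrative helpers only.

```go
package main

import "fmt"

// upstreamTTL models an upstream cache that fetched a record with TTL authTTL
// (must be > 0) at t=0, counts it down, and refetches right at expiry.
func upstreamTTL(authTTL, now uint32) uint32 {
	return authTTL - now%authTTL
}

// localTTL is the workaround: keep the record in the local cache for the
// upstream TTL plus the trigger amount.
func localTTL(upstream, trigger uint32) uint32 {
	return upstream + trigger
}

func main() {
	const authTTL, trigger = 60, 10

	// Prefetching 50s in only yields the 10s the upstream cache has left.
	fmt.Println("TTL at t=0: ", upstreamTTL(authTTL, 0))
	fmt.Println("TTL at t=50:", upstreamTTL(authTTL, 50))

	// With the extension, the local record lives 70s, so its remaining TTL
	// drops below the trigger only after t=60, once upstream has expired.
	local := localTTL(upstreamTTL(authTTL, 0), trigger)
	fmt.Println("local lifetime:", local)
	fmt.Println("TTL at t=61:", upstreamTTL(authTTL, 61))
}
```

In this model the prefetch at t=61 gets back a nearly full TTL, whereas the prefetch at t=50 only gets the 10 seconds remaining upstream.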

Aug 07 '22 11:08 folbricht

I implemented a draft feature on the prefetch-3 branch. This functions very similar to what is described in https://kb.isc.org/docs/aa-01122, without any workarounds for upstream caches at this point. It'd be good to see how this works as-is first.

It's simple enough to be integrated into the cache itself and is configured like so:

[groups.cloudflare-cached]
type = "cache"
resolvers = ["cloudflare-dot"]
cache-prefetch-trigger = 10   # Prefetch when the TTL has fallen below this value
cache-prefetch-eligible = 20  # Only prefetch records if their original TTL is above this
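
For readers following along, the decision those two options describe could be sketched as below. This is a reading of the config comments above, not a copy of the code on the prefetch-3 branch; `wantsPrefetch` and its parameters are made-up names.

```go
package main

import "fmt"

// wantsPrefetch mirrors the two config comments: the record's original TTL
// must be above eligible, and its remaining TTL must have fallen below trigger.
func wantsPrefetch(originalTTL, remainingTTL, trigger, eligible uint32) bool {
	if originalTTL <= eligible {
		return false // short-lived records are never prefetched
	}
	return remainingTTL < trigger
}

func main() {
	const trigger, eligible = 10, 20
	fmt.Println(wantsPrefetch(60, 30, trigger, eligible)) // still plenty of TTL left
	fmt.Println(wantsPrefetch(60, 9, trigger, eligible))  // below the trigger
	fmt.Println(wantsPrefetch(15, 5, trigger, eligible))  // original TTL too short
}
```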

Aug 07 '22 13:08 folbricht

So far so good, no real issues other than that the logging is not very descriptive.

It would be nice if only records that actually have cache hits were prefetched, and if that were configurable. Something like this:

cache-prefetch-trigger = 10   # Prefetch when the TTL has fallen below this value
cache-prefetch-eligible = 20  # Only prefetch records if their original TTL is above this
cache-prefetch-cache-hits = 5 # Only prefetch records if it was "cache hit" at least 5 times

Default would be all records in cache. With this, cache becomes less static and probably also smaller.

Aug 09 '22 12:08 cbuijs

Something weird going on when using the Prefetch feature.

After the first prefetch (and consecutive ones), it looks like routedns is returning different or empty answers and not the prefetched one. I cannot replicate it easily, but pages in web browsers start to fail to load, and the problem goes away when I disable prefetching. It seems to be sporadic and not for every prefetch/query.

Will try to debug it more when in the home-office where I have better facilities to check it out.

Aug 11 '22 11:08 cbuijs

Weird, if there's an issue that suggests there could be a bug in the existing cache. All it does is send another request when the TTL of the existing cached record falls below the trigger time. Were you able to reproduce it?

As for logging, there's only one change, it prints when it prefetches a record, but since it re-sends the same query (with the same client info), all the following logging lines will look like it went through the cache. It could be confusing, but not sure how to change that. I could perhaps change the client-ip in the prefetch requests or so.

Aug 16 '22 08:08 folbricht

Two observations:

When querying, I assumed the TTL in the answer would not drop below cache-prefetch-trigger, or maybe just 1 second under. But it seems to go all the way back to zero before it prefetches (prefetch log entry appears), which is the opposite of what we want and just the same as without prefetch.

It seems (I cannot test really well) that it actually gets funky when the original TTL is below cache-prefetch-eligible: the prefetch still happens and an empty or error response is sent. I am travelling, so I cannot see things very well and lack tools. Will look into this in more detail when I am home.

Aug 16 '22 10:08 cbuijs

It's because of a wrong var name. Line 126 of cache.go is:

                        if min, ok := minTTL(a); ok && min < r.CacheOptions.PrefetchTrigger {

It should be using this var PrefetchEligible uint32 @folbricht

Aug 17 '22 14:08 charlieporth1

@folbricht I made the change I suggested. @cbuijs could you test this to make sure? @folbricht if we are ready to PR this, let me know by opening a PR and assigning me as reviewer.

Aug 21 '22 07:08 charlieporth1

Hmm, I think that line is correct. The eligibility check happens prior to that

Aug 21 '22 07:08 folbricht

Still doesn't work.

Just thinking out loud: could it be that because the TTL expires (as it counts down to zero), the record is purged, and the prefetched one is purged along with it? It seems to be sporadic and not easy to test. But when prefetching is switched on, I experience problems with web pages half-loading etc. within minutes. When switched off, all good.

I am trying to debug some of these web pages to see what is not loading and for which domains, but somehow when I do a dig it always works and I get seemingly correct answers.

I did notice once that I got a response with a TTL of zero, which was unexpected. It might be that dig allows it, but the DNS client or browser might not. Cannot replicate it. It was for the domain occ-0-6144-769.1.nflxso.net.

Aug 22 '22 06:08 cbuijs

Interesting, I need to do more testing and add some logs to help perhaps.

Aug 22 '22 07:08 folbricht

@folbricht could it be the gc garbage collector?

Aug 22 '22 13:08 charlieporth1

I took another look and added an extra check to avoid caching prefetched records that have an even lower TTL than what's in the cache already. Perhaps this will prevent the 0 TTL records you saw. Other than that, nothing jumps out as being obviously wrong. It's a fairly simple implementation. We will, however, always have an issue with upstream caches, regardless of implementation. If upstream just counts down the TTL, then prefetching won't do anything. It'd only really help if a query close to expiry makes upstream return a record with a higher TTL.

This is on the prefetch-3 branch btw, since we have multiple implementations
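
The extra check is roughly of this shape. This is a sketch of the idea only, assuming TTLs are compared as plain seconds; `lowestTTL` and `shouldStore` are hypothetical helpers, not the functions in cache.go.

```go
package main

import "fmt"

// lowestTTL returns the smallest TTL among the records of an answer.
// ok is false when there are no records to look at.
func lowestTTL(ttls []uint32) (lowest uint32, ok bool) {
	if len(ttls) == 0 {
		return 0, false
	}
	lowest = ttls[0]
	for _, t := range ttls[1:] {
		if t < lowest {
			lowest = t
		}
	}
	return lowest, true
}

// shouldStore reports whether a prefetched answer should replace the cached
// one: only if its lowest TTL is not below what the cached record has left.
func shouldStore(cachedRemaining uint32, prefetchedTTLs []uint32) bool {
	lowest, ok := lowestTTL(prefetchedTTLs)
	if !ok {
		return false
	}
	return lowest >= cachedRemaining
}

func main() {
	fmt.Println(shouldStore(8, []uint32{60, 300})) // fresh answer, worth storing
	fmt.Println(shouldStore(8, []uint32{0, 300}))  // would shorten the cached TTL
	fmt.Println(shouldStore(8, nil))               // nothing to store
}
```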

Aug 24 '22 06:08 folbricht

Seems to work much better now, no issues so far. The remark about upstream caches makes sense. I put a ttl-modifier with min-ttl after the cache to make sure the minimum is always met; that might actually have done the trick anyway.

Maybe an idea to set the min-ttl to cache-prefetch-eligible when the original TTL is lower anyway? Maybe a bit too flaky/hacky. The way I did it above may already achieve the same and leaves more flexibility/choice. Just a thought.

Aug 24 '22 09:08 cbuijs

@folbricht I'm actively using this branch and it looks like #issue-257 is affected so I'm going to do a PR of master->prefetch-3. It builds and runs just fine

Sep 03 '22 20:09 charlieporth1

Somehow that PR disappeared, so just opened a new one here: https://github.com/folbricht/routedns/pull/279 (same prefetch-3 branch).

Jan 26 '23 11:01 folbricht

Works.

Feb 14 '23 11:02 cbuijs