
Optimistic caching for client cache

WildByDesign opened this issue 1 year ago • 21 comments

Prerequisites

Platform (OS and CPU architecture)

Linux, ARMv7

Installation

GitHub releases or script from README

Setup

On a router, DHCP is handled by AdGuard Home

AdGuard Home version

v0.107.43

Action

Recently, AGH added the new feature for Custom Upstreams Cache (upstreams_cache_enabled and upstreams_cache_size).

After using that new setting for a few weeks, I realized that it was not caching requests.

Expected result

Expected custom upstream requests to be cached.

Actual result

[Screenshot: Dashboard — AGH-upstreams_cache_enabled]

From the Dashboard screenshot above, the clients circled in blue and their corresponding upstreams appear to be caching correctly. These do not have custom upstream servers specified.

The clients circled in red and their corresponding upstreams do have custom upstream servers specified along with the new settings upstreams_cache_enabled and upstreams_cache_size. From the dashboard, it appears that the custom upstream servers are taking all of their requests and none are being cached.

I enabled the Query Log and monitored those clients for some period of time and observed no requests being served from cache.

Additional information and/or screenshots

Persistent client info from adguardhome.yaml:

    - safe_search:
        enabled: true
        bing: true
        duckduckgo: true
        google: true
        pixabay: true
        yandex: true
        youtube: true
      blocked_services:
        schedule:
          time_zone: UTC
        ids:
          - roblox
      name: redacted
      ids:
        - redacted
        - redacted
        - redacted
      tags: []
      upstreams:
        - 208.67.222.123
        - 208.67.220.123
      upstreams_cache_size: 1048576
      upstreams_cache_enabled: true
      use_global_settings: false
      filtering_enabled: true
      parental_enabled: false
      safebrowsing_enabled: false
      use_global_blocked_services: true
      ignore_querylog: false
      ignore_statistics: false
    - safe_search:
        enabled: true
        bing: true
        duckduckgo: true
        google: true
        pixabay: true
        yandex: true
        youtube: false
      blocked_services:
        schedule:
          time_zone: UTC
        ids:
          - roblox
      name: redacted
      ids:
        - redacted
        - redacted
        - redacted
      tags: []
      upstreams:
        - 208.67.222.123
        - 208.67.220.123
      upstreams_cache_size: 1048576
      upstreams_cache_enabled: true
      use_global_settings: false
      filtering_enabled: true
      parental_enabled: false
      safebrowsing_enabled: false
      use_global_blocked_services: true
      ignore_querylog: false
      ignore_statistics: false

WildByDesign avatar Dec 30 '23 03:12 WildByDesign

We cannot reproduce this. These statistics could simply be caused by these clients requesting a larger quantity of diverse domains. A better way to see if the request has been answered from cache or not is through the query log, as queries that had received a cached response are marked with (served from cache).

ainar-g avatar Jan 10 '24 10:01 ainar-g

I’ve been digging into this issue more over the past few days. Checking the Query Log shows no entries with (served from cache) for these devices.

Interestingly, these were all iOS devices. So I decided to configure a Windows laptop with custom upstreams, and it did work as expected.

When I removed the custom upstreams, the iOS devices were all caching properly. When I enabled the custom upstreams again and waited 24-48 hours for things to settle, there was no caching again.

AGH does cache correctly with iOS devices as long as they don't use custom upstreams. This is a really weird issue and I'm still trying to figure it out.

The only time that AGH caches the iOS devices with custom upstreams is when I specifically use the dig command in various apps. But when using Safari and custom upstream, AGH does not cache at all.

Safari is 99% of the use case on these devices. Yet AGH does cache these devices with all Safari usage when custom upstreams are not used.

Therefore the issue is specifically iOS devices using Safari with AGH custom upstreams. This is the scenario that seems to trigger it.

WildByDesign avatar Jan 14 '24 15:01 WildByDesign

I have let the cache build up for 3-4 days now and have some interesting results. The Windows laptop using custom upstreams was not caching as much as I was hoping.

All devices without custom upstreams: 8% of requests sent upstream / 92% served from cache

Windows laptop with custom upstreams cache enabled: 96% of requests sent upstream / 4% served from cache

iOS devices with custom upstreams cache enabled: 100% of requests sent upstream / 0% served from cache

There is definitely something going on with the new custom upstreams cache feature.

WildByDesign avatar Jan 16 '24 00:01 WildByDesign

@WildByDesign I think the problem you're seeing is happening because each client using a custom upstream seems to have its own cache.

i.e. if I have 5 hosts that all use 192.168.0.1 as their custom upstream, but they're all configured as different clients, then you have to enable "custom upstream cache" 5 times, and that cache, even though it's the same upstream, doesn't seem to be shared amongst those clients. You have 5 separate caches, not 1 shared pool cache for 192.168.0.1.

That's what I've noticed anyway. I define a lot of hosts by a /32 IP and they get a custom upstream DNS server. It's the same one, 192.168.0.1, but I've noticed that if my .10 client requests something, when .11 requests it, it doesn't come from the cached lookup that .10 would have created.

Thus you're only going to see cached entries when the actual client has expired its own cache but the AdGuard server still has a cached entry.

That's at least how I understand/see the feature working. It would be good if all clients configured with a custom DNS server of 192.168.0.1 used a shared cache, but at least in my testing/experience that's not how it seems to be working.

The "fix" is to define a single Client with all IPs you want to use that upstream DNS server. I don't want to do this though as I want to see statics/requests per device, not all aggreated up into a /24 etc.

tjharman avatar Jan 18 '24 01:01 tjharman

The "fix" is to define a single Client with all IPs you want to use that upstream DNS server. I don't want to do this though as I want to see statics/requests per device, not all aggreated up into a /24 etc.

I was assuming the same, as you said: that each Client configured with a custom upstream DNS would have its own cache, and therefore I would have multiple instances of separate caches going on. I was also thinking about memory usage with regard to that.

Also, I was considering putting the two child devices together under one Client because it would be more efficient to have them share the same cache considering they shared the same upstreams. However, like you, I decided not to follow through with that because I really wanted to keep their statistics separate as well.

For testing purposes, I had only a single device / single Client configured with custom upstreams for a few days to simplify the overall testing. But unfortunately, it still wasn't caching much at all in comparison to not using custom upstreams.

When using custom upstreams, my Average processing time would always be between 13-15 ms. When I decided to test for a few days without using any custom upstreams at all, my Average processing time sits between 1-2 ms. It seems to support my issue in that the caching involved with custom upstreams is not working as it should. It's caching only a very small percentage in comparison to the caching done when not using custom upstreams.

For now, I have disabled custom upstreams and will wait until the next release to see if anything improves.

WildByDesign avatar Jan 18 '24 12:01 WildByDesign

@ainar-g I figured out the source of the bug.

The Client name had a space in it. I changed the space to a hyphen and it is caching ferociously now.

This only seems to be an issue when using custom upstreams.

WildByDesign avatar Jan 18 '24 17:01 WildByDesign

@WildByDesign Are you saying if I change the name of my kid's iPad from "Daughters iPad" to "Daughters-iPad" caching will improve?

tjharman avatar Jan 18 '24 17:01 tjharman

That’s what I just experienced. So it’s worth a try.

WildByDesign avatar Jan 18 '24 18:01 WildByDesign

Unfortunately, the hyphen theory ended up being wrong. After 24 hours, I am getting ~4% of requests served from cache on the client that I am testing recently with custom upstreams.

But if I remove the custom upstreams, I get ~90%+ requests served from cache.

I can't figure out why this new custom upstreams cache feature is only partially working.

WildByDesign avatar Jan 20 '24 00:01 WildByDesign

@schzhn, please look into this.

@WildByDesign, just to be clear, is the upstream just another popular DNS service or a custom server? Perhaps the upstream is making the results uncacheable?

ainar-g avatar Jan 22 '24 14:01 ainar-g

When I was testing it for a few weeks on child devices, I was testing it with OpenDNS Family Shield (208.67.222.123 and 208.67.220.123) as custom upstreams. Once I realized it wasn't working 100%, I moved on to testing it with my own Windows laptop and iPhone using Cloudflare DNS (1.1.1.1 and 1.0.0.1) for custom upstreams.

No custom servers. Also, no DoT or DoH. It was just plain DNS to keep the testing as simple as possible.

WildByDesign avatar Jan 22 '24 16:01 WildByDesign

@WildByDesign, just in case, OpenDNS is one of the servers with known cacheability issues. See AdguardTeam/KnowledgeBaseDNS#16.

ainar-g avatar Jan 22 '24 16:01 ainar-g

That is interesting. Although the issue was identical with Cloudflare DNS.

The part that I don't understand is that there seems to be a difference between how the custom upstream cache functions and how the main cache functions.

What I mean is, when I disable the custom upstreams, I place those same DNS servers (and only those) as the main DNS servers, and using those exact same servers it caches nearly everything. For example, right now (not using custom upstreams) approx. 4% of the DNS queries are being sent upstream and therefore approx. 96% are being served from cache.

With the custom upstreams cache, those numbers are almost reversed. The caching for the custom upstreams is working, but with a very low percentage of cache hits. So there seems to be some difference in caching even when using the same DNS providers.

WildByDesign avatar Jan 22 '24 17:01 WildByDesign

@WildByDesign, could you please collect verbose logs and send them along with your configuration file to [email protected]?

schzhn avatar Jan 23 '24 14:01 schzhn

@schzhn Yes, I can collect some verbose logs and send them to you. I am just waiting to get some time without any other devices on the network to keep the logs less noisy. I will likely have to do it at night time.

Question: Does the custom upstreams cache utilize the Optimistic caching (cache_optimistic) feature?

WildByDesign avatar Jan 24 '24 12:01 WildByDesign

@WildByDesign, custom upstream cache does not support optimistic caching.

schzhn avatar Jan 24 '24 14:01 schzhn

I have figured out what is happening. I didn't want to bother you guys with verbose logs unless I had a better understanding of the situation. So I spent a few more hours digging into it.

TL;DR:

It definitely has to do with Optimistic Caching (or lack thereof). I had been used to using Optimistic Caching for the last year or so.

OS Caching and DNS TTL Expiry:

~100% of requests sent upstream / ~0% served from cache:
AGH > iOS System Cache > Safari > DNS TTL Expired > Hits up custom upstreams. Rinse and repeat the continuous cycle.

~96% of requests sent upstream / ~4% served from cache:
AGH > Windows System Cache > Edge/Chrome > DNS TTL Expired > Hits up custom upstreams. Rinse and repeat the continuous cycle.

What was happening on iOS?

On iOS, the system cache was caching the DNS requests and therefore not reaching out to AGH until those TTLs expired. On AGH, those TTLs also had expired and therefore would have to hit up the custom upstreams again.

So basically web browsing (Safari) on iOS almost never utilizes AGH cache.

Was I able to get iOS to trigger requests from cache from AGH?

Using a dig app on iOS properly utilizes AGH cache.

The only way that I was able to trigger any kind of serving from cache by AGH on iOS Safari for normal web browsing was to literally power down the iPhones and iPads every 10 minutes. Power them back on. Every 10 minutes.

But under normal circumstances, iPhones would likely be on and running all day long and therefore never be able to utilize AGH caching. 100% of requests get sent to upstreams because those are the requests in which TTLs had expired.

What was happening on Windows?

Essentially the same thing as iOS, but it was slightly better.

Conclusion:

This is not an AGH bug. Custom upstreams cache is working as expected. However, it is nearly impossible to utilize due to system caching and DNS TTL expiry.

AdGuardHome could definitely benefit from Optimistic Cache and custom min/max TTL overrides for the new custom upstreams cache feature.
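For reference, these are the knobs I mean. If I understand the configuration correctly, the global cache already has them under the dns section (the values below are only examples), while the per-client upstream cache currently exposes only upstreams_cache_enabled and upstreams_cache_size:

    dns:
      cache_size: 4194304
      cache_optimistic: true    # serve an expired entry immediately, refresh it in the background
      cache_ttl_min: 600        # example override: ignore TTLs shorter than 10 minutes
      cache_ttl_max: 86400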

WildByDesign avatar Jan 25 '24 15:01 WildByDesign

i.e. if I have 5 hosts that all use 192.168.0.1 as their custom upstream, but they're all configured as different clients, then you have to enable "custom upstream cache" 5 times, and that cache, even though it's the same upstream, doesn't seem to be shared amongst those clients. You have 5 separate caches, not 1 shared pool cache for 192.168.0.1.

Surprised to read this. I just found most records are not cached anymore and hurried to enable the checkboxes and tested it:

- upstreams_cache_enabled — this option, carefully hidden at the very bottom, is off by default, which means upgrading the software silently disables the cache. It should have been on if the main cache is on. When on, caching starts.
- upstreams_cache_size — this option is 0 by default and can't be deleted (to force the default), but upon testing, 0 seems to enforce the default as well. So there is no need to fill in this value for hundreds of clients. See the snippet below.
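In practice, that means adding something like this to every existing client block after the upgrade to get caching back (a sketch; per my testing, 0 seems to mean "use the default size"):

      upstreams_cache_enabled: true   # off by default after upgrading
      upstreams_cache_size: 0         # 0 appears to fall back to the default size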

Now there's a new need for a concentrator to cache the same records that different clients send to the same upstreams all day. What's the point of this deglobalization, if I may ask? I can't see a reason for it now (more memory, more connections, and more time spent on tickets). That time could be invested in the client tab feature that's desperately missing: blocklists per client.

gitthangbaby avatar Feb 10 '24 02:02 gitthangbaby

I suspect they have done it like this thinking that only certain /24s (for example) will be configured to speak to a certain upstream.

They probably have not accounted for people who like to add a record for each client with different settings, where each of those clients uses an upstream cache different from the default.

Which I can understand, I think the way I use it myself is an odd use case.

My "default" is a public DNS server, because I allow by default anyone to hit my Adguard server on the TLS port. This works for me when I'm out roaming on my mobile phone, when I lookup the IP Address of my home webserver, I get returned the public IP.

However when I am at home, or on a VPN, those clients get a custom DNS server. That custom server is just my home router, which returns the internal IP of the webserver etc etc.

I know the other solution people will propose to resolve this, which is "Don't do split-horizon DNS, just do hairpin NAT instead", and yes, hairpin NAT is a solution to this, but it's ugly and I'd rather not use it.

I suspect my use case of split-horizon DNS is very non-standard though, so I sort of understand why caches for upstreams aren't combined, though of course it would make sense if they were.
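For what it's worth, the shape of that split-horizon setup in the config is roughly this (a sketch with placeholder names and IPs, and the other per-client keys left out):

    dns:
      upstream_dns:
        - 1.1.1.1                 # public resolver for roaming clients hitting the TLS port
    clients:
      persistent:
        - name: home-client       # hypothetical LAN/VPN client
          ids:
            - 192.168.0.10
          upstreams:
            - 192.168.0.1         # home router, returns the internal IP of the webserver
          upstreams_cache_enabled: true
          upstreams_cache_size: 1048576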

tjharman avatar Feb 10 '24 03:02 tjharman

I don't use DNS settings for this: either they're on the VPN contacting AdGuard nonstop, or they just ask a public DDNS (allowing a subset of records); the current VPN will discreetly forward to a reverse proxy which double-checks the records, and the firewall predicts whether they're my clients. All entry and exit points are random failover VPN IPs, so I don't see any unknown IPs/probes/attacks at all, in years. Clients never need to change DNS or URL settings in apps, no matter where they are. Inside, they can stay on our VPN, which gets translated. On exit, they still go via a VPN balancer. If they wanted to change DNS or VPN while roaming, a hidden script will quickly put them back in the correct setting. They also can't do any other VPN, Tor, proxy, DNS, DoH, or DoT bypass, and AdGuard plus the AdGuard client partially enforces that.

What I wonder: in what case do we need the address of google.com, resolved by 1.1.1.1, to be fetched again and again for each client? I assumed this is correct:

    client1 google.com 1.1.1.1 -> ip, ttl 300  fetched
    client2 google.com 1.1.1.1 -> ip, ttl 280  cached
    client3 google.com 1.1.1.1 -> ip, ttl 120  cached
    client1 google.com 1.1.1.1 -> ip, ttl 80   cached
    client2 google.com 1.1.1.1 -> ip, ttl 20   cached
    client3 google.com 1.1.1.1 -> ip, ttl 0    fetched

My scenario is simply to split a group of users into sensitive (security DNS) and normal (unfiltered DNS). Just two definitions of upstreams. If there's any hit, I want the record to be available in 1 ms (= cached) to the rest of the same group too, within the TTL. If the query result is the same, there's no point in the separation.
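Concretely, that would be just two Client definitions, something like this sketch (names made up, upstream IPs borrowed from earlier in the thread, other per-client keys omitted):

    - name: sensitive-group       # filtered / security DNS
      ids:
        - 192.168.0.10
        - 192.168.0.11
      upstreams:
        - 208.67.222.123
        - 208.67.220.123
      upstreams_cache_enabled: true
    - name: normal-group          # unfiltered DNS
      ids:
        - 192.168.0.20
        - 192.168.0.21
      upstreams:
        - 1.1.1.1
        - 1.0.0.1
      upstreams_cache_enabled: true

If the cache really is kept per Client entry, as described above, grouping the IDs like this would be the only way to let a group share it.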

gitthangbaby avatar Feb 10 '24 05:02 gitthangbaby

(more memory, connections, and time spent on tickets)

Ever since I turned this feature on, the new client cache also randomly returns empty cached responses:

    Response details
    Status: Processed
    DNS server: tls://security.cloudflare-dns.com:853 (served from cache)
    Elapsed: 0.08 ms
    Response code: NOERROR

I'm speculating whether upstreams_cache_size=0 ("default"?) caused it, so I updated the 0 values to something explicit. Which makes me wonder: is this going to allocate tons of duplicate cache memory for each client? Is upstreams_cache_size=0 a default copied from the global setting? Or does it mean unlimited? Or zero? Or a hardcoded value? I'm getting cache hits with it all of the time. For rollback reference, the last version without this feature is v0.107.41.

gitthangbaby avatar Feb 12 '24 01:02 gitthangbaby