Split cache based on ECS

Open L8X opened this issue 2 years ago • 23 comments

Prerequisites

  • [X] I have checked the Wiki and Discussions and found no answer

  • [X] I have searched other issues and found no duplicates

  • [X] I want to report a bug and not ask a question

Operating system type

Linux, Other (please mention the version in the description)

CPU architecture

AMD64

Installation

GitHub releases or script from README

Setup

On one machine

AdGuard Home version

v0.108.0-a.540+757ddb06

Description

What did you do?

I used an ECS enabled Cloudflare resolver (I use Gateway with no filters on) to allow CDNs to return DNS records more relevant to the locations of clients using my resolver, with the DNS cache enabled to speed up resolution and prevent the need to ask the upstream resolver for the DNS records every time.

Expected result

  1. Client A who is in the UK always gets UK / EU based IPs delivered by the resolver and cache

  2. Client B who is in the US always gets US / CA based IPs delivered by the resolver and cache

Actual result

  1. If Client A requests the same domain as Client B, and the DNS cache is enabled, if Client A gets answers first, the DNS cache caches the UK / EU IPs and then serves them to Client B instead of asking the resolver again and caching the returned US / CA based ones for Client B.

  2. If Client B requests the same domain as Client A, and the DNS cache is enabled, if Client B gets answers first, the DNS cache caches the US / CA IPs and then serves them to Client A instead of asking the resolver again and caching the returned UK / EU based ones for Client A.
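The behaviour reported above is what one would expect from a cache keyed only on the query name and type. The following is a minimal sketch of that effect, not AdGuard Home's actual code. `NaiveCache` and `fake_upstream` are hypothetical names, the addresses come from documentation ranges, and TTL handling is left out.

```python
from typing import Callable, Dict, List, Tuple


def fake_upstream(name: str, qtype: str, subnet: str) -> List[str]:
    """Hypothetical ECS-aware upstream: the answer depends on the client subnet."""
    answers = {
        "192.0.2.0/24": ["198.51.100.10"],    # stands in for "UK / EU" addresses
        "203.0.113.0/24": ["198.51.100.20"],  # stands in for "US / CA" addresses
    }
    return answers[subnet]


class NaiveCache:
    """Cache keyed on (name, qtype) only; the client subnet is ignored."""

    def __init__(self, upstream: Callable[[str, str, str], List[str]]) -> None:
        self._upstream = upstream
        self._entries: Dict[Tuple[str, str], List[str]] = {}

    def resolve(self, name: str, qtype: str, subnet: str) -> List[str]:
        key = (name, qtype)
        if key not in self._entries:
            # Only the first client's subnet ever reaches the upstream.
            self._entries[key] = self._upstream(name, qtype, subnet)
        return self._entries[key]


cache = NaiveCache(fake_upstream)
client_a = cache.resolve("example.com", "A", "192.0.2.0/24")    # asks first
client_b = cache.resolve("example.com", "A", "203.0.113.0/24")  # served from cache
```

In this sketch, whichever client asks first decides what every later client receives until the entry is evicted.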

Screenshots (if applicable)

diagram

Additional information

Please see the image above for an example of what happens.

L8X avatar Apr 20 '23 12:04 L8X

I cannot reproduce this with the 8.8.8.8 upstream, which seems to process ECS mostly correctly:

dig IN A 'www.google.com' +subnet='192.0.0.0/16'
…
;; ANSWER SECTION:
www.google.com.         300     IN      A       64.233.164.147
www.google.com.         300     IN      A       64.233.164.106
www.google.com.         300     IN      A       64.233.164.103
www.google.com.         300     IN      A       64.233.164.104
www.google.com.         300     IN      A       64.233.164.99
www.google.com.         300     IN      A       64.233.164.105
…
dig IN A 'www.google.com' +subnet='151.101.0.0/16'
…
;; ANSWER SECTION:
www.google.com.         300     IN      A       142.250.179.196
…

What is the “ECS enabled Cloudflare resolver”, exactly? Have you verified that it's using the provided ECS data using dig? Because 1.1.1.1 doesn't.

ainar-g avatar Apr 20 '23 13:04 ainar-g

Cloudflare Gateway. It's their enterprise focused DNS firewall solution.

Yes, I literally used dig in my tests, if you check the image I posted in the first screenshot you'll see exactly what the issue is.

ghost avatar Apr 20 '23 14:04 ghost

I believe you are not understanding the issue: the AGH DNS cache does not seem to be separating the A / AAAA records being sent to clients.

If a US client (Client B in my example) requests a domain which has ECS enabled, AGH caches the result (which has US IPs) and, for some reason, doesn't re-request the domain for a client with differently geolocated ECS data, giving it the US results instead. The same happens in reverse.

End goal would be to have the DNS Cache completely separate queries based on the geolocation of ECS data so that Client A and B always get to use the DNS cache of AGH as well as not clogging each other up with incorrectly returned IPs.

If you're saying it ALREADY does this, then the feature is simply bugged and isn't working correctly.

If it is separate, why does my resolver return results obtained with US ECS data to a UK client sending UK ECS data?

ghost avatar Apr 20 '23 14:04 ghost

This is the problem, it only happens when the AGH DNS cache feature is ENABLED.

The upstream sends differently geolocated IPs, the AGH DNS cache just isn't separating the results for clients based on their ECS data like the upstream does.

diagram

ghost avatar Apr 20 '23 15:04 ghost

The screenshot doesn't provide the information we'd need to reproduce the issue. Among other things:

  1. whether the clients themselves have sent the ECS option;
  2. which IP address AdGuard Home is detecting for these clients;
  3. what output the upstream provided (e.g. whether or not there is a valid subnet there).

There may be a bug in there, but in order to fix it we need to know the actual conditions under which it occurs. Besides these commands you can also send your configuration file and the verbose logs of AdGuard Home startup and these queries being made to [email protected]. (Please add “AdGuard Home Issue 5757” to the subject line if you do.)

ainar-g avatar Apr 20 '23 15:04 ainar-g

  1. The clients sent the subnets themselves
  2. [REDACTED] which is based in the UK, and [REDACTED] which is based in the US
  3. (See the screenshot for examples) The upstream provided a US IP to the US client and a UK IP to the UK client with the cache DISABLED, but with it enabled, it depends which client requests the domain first after the cache for it expires.

I believe the issue either lies with sending the ECS data while the cache is turned on, OR that the cache is having separation issues.

ghost avatar Apr 20 '23 15:04 ghost

Can reproduce the problem with AGH DNS cache enabled. AGH server runs on: 192.168.100.254 with ECS enabled.

image

dig IN A www.google.com +subnet=192.0.0.0/16 @192.168.100.254
;; QUESTION SECTION:
;www.google.com.                        IN      A

;; ANSWER SECTION:
www.google.com.         10      IN      A       142.251.42.228

image

dig IN A www.google.com +subnet=118.0.0.0/16 @192.168.100.254
;; QUESTION SECTION:
;www.google.com.                        IN      A

;; ANSWER SECTION:
www.google.com.         245     IN      A       142.251.42.228

LittleJake avatar Apr 22 '23 08:04 LittleJake

@ainar-g Any chance of getting this bug fixed ASAP?

ghost avatar Apr 27 '23 00:04 ghost

Once again, we cannot reproduce it with a compliant upstream.

(See the screenshot for examples) The upstream provided a US IP to the US client and a UK IP to the UK client with the cache DISABLED, but with it enabled, it depends which client requests the domain first after the cache for it expires.

The screenshot does not provide the information. The output of the query to 8.8.8.8:

dig IN A 'www.google.com' +subnet='192.0.0.0/16' @8.8.8.8
[…]
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; CLIENT-SUBNET: 192.0.0.0/16/16
;; QUESTION SECTION:
;www.google.com.                        IN      A

;; ANSWER SECTION:
www.google.com.         300     IN      A       64.233.165.105
www.google.com.         300     IN      A       64.233.165.106
www.google.com.         300     IN      A       64.233.165.147
www.google.com.         300     IN      A       64.233.165.103
www.google.com.         300     IN      A       64.233.165.104
www.google.com.         300     IN      A       64.233.165.99
[…]

Observe that there is the CLIENT-SUBNET part in the response. Now compare that to the public Cloudflare DNS:

dig IN A 'www.google.com' +subnet='192.0.0.0/16' @1.1.1.1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;www.google.com.                        IN      A

;; ANSWER SECTION:
www.google.com.         67      IN      A       142.251.1.147
www.google.com.         67      IN      A       142.251.1.99
www.google.com.         67      IN      A       142.251.1.105
www.google.com.         67      IN      A       142.251.1.103
www.google.com.         67      IN      A       142.251.1.104
www.google.com.         67      IN      A       142.251.1.106

Notice how the OPT pseudosection doesn't contain the CLIENT-SUBNET part. Without it, AdGuard Home cannot know for which subnet this response is valid.
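The comparison above can also be done mechanically. This is a minimal sketch that looks for the `CLIENT-SUBNET` line in saved `dig` output and splits it into address, source prefix length, and scope prefix length. The function name `client_subnet_from_dig` is hypothetical, and the two sample strings are abbreviated copies of the outputs shown above.

```python
import re
from typing import Optional, Tuple

# Matches dig's "; CLIENT-SUBNET: <address>/<source prefix>/<scope prefix>" line.
_CLIENT_SUBNET_RE = re.compile(
    r"^;\s*CLIENT-SUBNET:\s*(\S+)/(\d+)/(\d+)\s*$", re.MULTILINE
)


def client_subnet_from_dig(output: str) -> Optional[Tuple[str, int, int]]:
    """Return (address, source_prefix, scope_prefix), or None if ECS is absent."""
    match = _CLIENT_SUBNET_RE.search(output)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


# Abbreviated from the 8.8.8.8 output above.
GOOGLE_SAMPLE = """\
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; CLIENT-SUBNET: 192.0.0.0/16/16
;; QUESTION SECTION:
"""

# Abbreviated from the 1.1.1.1 output above.
CLOUDFLARE_SAMPLE = """\
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
"""

google_ecs = client_subnet_from_dig(GOOGLE_SAMPLE)
cloudflare_ecs = client_subnet_from_dig(CLOUDFLARE_SAMPLE)
```

A result of `None` corresponds to the second case described above, where the response carries no subnet for a cache to key on.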

ainar-g avatar Apr 27 '23 11:04 ainar-g

@ainar-g Then don't base it on the pseudosection, base it on something else.

You're not using a compliant resolver in your examples: 1.1.1.1 is not ECS enabled. You should use 9.9.9.11 instead, or Cloudflare Gateway, which, as I have said many times, is what I use.

Also, I assure you this bug exists and is a huge problem, a simple solution would be to make AGH separate the cache based on subnets themselves that clients send, rather than relying on whatever is in place now.

I don't see why you're having issues realizing the problem.

ghost avatar Apr 27 '23 11:04 ghost

Also, relating to the client subnet part of your argument, this is invalid, as even using Google DNS doesn't let the cache separate for me, so your argument is nullified based on my findings.

Just implement a new method of validation and stop relying on the ECS response. Not all resolvers even return their ECS headers to begin with, and some even overwrite them with their own, so this is UNRELIABLE, and the fundamentals of this feature MUST be changed if you're going to address this bug.

Simple solution is to just track and tag responses based on client ECS information, and separate the cache accordingly, and only return cached responses relevant to said tags.
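The proposal above, tagging cached responses with the client's ECS information, could look roughly like the following. This is a minimal sketch under stated assumptions, not a patch for AdGuard Home: `SubnetTaggedCache` and `fake_upstream` are hypothetical, the fixed /24 truncation is an arbitrary choice that only makes sense for IPv4, and TTL handling is omitted.

```python
import ipaddress
from typing import Callable, Dict, List, Tuple


def fake_upstream(name: str, qtype: str, subnet: str) -> List[str]:
    """Hypothetical ECS-aware upstream using documentation-range addresses."""
    regional = {
        "192.0.2.0/24": ["198.51.100.10"],
        "203.0.113.0/24": ["198.51.100.20"],
    }
    return regional.get(subnet, ["198.51.100.99"])


class SubnetTaggedCache:
    """Cache keyed on (name, qtype, client subnet) taken from the request."""

    def __init__(
        self,
        upstream: Callable[[str, str, str], List[str]],
        prefix_len: int = 24,  # assumed truncation length
    ) -> None:
        self._upstream = upstream
        self._prefix_len = prefix_len
        self._entries: Dict[Tuple[str, str, str], List[str]] = {}
        self.upstream_calls = 0

    def _tag(self, client_ip: str) -> str:
        # Truncate the address so that neighbouring clients share one entry.
        net = ipaddress.ip_network(f"{client_ip}/{self._prefix_len}", strict=False)
        return str(net)

    def resolve(self, name: str, qtype: str, client_ip: str) -> List[str]:
        tag = self._tag(client_ip)
        key = (name.lower(), qtype, tag)
        if key not in self._entries:
            self.upstream_calls += 1
            self._entries[key] = self._upstream(name, qtype, tag)
        return self._entries[key]


cache = SubnetTaggedCache(fake_upstream)
a1 = cache.resolve("example.com", "A", "192.0.2.7")
b1 = cache.resolve("example.com", "A", "203.0.113.9")
a2 = cache.resolve("example.com", "A", "192.0.2.200")  # same /24 as a1
```

The trade-off in this design is that every distinct subnet costs its own cache entry and its own upstream query, even for names whose answers do not vary by location.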

ghost avatar Apr 27 '23 12:04 ghost

Also, I just tested this with Google: it doesn't send me UK IPs when I send a UK subnet for a domain on an Akamai-powered CDN. Just as I explained in my last issue, no mainstream ECS resolver implements the standard correctly; they go off the remote_addr of the requester 🤣

I suggest you sign yourself up to Cloudflare's Zero Trust feature and start using Cloudflare Gateway (ensure ECS is enabled) and you'll soon realize they implement it correctly.

ghost avatar Apr 27 '23 12:04 ghost

@ainar-g Any updates on this?

ghost avatar May 07 '23 12:05 ghost

I have the same issue here, v0.108.0-b.34 with Cloudflare Gateway DOH (ECS enabled) as upstream

This problem can be reproduced by dig txt whoami.ds.akahelp.net +subnet= with a different subnet and cache on/off.

Here are screenshots with cache enabled image image image

WoadZS avatar May 09 '23 01:05 WoadZS

@ainar-g bump

ghost avatar Jul 19 '23 15:07 ghost

This issue still persists.

ghost avatar Jul 19 '23 15:07 ghost

There is no update, and this is essentially a feature request for a geo-split cache that takes the ECS into account. Retitling and retagging as such.

ainar-g avatar Jul 20 '23 11:07 ainar-g

Is there any new development?

zijiren233 avatar Apr 27 '24 06:04 zijiren233

I've encountered the same issue. I have one VPS in the United States and one in Europe. Additionally, there's another VPS in Singapore with ADH installed. The upstream DNS is set to 223.5.5.5, with ECS enabled (223.5.5.5 fully supports ECS). When www.google.com is resolved from the VPS in the United States first, the resolved IP belongs to Google's US IP range (call it A), and the result is cached by ADH. If the VPS in Europe then resolves www.google.com, the result is also A. However, if the VPS in Europe resolves www.google.com first, the IP obtained is from Google's European IP range (call it B). Then, when the VPS in the United States resolves www.google.com, the obtained IP is also B.

(Translated by ChatGPT)

QQ screenshot 20240509224311 QQ screenshot 20240509224337

I also tested dig txt whoami.ds.akahelp.net:

QQ screenshot 20240509225245 QQ screenshot 20240509225346 QQ screenshot 20240509225413

Loukky avatar May 09 '24 14:05 Loukky

@ainar-g Any new information?

Loukky avatar Jun 18 '24 13:06 Loukky

There are no updates. The original issue still has zero 👍 reactions, and the changes that would be required here are quite substantial.

ainar-g avatar Jun 18 '24 17:06 ainar-g

Why does it need a 👍 reaction? This bug does exist. We have described it very clearly. You just need to follow the description to reproduce it.

Loukky avatar Jun 18 '24 17:06 Loukky

It has been explained before. This issue isn't about a bug, and the ECS cache works based on the response, as it should.

It seems like you have some kind of ECS misconfiguration. This thread provides a lot of tools to diagnose it, in particular inspecting the upstream response in the verbose log, and you can ask people to help you in Discussions.
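The statement above that the ECS cache works based on the response can be illustrated with a small model. This is only a sketch of the principle as described in this thread, and AdGuard Home's real implementation may well differ. `ResponseScopedCache` is a hypothetical name, the `scope` argument stands for the subnet the upstream echoes back in its response (or `None` when the response has no CLIENT-SUBNET option), and TTLs are ignored.

```python
import ipaddress
from typing import Dict, List, Optional, Tuple, Union

_Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


class ResponseScopedCache:
    """Stores answers under the subnet reported by the upstream response."""

    def __init__(self) -> None:
        self._scoped: Dict[Tuple[str, str], List[Tuple[_Network, List[str]]]] = {}
        self._general: Dict[Tuple[str, str], List[str]] = {}

    def store(
        self, name: str, qtype: str, answer: List[str], scope: Optional[str]
    ) -> None:
        key = (name, qtype)
        if scope is None:
            # No subnet in the response: the answer cannot be tied to any
            # client group, so it lands in a cache shared by everyone.
            self._general[key] = answer
        else:
            net = ipaddress.ip_network(scope, strict=False)
            self._scoped.setdefault(key, []).append((net, answer))

    def lookup(self, name: str, qtype: str, client_ip: str) -> Optional[List[str]]:
        key = (name, qtype)
        addr = ipaddress.ip_address(client_ip)
        for net, answer in self._scoped.get(key, []):
            if addr in net:
                return answer
        return self._general.get(key)


cache = ResponseScopedCache()

# Upstream echoed a subnet: the entry is only served to clients inside it.
cache.store("example.com", "A", ["198.51.100.10"], "192.0.2.0/24")
in_scope = cache.lookup("example.com", "A", "192.0.2.7")
out_of_scope = cache.lookup("example.com", "A", "203.0.113.9")

# Upstream omitted the subnet: every client shares the same entry.
cache.store("example.org", "A", ["198.51.100.20"], None)
shared_a = cache.lookup("example.org", "A", "192.0.2.7")
shared_b = cache.lookup("example.org", "A", "203.0.113.9")
```

Under this model, clients in different regions receive the same cached answer whenever the upstream response carries no subnet, which is why inspecting the upstream response is the suggested first diagnostic step.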

ainar-g avatar Jun 18 '24 17:06 ainar-g