Upstream latency increase since v0.107.55
Prerequisites
- [X] I have checked the Wiki and Discussions and found no answer
- [X] I have searched other issues and found no duplicates
- [X] I want to report a bug and not ask a question or ask for help
- [X] I have set up AdGuard Home correctly and configured clients to use it. (Use the Discussions for help with installing and configuring clients.)
Platform (OS and CPU architecture)
Linux, ARM64
Installation
GitHub releases or script from README
Setup
On one machine
AdGuard Home version
0.107.55
Action
N/A
Expected result
Upstream latency should be in line with my ISP WAN connection (FTTP/fibre), approx 12ms.
Actual result
Since upgrading to 0.107.55, the upstream latency has drifted upward by a fair amount. Ironically, I am aware that the release is intended to address latency issues (e.g. #6818).
On previous versions (e.g. 0.107.54) my upstream latency across 24 hours was approximately 12-14 ms for most upstreams. After upgrading to 0.107.55, the upstreams sit between 17 ms and 32 ms. I have verified this multiple times, on the same home LAN with the same clients and (roughly) the same type of traffic. I have tested on my primary LAN DNS server (bare metal, Radxa Rock5B, Debian Bookworm aarch64), and also confirmed the same issue on my backup instance (Alpine Linux LXC on Proxmox, AMD Ryzen x86_64 host).
Downgrading to 0.107.54 immediately fixes the issue, and upstream latency returns to around 12ms. I have pprof running, and can switch versions and/or collect stats and logs as required. Many thanks in advance.
Additional information and/or screenshots
Here is a redacted copy of my AdGuardHome.yaml (extension changed to .txt for GitHub): AdGuardHome.txt
Thank you for the report - I will pass it on to the devs. Can you confirm whether you're using caching, please? I've tried parsing your uploaded config (thank you) but can't spot it myself.
The DNS cache, you mean?
```yaml
cache_size: 10000000
cache_ttl_min: 0
cache_ttl_max: 0
cache_optimistic: true
```
I have a similar issue, but it isn't just related to the latest version; I believe it occurred a couple of versions earlier. I have three addresses to which parallel requests are made:
tls://dns.google tls://1dot1dot1dot1.cloudflare-dns.com:853 tls://one.one.one.one:853
Bootstrap DNS addresses are:
1.1.1.1 1.0.0.1 2606:4700:4700::1111 2606:4700:4700::1001
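In AdGuardHome.yaml terms, this is roughly what that setup looks like (a minimal sketch only; I'm assuming upstream_mode: parallel is the setting behind the parallel requests, and the exact key names may differ between versions):

```yaml
# Minimal sketch of the relevant dns section, not my full config.
dns:
  upstream_dns:
    - tls://dns.google
    - tls://1dot1dot1dot1.cloudflare-dns.com:853
    - tls://one.one.one.one:853
  bootstrap_dns:
    - 1.1.1.1
    - 1.0.0.1
    - 2606:4700:4700::1111
    - 2606:4700:4700::1001
  upstream_mode: parallel  # assumption: this is what makes AGH query all upstreams at once
```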
With the above, the average upstream response time is 22 ms for Google and 23/24 ms for Cloudflare. Removing the Google address immediately increases the response time to 32 ms.
Also, the average upstream response times should be displayed from the quickest at the top to the slowest at the bottom. Currently it's the other way round, which makes the slowest upstream look like the quickest.
I'm seeing the same thing here too; after going back to v0.108.0-b.59, everything is running normally again. It seems all the 5x versions cause the latency increase, while version 49 is very good.
So, here's a screenshot of the response time after the Google DNS server was removed.
I'll upload another screenshot once I run the Google DNS with the others for some time.
What I don't understand is why adding or removing the Google DNS address changes (increases) the response time of the other two. I would expect it to stay the same, unless the response time is not calculated the way I think it is.
I am using version 0.107.57, and I commented here that it seems to happen easily when using h3:// but rarely when using https:// (HTTP/2).
Right now I'm using Google DoH with h3://; it's been 3+ hours and everything seems to work just fine.
But I highly suspect it has something to do with how AGH handles query logs (General settings > Logs configuration).
I disabled the query logs, leaving Statistics configuration enabled, and disabled everything there except for the options I really needed.
image 1
It seems that after disabling query logging there are no ping spikes; when query logging is enabled, it's easy to get 3000+ ms ping spikes.
In my case, AGH is running on a mini PC with an Intel N100 (so it's not a weak ARM CPU like most routers), and I edited the YAML to write the logs to memory instead of disk. I also pointed it at /dev/shm just to make sure we're not using the disk for the logs.
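To be concrete, these are roughly the edits I had tried (a sketch only; I'm assuming file_enabled: false keeps the query log purely in memory and that dir_path is where the on-disk log would live, so check the keys against your own config version):

```yaml
# Sketch of my querylog edits, not a definitive reference.
querylog:
  enabled: true          # query logging still on at this point (later turned off)
  file_enabled: false    # assumption: keeps the log in memory instead of on disk
  size_memory: 1000      # in-memory buffer size, in entries
  dir_path: /dev/shm     # assumption: if file logging is used, put it on tmpfs
  interval: 24h          # shorter retention to keep the log small
```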
But no luck, at least until I disabled the Logs configuration feature entirely.
It's probably something to do with how AGH rotates the logs?
I'm hoping anyone who encounters the same issue can check along with me so the devs can get more data. Maybe it isn't a network issue or a problem with the upstream servers, but rather how AGH handles logs? (😐 Handling logs does need some compute, especially processing rapid-fire DNS logs, so it's not impossible.)
I'm testing this and will report how it goes.
I want to follow up on my previous comment (https://github.com/AdguardTeam/AdGuardHome/issues/7515#issuecomment-3421669150): disabling the query logs didn't actually solve anything.
I highly doubt it's a server-side issue, since with h3:// it happens with mainstream providers (Google, Cloudflare, etc.), whose DNS implementations have probably been stable for years.
Another thing I suspected months ago was how Go handles QUIC and HTTP/3, but there are other projects using h3 (e.g. dnscrypt, routedns), and the last time I used them h3 worked fine for days. The ping spikes happened only when I used AGH.
Both routedns and dnscrypt are written in Go, so again, I doubt my own suspicion about how Go handles QUIC and h3. It's probably something else messing with AGH's performance internally.