DNS-over-QUIC upstream servers no longer work on v0.107.41

Open Freekers opened this issue 7 months ago • 29 comments

Prerequisites

Platform (OS and CPU architecture)

Custom (please mention in the description)

Installation

Docker

Setup

Other (please mention in the description)

AdGuard Home version

v0.107.41

Action

Click 'Test Upstreams'

Expected result

Confirmation that the upstream server is working correctly.

Actual result

Server "quic://XXXXX.dns.nextdns.io": could not be used, please check that you've written it correctly

Additional information and/or screenshots

I'm running two AGH instances. After updating both instances from v0.107.40 to v0.107.41, one instance works fine, but on the other one upstream DNS-over-QUIC servers no longer work. The error displayed is: Server "quic://XXXXX.dns.nextdns.io": could not be used, please check that you've written it correctly. I also tried using AdGuard's QUIC server, but the issue is the same.

Both instances run on Docker. However, the host OS is different. The working instance runs on Ubuntu Server 22.04. The broken instance runs on a Synology NAS (x86_64 GNU/Linux synology_apollolake_918+). I've already deleted the container and re-pulled the image, but the problem is still there. This DNS-over-QUIC upstream server was working on both instances on v0.107.40.

I enabled debug logging and found the following, which could be related:

2023/11/14 15:26:31.325506 1#55 [debug] bootstrap: dialing 45.11.106.155:853 (1/4)
2023/11/14 15:26:31.326218 1#55 [debug] bootstrap: connection to 45.11.106.155:853 succeeded in 114.467µs
2023/11/14 15:26:31.328239 1#55 [debug] dnsproxy: upstream quic://XXXXX.dns.nextdns.io:853 failed to exchange ;HsGH2CJwy_JPd3x0T.multi.surbl.org.	IN	 A in 138.551848ms: opening quic connection to quic://XXXXX.dns.nextdns.io:853: INTERNAL_ERROR (local): write udp [::]:44035->45.11.106.155:853: sendmsg: invalid argument
2023/11/14 15:26:31.328496 1#55 [debug] proxy: replying from upstream: opening quic connection to quic://XXXXXX.dns.nextdns.io:853: INTERNAL_ERROR (local): write udp [::]:44035->45.11.106.155:853: sendmsg: invalid argument
2023/11/14 15:26:31.328663 1#55 [debug] dnsforward: finished processing upstream

This issue seems related to: https://github.com/AdguardTeam/AdGuardHome/issues/6301 and https://github.com/AdguardTeam/AdGuardHome/issues/6335 which was resolved in v0.107.40

Freekers avatar Nov 14 '23 14:11 Freekers

Thanks for the report and the logs. I suspect that the kernel version may be the reason for the difference. Can you show the output of uname -a on both machines?

Also, does adding QUIC_GO_DISABLE_ECN=true on the machine with the issue fix it?

ainar-g avatar Nov 14 '23 16:11 ainar-g

> Thanks for the report and the logs. I suspect that the kernel version may be the reason for the difference. Can you show the output of uname -a on both machines?
>
> Also, does adding QUIC_GO_DISABLE_ECN=true on the machine with the issue fix it?

Output of uname -a on the working machine:

Linux raptor 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Output of uname -a on the broken machine:

Linux TurboPolyp 4.4.180+ #42962 SMP Mon May 29 14:38:23 CST 2023 x86_64 GNU/Linux synology_apollolake_918+

Where do I add QUIC_GO_DISABLE_ECN=true? Is this an environment variable? If so, I suppose I would need to set it inside the Docker container, correct?

Thanks

Freekers avatar Nov 14 '23 23:11 Freekers

Thanks for the info.

> Where do I add QUIC_GO_DISABLE_ECN=true? Is this an environmental variable? If so, I suppose I would need to enter this inside the Docker container, correct?

Yes, this should be set in the container's environment. The AdGuardHome binary should be able to observe the value of that environment variable.

ainar-g avatar Nov 15 '23 10:11 ainar-g

> Thanks for the info.
>
> > Where do I add QUIC_GO_DISABLE_ECN=true? Is this an environmental variable? If so, I suppose I would need to enter this inside the Docker container, correct?
>
> Yes, this should be set in the container's environment. The AdGuardHome binary should be able to observe the value of that environment variable.

I have set the environment variable inside the container as follows:

docker exec -it adguard sh
/opt/adguardhome/work # echo $QUIC_GO_DISABLE_ECN

/opt/adguardhome/work # export QUIC_GO_DISABLE_ECN=true
/opt/adguardhome/work # echo $QUIC_GO_DISABLE_ECN
true

But sadly it does not fix the issue (same error message). I also tried using the edge image, same issue.

Freekers avatar Nov 15 '23 12:11 Freekers

If you're running AGH with something like docker run, you should use the -e/--env flag.

ainar-g avatar Nov 15 '23 13:11 ainar-g
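
The earlier attempt failed because export inside a docker exec shell only changes that new shell's environment; the AdGuardHome process that is already running keeps the environment it was started with. A minimal Go sketch of how a process might read such a switch once at startup is shown here. The helper name ecnDisabledFromEnv is hypothetical, and parsing with strconv.ParseBool is an assumption about how the library treats the value, not something confirmed in this thread.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// ecnDisabledFromEnv is a hypothetical helper. It reports whether the
// QUIC_GO_DISABLE_ECN variable, as returned by lookup, parses as true.
// Unset or unparsable values are treated as "ECN stays enabled".
func ecnDisabledFromEnv(lookup func(string) string) bool {
	disabled, err := strconv.ParseBool(lookup("QUIC_GO_DISABLE_ECN"))
	return err == nil && disabled
}

func main() {
	// os.Getenv sees only the environment this process was started with,
	// so the value must be supplied at container start (-e/--env or compose).
	fmt.Println("ECN disabled:", ecnDisabledFromEnv(os.Getenv))
}
```

Passing the lookup function in makes the check easy to exercise without touching the real environment.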

> If you're running AGH with something like docker run, you should use the -e/--env flag.

Oops, my bad, you're right. I've now set the environment variable in my docker-compose file as follows:

services:
  adguard:
    image: adguard/adguardhome:latest
    restart: always
    container_name: adguard
    network_mode: "host"
    environment:
      - TZ=Europe/Amsterdam
      - QUIC_GO_DISABLE_ECN=true
    volumes:
      - /volume1/docker/adguard/work:/opt/adguardhome/work
      - /volume1/docker/adguard/conf:/opt/adguardhome/conf

I can confirm that the issue is now resolved. The QUIC upstream DNS server now works again, thank you.

What does this setting QUIC_GO_DISABLE_ECN=true do exactly?

Thanks

Freekers avatar Nov 15 '23 13:11 Freekers

> What does this setting QUIC_GO_DISABLE_ECN=true do exactly?

It disables additional congestion-control features added to quic-go in v0.39.0.

It's good that the workaround works, but it's still weird, as AGH v0.107.41 uses quic-go v0.39.2, which should have fixed the sendmsg: invalid argument issue. Perhaps Synology has a weird kernel build.

@marten-seemann, is there any way we could debug this further?

ainar-g avatar Nov 15 '23 14:11 ainar-g

> > What does this setting QUIC_GO_DISABLE_ECN=true do exactly?
>
> It disables additional congestion-control features added to quic-go in v0.39.0.

@ainar-g It turns out @marten-seemann only patched this for FreeBSD, AMD64 (aka x86_64) Environment. https://github.com/AdguardTeam/AdGuardHome/issues/6301

Users of Asuswrt-Merlin routers are also experiencing this issue: https://www.snbforums.com/threads/adguardhome-new-releases-2023.85191/post-875540. As a temporary fix, I plan to add the QUIC_GO_DISABLE_ECN=true option to the Env variable PREARGS until an adequate fix has been provided.

Here is an example of the environment of Asuswrt-Merlin Routers:

ASUSWRT-Merlin RT-AX88U_PRO 3004.388.4_0 Mon Aug 21 19:34:19 UTC 2023
admin@RT-AX88U_Pro-29B8:/tmp/home/root# uname -a
Linux RT-AX88U_Pro-29B8 4.19.183 #1 SMP PREEMPT Mon Aug 21 15:34:46 EDT 2023 aarch64 ASUSWRT-Merlin

HTH

jumpsmm7 avatar Nov 15 '23 23:11 jumpsmm7

How would the cmsg look on other platforms? Would be good to fix this in quic-go, the env is just an escape hatch and shouldn’t be a permanent solution.

marten-seemann avatar Nov 16 '23 09:11 marten-seemann

@jumpsmm7, you're pointing to the FreeBSD issue, but all Linux platforms should have been fixed in #6335. See quic-go/quic-go#4127.

As for the control message, I'm leaning towards this being a change in the Linux kernel somewhere around v5, since so far this seems to affect only those with kernels in the v4.x branch, but I don't have any solid proof just yet.

ainar-g avatar Nov 16 '23 11:11 ainar-g

@marten-seemann, another theory I've had is that the issue could have something to do with how quic-go sets IP_TOS/IPV6_TCLASS depending on whether or not an IP address is convertible to IPv4 rather than checking for the socket family. It could also be dependent on sysctl net.ipv6.bindv6only, although I cannot reproduce any errors either way on my Ubuntu with v5.15.0 kernel. I've seen some C code that just sets both, too, but I'm not sure if that's the correct solution.

ainar-g avatar Nov 16 '23 12:11 ainar-g

I cannot create an issue in the GitHub mobile client; it keeps redirecting me to Discord.

I'm facing the same problem on my old ARMv7 Android device (uname -a: Linux 3.4.39 armv7).

FNsi avatar Nov 18 '23 01:11 FNsi

So this might just be due to ancient kernels. Is anyone aware of a way to detect support for these cmsgs, ideally without parsing kernel version numbers?

marten-seemann avatar Nov 18 '23 05:11 marten-seemann

@marten-seemann, my guess would be that getting this EINVAL is the way. Perhaps, the code should send the message with the ECN data, check if the error is EINVAL, and, if it is, retry sending without the ECN data. If that second send succeeds, ECN is likely not supported in the kernel.

Also, as a related question, are there any plans to allow library clients to disable ECN through the Config structure? Using setenv to configure a library isn't exactly ideal, and there may be some clients who want to disable the feature regardless of the support.

ainar-g avatar Nov 20 '23 10:11 ainar-g

> @marten-seemann, my guess would be that getting this EINVAL is the way. Perhaps, the code should send the message with the ECN data, check if the error is EINVAL, and, if it is, retry sending without the ECN data. If that second send succeeds, ECN is likely not supported in the kernel.

We already have similar logic for GSO: https://github.com/quic-go/quic-go/blob/3bf2e19d0dc617135ec9d6f3c5191740a27097c7/send_conn.go#L62-L68. I assume we could build something similar for EINVAL, but it's a bit unfortunate to match on such an unspecific error code.

> Also, as a related question, are there any plans to allow library clients to disable ECN through the Config structure? Using setenv to configure a library isn't exactly ideal, and there may be some clients who want to disable the feature regardless of the support.

What's the use case for that?

marten-seemann avatar Nov 20 '23 11:11 marten-seemann

> What's the use case for that?

Situations where the developers know that the software is likely to be run on older/modified kernels without proper ECN support.

ainar-g avatar Nov 20 '23 12:11 ainar-g

I can confirm that HTTP/3 doesn't work in Synology Docker under v0.107.43. Setting the env variable as advised above resolved the issue.

$ uname -a
Linux DS920 4.4.59+ #25556 SMP PREEMPT Tue Mar 21 22:25:44 CST 2023 x86_64 GNU/Linux synology_geminilake_920+
2023/12/11 20:41:36.202227 1#47 [debug] dnsproxy: https://cloudflare-dns.com:443/dns-query: response received over udp: "requesting https://cloudflare-dns.com:443/dns-query: Get_0rtt \"https://cloudflare-dns.com:443/dns-query?dns=AAABAAABAAAAAAAABHRlc3QAAAEAAQ\": INTERNAL_ERROR (local): write udp [::]:40657->104.16.248.249:443: sendmsg: invalid argument"

ardel avatar Dec 11 '23 20:12 ardel

I'd need some more hints debugging this. It's really hard to make any fixes if I can't reproduce this locally.

I already installed Ubuntu 18.04 in a VM (4.15.0-213-generic on aarch64), but everything works fine here.

marten-seemann avatar Dec 13 '23 14:12 marten-seemann

Can someone try reproducing it in Ubuntu 16.04 that has 4.4 kernel? https://wiki.ubuntu.com/XenialXerus/ReleaseNotes

ardel avatar Dec 13 '23 22:12 ardel

I guess the problem affects kernels earlier than 4.14. Mine is 3.4; others in this issue are 4.4 and 4.5.

FNsi avatar Dec 14 '23 03:12 FNsi

I'm unable to install Ubuntu 16.04 due to some weird virtualization errors, both in UTM and in Parallels. The earliest version I can install is 18.04.

marten-seemann avatar Dec 14 '23 03:12 marten-seemann

I managed to run Ubuntu 14.04 and the "sendmsg: invalid argument" reproduces there.

It looks like the change we introduced in response to https://github.com/AdguardTeam/AdGuardHome/issues/6335 is causing the issue: If I use a 4 byte value for the IP_TOS cmsg, it works on old kernels (despite man 7 ip claiming that IP_TOS is a byte and not a uint32).

Re-reading #6335 I'm not sure anymore why we reduced the cmsg value to 1, other than to be more conformant with what the man page says. Newer versions of Linux seem to accept both values. I'm planning to revert the change (https://github.com/quic-go/quic-go/pull/4127), unless someone has a better idea how to fix this problem.

marten-seemann avatar Dec 19 '23 04:12 marten-seemann

@marten-seemann, what about this comment? The original reason wasn't just to follow the manual but also because the size was causing reproducible issues that went away after the change to 1.

ainar-g avatar Dec 19 '23 12:12 ainar-g

I wasn't able to reproduce this failure. Maybe it only occurs on MIPS? Frankly, properly supporting amd64 and arm64 on all kernel versions is more important than other architectures, and we could disable ECN on mips altogether.

marten-seemann avatar Dec 19 '23 13:12 marten-seemann

It's definitely not MIPS-only, because I ran the test on a machine running AMD64, and so did a lot of people for whom size 1 fixed that issue.

Considering that 4 is the size of an IPV6_TCLASS message, are you sure that the issue isn't that an IPv6 socket is receiving mapped IPv4 queries and thus there is a protocol mismatch, as I've described previously? Judging by some questions (like this and this), it was one of the things that had changed between 16.04 and 18.04.

ainar-g avatar Dec 19 '23 16:12 ainar-g

Yes. Please try out 14.04, size 1 fails there reliably, whereas size 4 works reliably. Size 1 seems to continue causing problems, see https://github.com/quic-go/quic-go/issues/4178 for example.

marten-seemann avatar Dec 19 '23 16:12 marten-seemann

I'm getting no errors with our dnsproxy (using [email protected]) and a QUIC upstream on qemu with Ubuntu 16.04 (kernel 4.4, like a few people here have). Can you post which code you're currently using to test this?

ainar-g avatar Dec 19 '23 17:12 ainar-g

I wasn't able to reproduce it with 16.04, only with 14.04.

You can use the example client in the quic-go repo: go run example/client/main.go https://google.com. That should be sufficient to trigger the error.

marten-seemann avatar Dec 20 '23 09:12 marten-seemann

Hey there!

I'm trying to run a DoQ server with both the latest release and the latest beta (v0.108.0-b.52) on port 853 within OPNSense 24.1_1 (FreeBSD OPNsense.home 13.2-RELEASE-p9 FreeBSD 13.2-RELEASE-p9 stable/24.1-n254969-8659880248c SMP amd64), but I'm currently running into the same issue.

Trying to run with ECN disabled (sudo QUIC_GO_DISABLE_ECN=true /usr/local/AdGuardHome/AdGuardHome -c /usr/local/AdGuardHome/AdGuardHome.yaml -v) still gives the following output:

2024/02/04 14:08:53.187286 94054#4715 [error] accepting quic stream: INTERNAL_ERROR (local): write udp [::]:853->192.168.1.252:60887: sendmsg: invalid argument
2024/02/04 14:08:53.187354 94054#4715 [debug] closing quic conn 192.168.1.1:853 with code 0

I'm currently testing with kdig and the following command kdig -d +quic -p 853 -t A @192.168.3.1 gitlab.com and the following output:

;; DEBUG: Querying for owner(gitlab.com.), class(1), type(1), server(192.168.3.1), port(853), protocol(UDP)
;; WARNING: QUIC, peer took too long to respond
;; DEBUG: retrying server 192.168.3.1@853(UDP)
;; WARNING: QUIC, peer took too long to respond
;; DEBUG: retrying server 192.168.3.1@853(UDP)
;; WARNING: QUIC, peer took too long to respond
;; ERROR: failed to query server 192.168.3.1@853(UDP)

If there's anything else I can provide to help with debugging, please let me know!

ToasterDEV avatar Feb 04 '24 20:02 ToasterDEV

@Freekers is this still an issue with the most recent version?

overwatch3560 avatar Apr 20 '24 13:04 overwatch3560