https_dns_proxy + dnsmasq + systemd-resolved dnssec failure

I'm facing a problem with my DNS resolution chain that leads to dnssec validation failures on some names:

My Raspberry Pi (192.168.0.254 in the following examples) runs https_dns_proxy on port 5053 to connect to Cloudflare's DoH service. dnsmasq uses the proxy service on port 5053 as its upstream DNS server and offers a classic DNS service for the internal network on port 53. So far so normal.

One of my computers (192.168.0.2, aka "grey") running Gentoo ended up using systemd-resolved. I guess this got pulled in through systemd defaults at some point. Via dhcp it learns to talk to said dnsmasq on 192.168.0.254:53. This mostly works, except that some domains fail to resolve, with systemd-resolved complaining about dnssec:

$ dig bmeia.gv.at

; <<>> DiG 9.18.31 <<>> bmeia.gv.at
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 7213
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;bmeia.gv.at.                   IN      A

;; Query time: 341 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Sat May 03 15:53:27 EAT 2025
;; MSG SIZE  rcvd: 40

$ resolvectl query bmeia.gv.at
bmeia.gv.at: resolve call failed: DNSSEC validation failed: no-signature (Network Error)

journalctl -f has the following complaint:

May 03 15:52:37 grey systemd-resolved[127851]: [🡕] DNSSEC validation failed for question bmeia.gv.at IN DNSKEY: no-signature
May 03 15:52:37 grey systemd-resolved[127851]: [🡕] DNSSEC validation failed for question bmeia.gv.at IN A: no-signature
May 03 15:52:37 grey systemd-resolved[127851]: [🡕] DNSSEC validation failed for question bmeia.gv.at IN AAAA: no-signature

Almost all other domains work fine, including dnssec validation:

$ resolvectl query go.dnscheck.tools
go.dnscheck.tools: 2a01:4f8:1c1e:84c3::1       -- link: wlan0
                   116.203.95.251              -- link: wlan0

-- Information acquired via protocol DNS in 1.2003s.
-- Data is authenticated: yes; Data was acquired via local or encrypted transport: no
-- Data from: network
stefan@grey ~ $ resolvectl query badsig.go.dnscheck.tools
badsig.go.dnscheck.tools: resolve call failed: DNSSEC validation failed: invalid (DNSSEC Bogus: failed to verify badsig.go.dnscheck.tools. A: using DNSKEY ids = [35243])

The two names I found were bmeia.gv.at and www.thecitizen.co.tz. There might of course be more.

Doing one of the following works around the issue:

Set DNSSEC=no in /etc/systemd/resolved.conf, but this obviously breaks dnssec.
Cut out https_dns_proxy and make dnsmasq talk directly to e.g. 1.1.1.1 port 53 (see the footnote below though)
Cut out dnsmasq and let systemd-resolved talk to https_dns_proxy directly
Cut out systemd-resolved and let the glibc resolver talk to dnsmasq->https_dns_proxy->cloudflare

So the following chain fails: dig -> systemd-resolved -> dnsmasq -> https_dns_proxy -> cloudflare

The following chains work:

dig -> systemd-resolved -> dnsmasq -> cloudflare
dig -> systemd-resolved -> https_dns_proxy -> cloudflare
dig -> dnsmasq -> https_dns_proxy -> cloudflare

The first two of those also show "Data is authenticated: yes" for bmeia.gv.at when I use resolvectl. www.thecitizen.co.tz does not seem to have a dnssec signature. The last one (without systemd-resolved) shows a dnssec record in dig +dnssec for bmeia.gv.at.

Other systems I am using (Android, Mac OS, Linux without systemd-resolved) work fine.

I built https-dns-proxy from source, git commit 0e074b40f3. For the examples here I ran it with ./https_dns_proxy -b 192.168.0.1,1.0.0.1 -r https://cloudflare-dns.com/dns-query -4 -l - -a 0.0.0.0 -p 5053 as root. The normal startup service invokes it as unprivileged user and only allows connections from 127.0.0.1 of course.

Footnote: Any unencrypted DNS traffic is intercepted by my ISP and answered by its own servers to implement a dumb filter to semi-satisfy legal requirements. I don't care too much about the filtering, but the ISP's DNS server is unresponsive every now and then, making my entire connection unstable for stupid reasons. Hence the use of DNS over HTTPS.

May 03 '25 13:05 stefand

Hi Stefan! First of all, thank you for the great problem description! I'm not a dnssec expert at all, and it is a really good question what could go wrong through such a long chain. I'm sorry, but I can't help you out with this issue. I just did not want to leave you unanswered! Sadly we know that this proxy has plenty of limitations (like no serving over TCP, issues with IPv6, etc.), but it works well in most cases. Best regards, Balázs

May 11 '25 21:05 baranyaib90

Hi, Thanks for the reply and no worries.

The issue is certainly confusing - yesterday I finally migrated my main router to OpenWRT and set up https-dns-proxy there with the builtin packages + LuCI web ui, and interestingly this setup does not have this problem - even though it is the same systemd-resolved -> dnsmasq -> https_dns_proxy -> cloudflare chain. I suspect there's some difference in the dnsmasq settings. I skimmed over it and played around with proxy-dnssec and no dnssec in dnsmasq, but neither made a difference.

Also some external change happened to www.thecitizen.co.tz - it now seems to resolve successfully even with my dnsmasq->https_dns_proxy chain on the raspberry pi. bmeia.gv.at is still broken though. Mysteries :-)

There's no acute problem to solve anyhow. I set up my firefox to talk to Cloudflare DoH directly, so web browsing works de facto. Devices of other people who use my wifi (i.e. guests with their phones) don't run systemd-resolved.

May 12 '25 08:05 stefand

The problem came back with bmeia.gv.at with https_dns_proxy+dnsmasq provided by OpenWRT. While searching for hints I came across https://github.com/systemd/systemd/issues/34896 - it looks similar, although in that case it is about DNS over TLS implemented directly by systemd-resolved.

Jul 15 '25 09:07 stefand

I'm curious about what this might be also. Have you checked system time on all systems involved (including timezone)? DNSSEC depends on time and if you see it work and not work, my mind immediately jumps to potential time configuration issues.

Jul 20 '25 13:07 aarond10

I don't think it is a time problem - the clock on all my systems is reasonably well set.

$ date && ssh pi date && ssh root@tplink date && resolvectl query bmeia.gv.at
So 20. Jul 19:45:23 EAT 2025
Sun 20 Jul 19:45:29 EAT 2025
Sun Jul 20 19:45:29 EAT 2025
bmeia.gv.at: resolve call failed: DNSSEC validation failed: failed-auxiliary (RRSIG Missing: for DNSKEY at., id = 1253)

("pi" is the raspberry pi that was previously running https-dns-proxy + dnsmasq. It is no longer involved in the lookup as I have migrated everything to openwrt running on host "tplink").

It seems that DNSSEC is a tricky subject, and systemd-resolved has plenty of DNSSEC related bugs open. https://github.com/systemd/systemd/issues/35126 and https://github.com/systemd/systemd/issues/35112 sound suspicious

Jul 20 '25 16:07 stefand

I'm attaching a zip file with 3 Wireshark captures:

proxy-fail.pcapng is the "bad" setup: systemd-resolved -> dnsmasq -> https_dns_proxy -> cloudflare
1111.pcapng is a working case: systemd-resolved -> dnsmasq -> cloudflare
8888.pcapng is another working case: systemd-resolved -> dnsmasq -> google

In all 3 I executed "resolvectl query bmeia.gv.at" on 192.168.0.2. The resolvectl config (and everything else on this machine) is unchanged - the only difference is what upstream server dnsmasq on 192.168.0.1 talks to.

captures.zip

Jul 20 '25 17:07 stefand

One difference I spot between the proxy-fail case and 1111 is the initial response to DNSKEY bmeia.gv.at. In the bad case the response is 1476 bytes long. In the good case it is only 113 bytes. I don't really know anything about how DNS or DNSSEC work, but my layman's reading is that in the bad case the server sends an entire certificate chain at once, whereas in the good case it sends only the leaf certificate. In both cases systemd-resolved proceeds to ask for DNSKEY entries for "gv.at" and "at", and the responses at least have the same length in the good and bad cases.

I tried a few other upstream DoH servers that openwrt had in its preconfigured list: Google, LibreDNS and Applied Privacy DNS. They all fail in the same way, so it is not as simple as Cloudflare giving a bad response.

systemd-resolved talking directly to 1.1.1.1 either unencrypted or via DoT resolves bmeia.gv.at correctly. systemd-resolved doesn't support DoH, so I can't do an apples to apples comparison here.

Jul 20 '25 18:07 stefand

Which dnsmasq package is used on OpenWRT? IIRC only dnsmasq-full provides DNSSEC capabilities. It can be checked with dnsmasq -v.

Jul 20 '25 18:07 bjmi

I have dnsmasq-full, which was necessary for some dhcp settings - in particular setting classless-static-routes for a few tagged clients.

In the past I experimented with the dnssec and proxy-dnssec options in dnsmasq and they did not make a difference. I can retry if you think a particular configuration might change the behavior.

Jul 20 '25 18:07 stefand

Hi, an addition to "In the bad case the response is 1476 bytes long.": in proxy-fail.pcap the request has "dns.rr.udp_payload_size == 1472", so the response is larger than the client accepts. (Funny that the reply carries "dns.rr.udp_payload_size == 1232", so the server would like to receive smaller requests, yet it sends a larger response than the client asked for.) In this situation the DNS response should be truncated, or the whole exchange should go over TCP. The current https_dns_proxy does not support TCP yet (see https://github.com/aarond10/https_dns_proxy/issues/186). But if you have the time to test the beta TCP feature, please compile the proxy from my master code, e.g. https://github.com/aarond10/https_dns_proxy/commit/95056ba76f46b9c4a893bbd67e924e45e4771944 In my experience, when dnsmasq receives a larger response than acceptable, it replies with SERVFAIL. Br, Balázs
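
The advertised size mentioned here can also be read straight from a raw DNS message. The following is a minimal sketch, not code from https_dns_proxy; the function names are made up for illustration. It relies on the EDNS0 OPT pseudo-record (type 41) reusing its CLASS field for the sender's UDP payload size, as RFC 6891 specifies.

```python
import struct

OPT_TYPE = 41  # EDNS0 pseudo-record type (RFC 6891)


def _skip_name(msg: bytes, pos: int) -> int:
    """Return the offset just past the domain name that starts at pos."""
    while True:
        length = msg[pos]
        if length == 0:              # root label terminates the name
            return pos + 1
        if length & 0xC0 == 0xC0:    # 2-byte compression pointer also ends it
            return pos + 2
        pos += 1 + length


def edns_udp_payload_size(msg: bytes):
    """Return the UDP payload size advertised in the OPT record, or None."""
    qdcount, ancount, nscount, arcount = struct.unpack_from("!4H", msg, 4)
    pos = 12                                     # fixed DNS header length
    for _ in range(qdcount):
        pos = _skip_name(msg, pos) + 4           # skip QTYPE and QCLASS
    for _ in range(ancount + nscount + arcount):
        pos = _skip_name(msg, pos)
        rtype, rclass, _ttl, rdlength = struct.unpack_from("!HHIH", msg, pos)
        if rtype == OPT_TYPE:
            return rclass                        # CLASS field holds the size
        pos += 10 + rdlength
    return None


# Hand-built query: "example. IN A" plus an OPT record advertising 1232 bytes.
query = (
    struct.pack("!6H", 0x1234, 0x0100, 1, 0, 0, 1)
    + b"\x07example\x00" + struct.pack("!HH", 1, 1)
    + b"\x00" + struct.pack("!HHIH", OPT_TYPE, 1232, 0, 0)
)
print(edns_udp_payload_size(query))
```

Comparing this value from the request with `len()` of the UDP response is enough to spot the situation described in this comment.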

Jul 20 '25 20:07 baranyaib90

I'll give TCP a try in the evening. Note that my captures capture the systemd-resolved <-> dnsmasq communication, not dnsmasq <-> https_dns_proxy. I think both systemd-resolved and dnsmasq support DNS via TCP, so I'll try it there too.

Jul 21 '25 08:07 stefand

Yes, it seems systemd_resolved<->dnsmasq<->https_dns_proxy talking via TCP solves the problem - although I am not sure I did the test entirely correctly.

I built @baranyaib90's fork (at commit 24dd1ea39, which happens to be the current master), ran it and checked that something is listening at tcp:5053. Afaiu dnsmasq repeats incoming tcp requests over tcp, and udp over udp, and systemd-resolved prefers tcp if it gets replies. So running that https-dns-proxy, I can query the address that has caused me pain:

# systemctl restart systemd-resolved
# resolvectl query bmeia.gv.at
bmeia.gv.at: 80.120.70.125                     -- link: wlan0

Now I kill https-dns-proxy and start it with -T 0 to disable TCP support. I restart dnsmasq and systemd-resolved to wipe their caches and try again:

# systemctl restart systemd-resolved
# resolvectl query bmeia.gv.at
bmeia.gv.at: resolve call failed: DNSSEC validation failed: failed-auxiliary (RRSIG Missing: for DNSKEY at., id = 1253)

Mystery solved? My layman's interpretation is this: https_dns_proxy talks to Cloudflare via TCP, so Cloudflare sends a huge blob of certificates because why not. dnsmasq tries to retrieve and forward them via UDP and the huge blob gets truncated either between https_dns_proxy and dnsmasq or dnsmasq and systemd_resolved. systemd_resolved tries to validate the cert chain and fails.

Does that make sense? Is there anything https_dns_proxy can do about this, either tell Cloudflare to cut it short despite having a stream connection or better forwarding the certificates via UDP?

Fwiw my upstream internet is ipv4 only, which can in theory send oversized UDP/IP packets with fragmentation. My OpenWRT router and Linux box give themselves link-local ipv6 addresses, so they might talk via fragmentation-not-allowed ipv6 - but I don't think they do:

# resolvectl 
Global
           Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=allow-downgrade/unsupported
    resolv.conf mode: stub
Fallback DNS Servers: 1.1.1.1#cloudflare-dns.com 8.8.8.8#dns.google 1.0.0.1#cloudflare-dns.com 8.8.4.4#dns.google 2606:4700:4700::1111#cloudflare-dns.com
                      2001:4860:4860::8888#dns.google 2606:4700:4700::1001#cloudflare-dns.com 2001:4860:4860::8844#dns.google

Link 2 (eth0)
    Current Scopes: none
         Protocols: -DefaultRoute +LLMNR +mDNS -DNSOverTLS DNSSEC=allow-downgrade/supported
     Default Route: no

Link 3 (wlan0)
    Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6 mDNS/IPv4 mDNS/IPv6
         Protocols: +DefaultRoute +LLMNR +mDNS -DNSOverTLS DNSSEC=allow-downgrade/unsupported
Current DNS Server: 192.168.0.1
       DNS Servers: 192.168.0.1
        DNS Domain: doe.home
     Default Route: yes

And when I filed the bug my router wasn't running OpenWRT yet, and there was no ipv6 in sight - neither into the public internet nor link-local internally.

Jul 21 '25 21:07 stefand

Hi, first of all: with DoH the server (Cloudflare) may send a response of up to 64k. Plain UDP DNS requests and responses should not exceed the limit in effect: 512, 1232, 1472, 4096 bytes, etc., depending on the client's EDNS0 option. When dnsmasq received a larger UDP response than it requested from https_dns_proxy, it most likely should have responded with SERVFAIL to systemd_resolved. The modified proxy from my master has 2 improvements:

it truncates large responses (so something gets thrown out of it) and sets the truncation flag on the response
it supports TCP, so larger responses from DoH upstream servers can be served correctly.

When a DNS client receives a truncated response, it should ignore it and retry over TCP. That works for responses of up to 65k.
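
The 65k ceiling follows from how DNS is carried over TCP: each message is prefixed with a two-byte length field (RFC 1035, section 4.2.2). A minimal client-side sketch, with hypothetical helper names and no actual network I/O, could look like this:

```python
import struct

TC_FLAG = 0x0200  # "truncated" bit in the 16-bit DNS header flags word


def is_truncated(response: bytes) -> bool:
    """True if the TC bit is set, meaning the client should retry over TCP."""
    (flags,) = struct.unpack_from("!H", response, 2)
    return bool(flags & TC_FLAG)


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix a DNS message with its 2-byte big-endian length."""
    if len(message) > 0xFFFF:
        raise ValueError("DNS over TCP cannot carry more than 65535 bytes")
    return struct.pack("!H", len(message)) + message


def unframe_from_tcp(stream: bytes) -> bytes:
    """Extract the first complete DNS message from buffered TCP data."""
    (length,) = struct.unpack_from("!H", stream, 0)
    body = stream[2:2 + length]
    if len(body) != length:
        raise ValueError("incomplete message, read more data first")
    return body


# A bare 12-byte header with QR, TC, RD and RA set, as a truncated reply would have.
reply = struct.pack("!6H", 0x1234, 0x8380, 0, 0, 0, 0)
print(is_truncated(reply), len(frame_for_tcp(reply)))
```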

IPv6 is independent of DNS request sizes. So that should not be the problem here.

Br, Balázs

Jul 22 '25 06:07 baranyaib90

I had some freebie agentic coding credits with kilocode and so I was playing with a TCP client mode and some minor security improvements also. I am happy to consider a human-made patch over an AI one though.

I am curious about the DNS requests you used to test this though. Could you share test vectors that we could use? Going forward, I would like to try to find some time to improve test coverage for this project that doesn't depend on external data but in the meantime having some good test requests for cases like this would be useful.
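
One way to get a reproducible request without depending on a capture is to build the query bytes directly. The sketch below is only an illustration: the helper names are hypothetical, and the parameters (a DNSKEY query for bmeia.gv.at with an advertised payload size of 1472 and the DNSSEC OK bit set) are assumptions based on what was reported earlier in this thread, not an exact copy of what systemd-resolved sends.

```python
import struct

DNSKEY, OPT = 48, 41
DO_BIT = 0x8000  # "DNSSEC OK" flag, top bit of the 16-bit EDNS flags field


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def build_query(name: str, qtype: int, udp_size: int,
                query_id: int = 0, dnssec_ok: bool = True) -> bytes:
    """Build a wire-format query with a single EDNS0 OPT record."""
    # Flags 0x0100 = recursion desired; one question, one additional record.
    header = struct.pack("!6H", query_id, 0x0100, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    # OPT "TTL" layout: extended RCODE (8 bits), version (8 bits), flags (16 bits).
    ttl = DO_BIT if dnssec_ok else 0
    opt = b"\x00" + struct.pack("!HHIH", OPT, udp_size, ttl, 0)
    return header + question + opt


query = build_query("bmeia.gv.at", DNSKEY, 1472)
print(len(query), query.hex())
```

Sending these bytes over UDP to the proxy and checking whether the reply length exceeds 1472 would exercise the case discussed in this issue, although the size of the upstream answer can of course change over time.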

Jul 28 '25 00:07 aarond10

The high-level user inputs are the resolvectl query and dig commands sent to systemd-resolved in the first post. captures.zip has the systemd_resolved<->dnsmasq interaction.

I haven't yet had the time and opportunity to capture the dnsmasq<->https_dns_proxy interaction, I guess this would be the requests you are most interested in. I'll try it tonight - my family is out of the house, so I should be able to reduce unrelated chatter on my router.

Jul 28 '25 12:07 stefand

I finally got a capture on my router. In the attached file you see the systemd-resolved (running on 192.168.0.8) talking to dnsmasq (192.168.0.1:53), which is then in turn talking to https-dns-proxy (127.0.0.1:5053 aka 192.168.0.1:5053).

dns.pcapng.gz

Other IPs you will see in this log:

10.158.36.120 is the IP of my wan interface. Yeah, CGNAT sucks.
104.16.248.249 looks like the Cloudflare DoH server https-dns-proxy is talking to. Obviously the data there is encrypted.
192.168.0.6 is the machine that was running the remote capture via sshdump. Afaics those packets got filtered out correctly.

Random things I spotted are that systemd-resolved tries to connect to dnsmasq via TCP, which I think the latter doesn't respond to, so there are TCP retransmissions of the SYN packet. dnsmasq tries to talk to https-dns-proxy via TCP and (correctly) gets a RST back instantly. There are fragmented UDP packets going out.

Aug 10 '25 18:08 stefand

Based on the previous observation that systemd-resolved talking directly to https-dns-proxy (without dnsmasq in between) works, I tried to set up a firewall rule on my router to reject attempts to talk to tcp:53. This made the query for bmeia.gv.at fail faster and with a lot less chatter, but it still failed.

Re payload sizes, it seems that systemd-resolved sets a max payload size of 1472 bytes. dnsmasq says 1232. https-dns-proxy sends back one packet with 1482 bytes and another one with 1675. On top of sending a larger than requested answer, both https-dns-proxy and dnsmasq set a payload size of 1232 bytes in their huge answers.

Aug 10 '25 18:08 stefand

Hi, I had the time to check your capture and observations and I see your point. I think that DoH servers ignore the requested EDNS0 UDP buffer size option (https://datatracker.ietf.org/doc/html/rfc6891#section-6.2.3), since DoH works over TCP. So this makes sense. The problem is that https_dns_proxy "downgrades" the reply to UDP, and without my changes (reply truncation and TCP serving) the proxy replies with a too large response. dnsmasq does not tolerate that, but systemd-resolved does. I recommend waiting until my changes are merged to master (or using your own build of my changes), and your problem should be solved. Br, Balázs

Aug 16 '25 20:08 baranyaib90