Inconsistent EDE data in dnstap

Open johnhtodd opened this issue 9 months ago • 2 comments

  • Program: dnsdist
  • Issue type: Bug report

Short description

Inconsistent messages in dnstap for EDE versus what is provided in query response

Environment

  • Operating system: redacted
  • Software version: redacted
  • Software source: pdns-rec 4.9.5, dnsdist (recent - less than 30 days, IIRC)

I've been looking at DNSSEC errors (coincidentally, in Amsterdam) for a day or so, trying to figure out which classes of errors handed back in EDE result in a SERVFAIL towards the end user. I've trimmed down the error set by excluding "No reachable authority" errors (which are rampant).

Here is the set from 24 hours excluding "no reachable authority", from a small sub-section of our AMS cluster.

┌─event.responseData.opt.ede.purpose─┬─errortype─┐
│ ['Network Error']                  │        11 │
│ ['Unsupported DNSKEY Algorithm']   │      1520 │
│ ['Signature Expired']              │     37544 │
│ ['DNSKEY Missing']                 │     49717 │
│ ['RRSIGs Missing']                 │     56164 │
│ ['NSEC Missing']                   │     60216 │
│ ['Synthesized']                    │     75301 │
│ ['DNSSEC Bogus']                   │     78056 │
│ ['Other Error']                    │    277917 │
└────────────────────────────────────┴───────────┘

So what are all those "other error" items? This seems to be an unusually large number in the "catchall" category.
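To put a number on "unusually large", here is a rough sketch that tallies the counts from the table above and computes the share that falls into "Other Error". The dictionary is transcribed by hand from the table; the variable names are just illustrative.

```python
# Counts copied from the 24-hour table above ("No Reachable Authority" already excluded).
ede_counts = {
    "Network Error": 11,
    "Unsupported DNSKEY Algorithm": 1520,
    "Signature Expired": 37544,
    "DNSKEY Missing": 49717,
    "RRSIGs Missing": 56164,
    "NSEC Missing": 60216,
    "Synthesized": 75301,
    "DNSSEC Bogus": 78056,
    "Other Error": 277917,
}

total = sum(ede_counts.values())
other_share = ede_counts["Other Error"] / total

print(f"total events: {total}")
print(f"'Other Error' share: {other_share:.1%}")
```

By this tally, "Other Error" accounts for roughly 44% of the remaining events, more than any three of the specific categories combined.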

I dug into this a bit, and I need some sanity checking, or perhaps this is a bug.

I found a domain that is coming up with "other error" as reported in the dnstap data set - tracker.publicbt.com. There are ~6000 of those in one of my logfiles, so I figured it would be a good test.

When I look at dnsviz, this is a "refused" error, and sure enough when I do a "dig" I get a no reachable authority result:

jtodd@dev01:~$ dig @9.9.9.9 tracker.publicbt.com

; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> @9.9.9.9 tracker.publicbt.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 14781
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; EDE: 22 (No Reachable Authority): (delegation publicbt.com)
;; QUESTION SECTION:
;tracker.publicbt.com.		IN	A

;; Query time: 0 msec
;; SERVER: 9.9.9.9#53(9.9.9.9) (UDP)
;; WHEN: Sun May 26 16:48:38 UTC 2024
;; MSG SIZE  rcvd: 78

jtodd@dev01:~$

But when I look through the dnstap logs, I find that these responses are not listed as "no reachable authority" but instead show up as "other error" (info code 0). I find no events in the dnstap output that show "no reachable authority" for that name, even though the name appears hundreds of times. All of the errors are "other error", which does not match what I see in my actual query results.
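For reference, this is how I understand the two values differ on the wire. Per RFC 8914, the EDE option payload is a 16-bit INFO-CODE in network byte order followed by optional UTF-8 EXTRA-TEXT. The sketch below decodes such a payload. The `decode_ede` helper and the name table are my own illustration (limited to the codes mentioned in this issue), not dnsdist or dnstap code, and the sample payload assumes the extra text is exactly what dig printed inside the parentheses.

```python
import struct

# Subset of RFC 8914 info codes that appear in this issue (not the full registry).
EDE_NAMES = {
    0: "Other Error",
    1: "Unsupported DNSKEY Algorithm",
    6: "DNSSEC Bogus",
    7: "Signature Expired",
    9: "DNSKEY Missing",
    10: "RRSIGs Missing",
    12: "NSEC Missing",
    22: "No Reachable Authority",
    23: "Network Error",
}

def decode_ede(option_data: bytes):
    """Decode an EDE option payload into (info_code, name, extra_text)."""
    if len(option_data) < 2:
        raise ValueError("EDE payload must contain a 2-byte info code")
    (info_code,) = struct.unpack("!H", option_data[:2])
    extra_text = option_data[2:].decode("utf-8", errors="replace")
    name = EDE_NAMES.get(info_code, f"unknown ({info_code})")
    return info_code, name, extra_text

# Hypothetical payload matching what dig shows for tracker.publicbt.com.
seen_in_dig = struct.pack("!H", 22) + b"delegation publicbt.com"
# Hypothetical payload matching what the dnstap records report (info code 0).
seen_in_dnstap = struct.pack("!H", 0)

print(decode_ede(seen_in_dig))
print(decode_ede(seen_in_dnstap))
```

If the dnstap frame carries the same response bytes that went to the client, decoding the EDE option from the frame should give code 22 here, not 0.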

I am collecting the data from dnstap, which is sent by dnsdist. pdns-rec is of course behind dnsdist, along with (as usual) unbound, which we currently do not have sending EDE results (so unbound answers never appear with any EDE data set and are not considered in my searches).

Is this a dnsdist error with dnstap? Or is this a method problem?

johnhtodd, May 26 '24 17:05